[reliability] Daily Reliability Review - 2026-06-05 #37221

### Executive Summary

Overall workflow health for the last 24 hours is **healthy and stable**. Of **390 runs** with a recorded outcome, **384 succeeded and 6 failed**, a **1.54% failure rate**. That is about 0.4 percentage points *below* the 7-day baseline of **1.95% (54 / 2,768 runs)**. This is **normal behavior, not a regression**.

The highest-signal reliability concern is **observability instrumentation**, not runtime behavior: several core attributes used for failure analysis are missing or unindexed in the spans dataset, and the `errors` and `logs` datasets are completely empty for the window. Runtime outcomes are only recoverable through the `gh-aw.run.status` attribute; `span.status`, `gen_ai.response.finish_reasons`, and `release` are null across the board. As a result, **truncation / runaway-token detection is inconclusive** for this window.

All 6 failures came from the **copilot** engine (also the highest-volume engine), spread across 3 workflows.
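
The rate comparison above is plain arithmetic and can be reproduced as follows. This is a minimal sketch that uses only the counts quoted in this summary; the helper name `failure_rate` is hypothetical and not part of any gh-aw tooling.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the failure rate as a percentage of runs with a recorded outcome."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


# Counts quoted in the Executive Summary.
today = failure_rate(failed=6, total=390)        # last 24 hours
baseline = failure_rate(failed=54, total=2768)   # 7-day baseline

# Difference in percentage points; negative means today is below baseline.
delta_pp = today - baseline

print(f"today={today:.2f}% baseline={baseline:.2f}% delta={delta_pp:+.2f} pp")
```

With these inputs the script prints `today=1.54% baseline=1.95% delta=-0.41 pp`.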

### Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P2 | PR Sous Chef | 3 failed runs (half of today's 6 failures) | `gh-aw.run.status:failure`, 3 distinct traces; e.g. `b9907154...` | Inspect the 3 run logs; check for a shared failure mode |
| P2 | Issue Monster | 2 failed runs | 2 distinct traces; e.g. `3c20bfad...` (run-level failure, mixed sub-step outcomes) | Review run logs for the failing step |
| P3 | Daily Hippo Learn | 1 failed run | 1 distinct trace | Spot-check; likely one-off |
| P3 | All workflows | **Instrumentation gap**: `span.status`, `gen_ai.response.finish_reasons`, `release` null across all 17,602 spans; `errors` + `logs` datasets empty | aggregate `count()` per field = null; `finish_reasons:length` filter returns 0 rows | Fix emit-side → backend mapping (see Recommendations) |
| P4 | PR Sous Chef | Latency outlier inside a failed run: one `gen_ai` span ran **391.5s (~6.5 min)** vs typical 30–55s | span `f6123b39a1a47df6`, max 391,497 ms, trace `b9907154...` | Confirm whether the long model call caused/accompanied the failure (possible timeout) |

> Note: "failed runs" counts **distinct traces** (`count_unique(trace)`), not raw failure spans — a single failed run propagates the `failure` status onto several child spans (24 failure spans collapse to 6 runs).
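
The collapse from failure spans to failed runs described in the note can be sketched as below. The span rows and trace IDs are hypothetical stand-ins; only the `trace` field and the `gh-aw.run.status` attribute name come from this report.

```python
from collections import defaultdict

# Hypothetical span rows; real rows would come from the spans dataset.
spans = [
    {"trace": "trace-a", "gh-aw.run.status": "failure"},
    {"trace": "trace-a", "gh-aw.run.status": "failure"},
    {"trace": "trace-a", "gh-aw.run.status": "failure"},
    {"trace": "trace-b", "gh-aw.run.status": "failure"},
    {"trace": "trace-c", "gh-aw.run.status": "success"},
    {"trace": "trace-c", "gh-aw.run.status": None},  # attribute not set on this span
]


def failed_runs(rows):
    """Group failure spans by trace so each failed run is counted once."""
    spans_per_trace = defaultdict(int)
    for row in rows:
        if row.get("gh-aw.run.status") == "failure":
            spans_per_trace[row["trace"]] += 1
    return dict(spans_per_trace)


by_trace = failed_runs(spans)
failure_spans = sum(by_trace.values())  # raw failure spans in the sample
failed_run_count = len(by_trace)        # distinct failed runs in the sample
```

Counting keys rather than summing values is what keeps a run with many failing child spans from being counted more than once.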

### Representative Traces

<details>
<summary>View representative traces</summary>

**PR Sous Chef — failed run with latency outlier**
- Trace: [`b9907154a0c80b7659146dff67d3dcca`](https://github.sentry.io/explore/traces/trace/b9907154a0c80b7659146dff67d3dcca)
- Continuity verified: the trace contains a coherent span tree (`gen_ai`, multiple `http.client` spans of ~10s, `http.server`). Run-level `gh-aw.run.status = failure`.
- Outlier: `gen_ai` span `f6123b39a1a47df6` ran **391,497 ms (~6.5 min)** — the largest single span observed in the window. Surrounding `gen_ai` spans in the same run were 36–55s.
- Second PR Sous Chef failure trace: [`b862e649d88f8cceecec5593fafe9313`](https://github.sentry.io/explore/traces/trace/b862e649d88f8cceecec5593fafe9313)

**Issue Monster — failed run, mixed sub-step outcomes**
- Trace: [`3c20bfadb4a90cdcfbd56da0d4accfa7`](https://github.sentry.io/explore/traces/trace/3c20bfadb4a90cdcfbd56da0d4accfa7)
- Continuity verified: 75 spans reported (the per-operation counts listed sum to 79: 36 `http.server`, 20 `default`, 13 `http.client`, 10 `gen_ai`). Within the run, `gen_ai` sub-steps were mixed (4 `failure`, 2 `success`, 4 unset) but the **run-level outcome is `failure`**.

**Truncation / `finish_reasons:length`**
- Query `gen_ai.response.finish_reasons:length` over 24h returned **0 rows** — see Notes; this is an instrumentation/indexing gap, not evidence of "no truncation".

</details>

### Recommendations

Smallest useful fixes first:

1. **Map runtime outcome to a queryable status field.** OTLP `status.code` (ERROR=2) is emitted on the conclusion span (`send_otlp_span.cjs` builds `status: { code }`), but Sentry's `span.status` is null for all 17,602 spans. Until the backend maps OTLP status → `span.status`, **dashboards/alerts should key off `gh-aw.run.status`** (works: success/failure), not `span.status`.
2. **Make `gen_ai.response.finish_reasons` queryable.** It is emitted as an OTLP **array** attribute (`buildArrayAttr(... [effectiveStopReason])`, `send_otlp_span.cjs:2092`). Array-valued span attributes aren't indexed for filter/group-by here, so truncation/length can't be detected. Consider **also emitting a scalar** mirror (e.g. `gen_ai.response.finish_reason` single string) for query/alerting.
3. **Restore `release` correlation.** `release` is null for all spans. The version is emitted as the resource attribute `service.version`, and only when it is known (`scopeVersion && scopeVersion !== "unknown"`, `send_otlp_span.cjs:323-325`). Verify that the agent job sets a real version and confirm the Sentry `service.version → release` mapping; without it, regression-by-release analysis is impossible.
4. **Confirm errors/logs export path.** The `errors` and `logs` datasets returned **0 events** for 24h. If those signals are expected to flow to this project, the exporter/routing should be checked; if they're intentionally not exported, document that so empty datasets aren't mistaken for "no problems."
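
To make recommendation 2 concrete, the sketch below shows the payload shape of an array-valued attribute next to a scalar mirror, using the OTLP/JSON attribute encoding. It is written in Python purely for illustration; the real emitter is `send_otlp_span.cjs`. The helper names, the mirror key `gen_ai.response.finish_reason`, and the comma join for multiple reasons are assumptions, and whether the backend indexes the scalar form still needs to be verified.

```python
def array_attr(key, values):
    """Build an OTLP/JSON array-valued string attribute."""
    return {
        "key": key,
        "value": {"arrayValue": {"values": [{"stringValue": v} for v in values]}},
    }


def string_attr(key, value):
    """Build an OTLP/JSON scalar string attribute."""
    return {"key": key, "value": {"stringValue": value}}


def finish_reason_attrs(stop_reasons):
    """Return the existing array attribute plus a scalar mirror.

    The mirror key and the comma join are illustrative assumptions,
    not an established convention.
    """
    attrs = [array_attr("gen_ai.response.finish_reasons", stop_reasons)]
    if stop_reasons:
        # Only emit the mirror when there is something to mirror.
        attrs.append(
            string_attr("gen_ai.response.finish_reason", ",".join(stop_reasons))
        )
    return attrs


attrs = finish_reason_attrs(["length"])
```

Keeping the array attribute alongside the mirror leaves existing consumers untouched while giving queries and alerts a plain string to filter on.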

### Notes

<details>
<summary>View notes</summary>

**Tooling / environment**
- The referenced skill file `skills/otel-queries/SKILL.md` does **not exist** in the repo — followed the embedded query loop instead.
- This Sentry MCP build does **not** expose `search_events` or `get_trace_details`. Fell back to `list_events` (spans/errors/logs datasets) and verified trace continuity via `trace:<id>` filtered queries, as instructed.
- `max()/p95()` aggregations on `span.duration` returned empty in this build; latency evidence was taken from sorted raw spans (`sort=-span.duration`) instead — the 391.5s value is a directly observed span duration.
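
The sorted-raw-spans fallback can be sketched as follows. Only the 391,497 ms value is taken from this report; the other durations are hypothetical stand-ins for the 36–55s spans in the same run, and the 5x threshold is an arbitrary illustrative choice.

```python
from statistics import median

# Durations in milliseconds. 391_497 is the outlier reported above; the other
# values are hypothetical stand-ins for the 36-55 s spans in the same run.
durations_ms = [36_200, 41_800, 48_500, 391_497, 54_900]

# Equivalent of sort=-span.duration: largest first.
ranked = sorted(durations_ms, reverse=True)
slowest = ranked[0]
typical = median(durations_ms)

# Flag the slowest span when it exceeds a multiple of the median.
THRESHOLD = 5
is_outlier = slowest > THRESHOLD * typical
ratio = slowest / typical
```

The median is used as the reference point because, unlike the mean, it is not pulled upward by the outlier itself.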

**Confirmed instrumentation gaps (separate from runtime failures)**
- `errors` dataset: **0 events** (24h).
- `logs` dataset: **0 events** (24h).
- `span.status`: **null for all 17,602 spans**.
- `gen_ai.response.finish_reasons`: **null for all 3,496 `gen_ai` spans**; `finish_reasons:length` filter → 0 rows (array attribute, not indexed).
- `release`: **null for all spans**.
- Attribute-name caveat: workflow identity is emitted as **`gh-aw.workflow.name`** (dotted). The variant `gh_aw.workflow_name` is null for all 17,616 spans — queries must use the dotted form.
- `gh-aw.otlp.export_*` / `gh-aw.run.error` attributes were null → **no OTLP export failures surfaced** this window (good).

**Inconclusive (do not over-claim)**
- **Truncation / runaway token usage: inconclusive.** Cannot be confirmed or ruled out because `gen_ai.response.finish_reasons` is unindexed and token-usage aggregations were unavailable. Reported as *confirmed instrumentation gap*, not a confirmed truncation event.
- The 391.5s `gen_ai` span co-occurs with a failed run but causation (timeout vs. unrelated failure) is **not** confirmed without `span.status` / finish-reason data.

**Engine distribution (24h, by span volume):** copilot 2,230 · claude 717 · codex 338 · gemini 64 · pi 40 · antigravity 40. All 6 failing runs used **copilot** — consistent with it being the highest-volume engine, so not necessarily an engine-specific defect.
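
As a rough plausibility check on the engine observation, the sketch below computes copilot's share of span volume and the chance that all 6 failed runs land on copilot. It assumes each failure independently picks an engine in proportion to span volume; span volume is only a proxy for run volume, so the result is indicative at best.

```python
# Engine distribution by span volume, as listed above.
span_volume = {
    "copilot": 2230,
    "claude": 717,
    "codex": 338,
    "gemini": 64,
    "pi": 40,
    "antigravity": 40,
}

total = sum(span_volume.values())
copilot_share = span_volume["copilot"] / total

# Probability that every failed run is a copilot run under the
# proportional-and-independent assumption stated in the lead-in.
FAILED_RUNS = 6
p_all_copilot = copilot_share ** FAILED_RUNS

print(f"copilot share={copilot_share:.1%} p(all {FAILED_RUNS} on copilot)={p_all_copilot:.3f}")
```

Under these assumptions the probability comes out between 0.07 and 0.08. With only six failed runs, that neither points at copilot nor clears it, which fits the "not necessarily engine-specific" reading.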

**Scope:** spans dataset, project `github/gh-aw`, `statsPeriod=24h`, org `github` (us region). Totals: 17,602 spans; 390 runs with a recorded outcome.

**References:**
- [§27045012306](https://github.com/github/gh-aw/actions/runs/27045012306)

</details>







> Generated by [🚨 Daily Reliability Review](https://github.com/github/gh-aw/actions/runs/27045012306) · agent 173.9 AIC · threat-detection 12.2 AIC · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-reliability-review%22&type=issues)
> - [x] expires on Jun 7, 2026, 11:22 PM UTC
