Executive Summary
Workflow health over the last 24 hours is stable. Of 390 runs with a recorded outcome, 384 succeeded and 6 failed, a 1.54% failure rate that is slightly below the 7-day baseline of 1.95% (54 / 2,768 runs). This is normal behavior, not a regression.
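The rates quoted above are plain ratios. A minimal sketch of the arithmetic, using only the counts stated in this summary (the variable names are illustrative, not part of any tooling):

```python
# Counts quoted in the summary above.
failed_24h, total_24h = 6, 390
failed_7d, total_7d = 54, 2768

rate_24h = failed_24h / total_24h  # fraction of 24h runs that failed
rate_7d = failed_7d / total_7d     # 7-day baseline failure fraction

# Failures the 24h window would have produced at the baseline rate.
expected_failures = total_24h * rate_7d

print(f"24h failure rate: {rate_24h:.2%}")
print(f"7-day baseline:   {rate_7d:.2%}")
print(f"expected at baseline: {expected_failures:.1f}, observed: {failed_24h}")
```

At the baseline rate a 390-run window would be expected to see roughly 7.6 failures, so 6 observed failures is consistent with normal variation.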
The highest-signal reliability concern is observability instrumentation, not runtime behavior: several core attributes used for failure analysis are missing or unindexed in the spans dataset, and the errors and logs datasets are completely empty for the window. Runtime outcomes are only recoverable through the gh-aw.run.status attribute; span.status, gen_ai.response.finish_reasons, and release are null across the board. As a result, truncation / runaway-token detection is inconclusive for this window.
All 6 failures came from the copilot engine (also the highest-volume engine), spread across 3 workflows.
Top Reliability Findings
| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P2 | PR Sous Chef | 3 failed runs (half of the window's 6 failures) | gh-aw.run.status:failure, 3 distinct traces; e.g. b9907154... | Inspect the 3 run logs; check for a shared failure mode |
| P2 | Issue Monster | 2 failed runs | 2 distinct traces; e.g. 3c20bfad... (run-level failure, mixed sub-step outcomes) | Review run logs for the failing step |
| P3 | Daily Hippo Learn | 1 failed run | 1 distinct trace | Spot-check; likely one-off |
| P4 | PR Sous Chef | Latency outlier inside a failed run: one gen_ai span ran 391.5s (~6.5 min) vs typical 30–55s | span f6123b39a1a47df6, max 391,497 ms, trace b9907154... | Confirm whether the long model call caused/accompanied the failure (possible timeout) |
| P3 | All workflows | Instrumentation gap: span.status, gen_ai.response.finish_reasons, release null across all 17,602 spans; errors + logs datasets empty | aggregate count() per field = null; finish_reasons:length filter returns 0 rows | Fix emit-side → backend mapping (see Recommendations) |
Note: "failed runs" counts distinct traces (count_unique(trace)), not raw failure spans — a single failed run propagates the failure status onto several child spans (24 failure spans collapse to 6 runs).
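The collapse from failure spans to failed runs described in the note can be sketched as a group-by on trace ID. The span rows and trace IDs below are hypothetical placeholders; in practice they would come from a spans query filtered on gh-aw.run.status:failure.

```python
from collections import defaultdict

# Hypothetical failure spans (made-up trace and span IDs).
failure_spans = [
    {"trace": "trace-a", "span_id": "s1"},
    {"trace": "trace-a", "span_id": "s2"},
    {"trace": "trace-a", "span_id": "s3"},
    {"trace": "trace-b", "span_id": "s4"},
    {"trace": "trace-b", "span_id": "s5"},
]

def failed_runs(spans):
    """Group failure spans by trace so that each run is counted once."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace"]].append(span["span_id"])
    return dict(by_trace)

runs = failed_runs(failure_spans)
print(f"{len(failure_spans)} failure spans -> {len(runs)} failed runs")
```

Counting raw failure spans instead of distinct traces would overstate the failure count by however many child spans inherit the run status.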
Representative Traces
PR Sous Chef — failed run with latency outlier
- Trace: b9907154a0c80b7659146dff67d3dcca
- Continuity verified: trace contains a coherent span tree (gen_ai, multiple http.client ~10s, http.server). Run-level gh-aw.run.status = failure.
- Outlier: gen_ai span f6123b39a1a47df6 ran 391,497 ms (~6.5 min), the largest single span observed in the window. Surrounding gen_ai spans in the same run were 36–55s.
- Second PR Sous Chef failure trace: b862e649d88f8cceecec5593fafe9313
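One way to flag a span like the 391.5s outlier automatically is to compare each gen_ai duration against the median of its siblings. This is a sketch under assumptions: only 391,497 ms comes from this report, the other durations are hypothetical values inside the reported 36–55s range, and the 5x threshold is an arbitrary illustrative choice.

```python
from statistics import median

# gen_ai span durations (ms) for a single run. Only 391_497 is an observed
# value; the rest are hypothetical stand-ins in the 36-55s range.
durations_ms = [36_200, 41_800, 47_500, 54_900, 391_497]

def flag_outliers(values, factor=5.0):
    """Return the values that exceed `factor` times the median of the list."""
    typical = median(values)
    return [v for v in values if v > factor * typical]

outliers = flag_outliers(durations_ms)
print(outliers)
```

A median-based threshold is used here because a single extreme value would inflate a mean-based one.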
Issue Monster — failed run, mixed sub-step outcomes
- Trace: 3c20bfadb4a90cdcfbd56da0d4accfa7
- Continuity verified: 75 spans (36 http.server, 20 default, 13 http.client, 10 gen_ai). Within the run, gen_ai sub-steps were mixed (4 failure, 2 success, 4 unset) but the run-level outcome is failure.
Truncation / finish_reasons:length
- Query gen_ai.response.finish_reasons:length over 24h returned 0 rows (see Notes); this is an instrumentation/indexing gap, not evidence of "no truncation".
Recommendations
Smallest useful fixes first:
- Map runtime outcome to a queryable status field. OTLP status.code (ERROR=2) is emitted on the conclusion span (send_otlp_span.cjs builds status: { code }), but Sentry's span.status is null for all 17,602 spans. Until the backend maps OTLP status → span.status, dashboards/alerts should key off gh-aw.run.status (works: success/failure), not span.status.
- Make gen_ai.response.finish_reasons queryable. It is emitted as an OTLP array attribute (buildArrayAttr(... [effectiveStopReason]), send_otlp_span.cjs:2092). Array-valued span attributes aren't indexed for filter/group-by here, so truncation/length can't be detected. Consider also emitting a scalar mirror (e.g. gen_ai.response.finish_reason single string) for query/alerting.
- Restore release correlation. release is null for all spans. Identity is emitted as resource attribute service.version, and only when the version is known (scopeVersion && scopeVersion !== "unknown", send_otlp_span.cjs:323-325). Verify the agent job actually sets a real version and confirm the Sentry service.version → release mapping; without it, regression-by-release analysis is impossible.
- Confirm errors/logs export path. The errors and logs datasets returned 0 events for 24h. If those signals are expected to flow to this project, the exporter/routing should be checked; if they're intentionally not exported, document that so empty datasets aren't mistaken for "no problems."
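The scalar-mirror recommendation can be illustrated by the shape of the attributes in an OTLP/JSON payload. This is a Python sketch of the payload shape only, not the code in send_otlp_span.cjs; the helper names string_attr, array_attr, and finish_reason_attrs are hypothetical, and gen_ai.response.finish_reason is the mirror name proposed above rather than an existing attribute.

```python
def string_attr(key, value):
    """OTLP/JSON key-value pair carrying a scalar string."""
    return {"key": key, "value": {"stringValue": value}}

def array_attr(key, values):
    """OTLP/JSON key-value pair carrying an array of strings."""
    items = [{"stringValue": v} for v in values]
    return {"key": key, "value": {"arrayValue": {"values": items}}}

def finish_reason_attrs(stop_reason):
    """Emit the existing array attribute plus a hypothetical scalar mirror."""
    if not stop_reason:
        return []  # nothing to report when the stop reason is unknown
    return [
        array_attr("gen_ai.response.finish_reasons", [stop_reason]),
        string_attr("gen_ai.response.finish_reason", stop_reason),
    ]

for attr in finish_reason_attrs("length"):
    print(attr)
```

Keeping the array attribute alongside the mirror preserves the existing attribute for any consumer that already reads it, while the scalar gives the backend something it can filter and group on.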
Notes
Tooling / environment
- The referenced skill file skills/otel-queries/SKILL.md does not exist in the repo, so the embedded query loop was followed instead.
- This Sentry MCP build does not expose search_events or get_trace_details. Fell back to list_events (spans/errors/logs datasets) and verified trace continuity via trace:<id> filtered queries, as instructed.
- max()/p95() aggregations on span.duration returned empty in this build; latency evidence was taken from sorted raw spans (sort=-span.duration) instead, so the 391.5s value is a directly observed span duration.
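When server-side max()/p95() aggregations are unavailable, the same statistics can be derived client-side from raw span durations. A minimal sketch, assuming the durations have already been fetched into a list; the sample values are hypothetical except for 391,497 ms, and nearest-rank is just one of several percentile definitions.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# 19 hypothetical durations (30s..48s) plus the one observed outlier, in ms.
durations_ms = list(range(30_000, 49_000, 1_000)) + [391_497]

print("max:", max(durations_ms))
print("p95:", percentile_nearest_rank(durations_ms, 95))
```

Reporting the maximum together with p95 shows whether a large value is an isolated outlier or part of a broader slow tail.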
Confirmed instrumentation gaps (separate from runtime failures)
- errors dataset: 0 events (24h).
- logs dataset: 0 events (24h).
- span.status: null for all 17,602 spans.
- gen_ai.response.finish_reasons: null for all 3,496 gen_ai spans; finish_reasons:length filter → 0 rows (array attribute, not indexed).
- release: null for all spans.
- Attribute-name caveat: workflow identity is emitted as gh-aw.workflow.name (dotted). The variant gh_aw.workflow_name is null for all 17,616 spans, so queries must use the dotted form.
- gh-aw.otlp.export_* / gh-aw.run.error attributes were null → no OTLP export failures surfaced this window (good).
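A gap check like the one summarized above can be expressed as a per-attribute count of non-null values. The span rows below are hypothetical and only mimic the pattern reported here (gh-aw.run.status populated, the other fields null or absent); the function name attribute_coverage is illustrative.

```python
def attribute_coverage(spans, fields):
    """Count the spans that carry a non-null value for each field."""
    return {
        field: sum(1 for span in spans if span.get(field) is not None)
        for field in fields
    }

# Hypothetical rows shaped like the findings in this report.
spans = [
    {"gh-aw.run.status": "success", "span.status": None, "release": None},
    {"gh-aw.run.status": "failure", "span.status": None},
    {"gh-aw.run.status": None, "span.status": None, "release": None},
]
fields = [
    "gh-aw.run.status",
    "span.status",
    "gen_ai.response.finish_reasons",
    "release",
]

coverage = attribute_coverage(spans, fields)
gaps = [field for field, count in coverage.items() if count == 0]
print(coverage)
print("fully null:", gaps)
```

Treating a missing key and an explicit null the same way matches how both show up as null in query results.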
Inconclusive (do not over-claim)
- Truncation / runaway token usage: inconclusive. Cannot be confirmed or ruled out because gen_ai.response.finish_reasons is unindexed and token-usage aggregations were unavailable. Reported as confirmed instrumentation gap, not a confirmed truncation event.
- The 391.5s gen_ai span co-occurs with a failed run but causation (timeout vs. unrelated failure) is not confirmed without span.status / finish-reason data.
Engine distribution (24h, by span volume): copilot 2,230 · claude 717 · codex 338 · gemini 64 · pi 40 · antigravity 40. All 6 failing runs used copilot — consistent with it being the highest-volume engine, so not necessarily an engine-specific defect.
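A rough back-of-the-envelope check of the engine observation, using the span counts listed above. It rests on a strong simplifying assumption that is not established by this report: that failures are independent and land on engines in proportion to span volume (span volume is only a proxy for run volume).

```python
# Span counts per engine, as listed in the engine distribution above.
engine_spans = {
    "copilot": 2230,
    "claude": 717,
    "codex": 338,
    "gemini": 64,
    "pi": 40,
    "antigravity": 40,
}

total = sum(engine_spans.values())
copilot_share = engine_spans["copilot"] / total

# Probability that 6 independent failures all land on copilot if failures
# were distributed in proportion to span volume (assumption, see lead-in).
p_all_copilot = copilot_share ** 6

print(f"copilot share of spans: {copilot_share:.1%}")
print(f"P(all 6 failures on copilot): {p_all_copilot:.3f}")
```

Under that assumption the probability comes out somewhat below 0.1, which is low but not low enough to treat the pattern as evidence of an engine-specific defect on its own.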
Scope: spans dataset, project github/gh-aw, statsPeriod=24h, org github (us region). Totals: 17,602 spans; 390 runs with a recorded outcome.
Generated by 🚨 Daily Reliability Review · agent 173.9 AIC · threat-detection 12.2 AIC