[reliability] Daily Reliability Review - 2026-06-05 #37221

@github-actions

Executive Summary

Workflow health over the last 24 hours is stable. Of 390 runs with a recorded outcome, 384 succeeded and 6 failed, a 1.54% failure rate, slightly below the 7-day baseline of 1.95% (54 / 2,768 runs). This is within normal variation, not a regression.
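The comparison above is plain arithmetic. A minimal Python sketch, using only the counts quoted in this summary (the function name is illustrative, not part of any tooling):

```python
def failure_rate(failed: int, total: int) -> float:
    """Return failed runs as a percentage of runs with a recorded outcome."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


window = failure_rate(6, 390)      # last 24 hours
baseline = failure_rate(54, 2768)  # 7-day baseline

# prints: 24h: 1.54%  7d: 1.95%  delta: -0.41 pp
print(f"24h: {window:.2f}%  7d: {baseline:.2f}%  delta: {window - baseline:+.2f} pp")
```

The delta is expressed in percentage points so that it is not confused with a relative change.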

The highest-signal reliability concern is observability instrumentation, not runtime behavior. Several core attributes used for failure analysis are missing or unindexed in the spans dataset, and the errors and logs datasets are empty for the window. Runtime outcomes are recoverable only through the gh-aw.run.status attribute; span.status, gen_ai.response.finish_reasons, and release are null on every span. As a result, truncation and runaway-token detection is inconclusive for this window.

All 6 failures came from the copilot engine (also the highest-volume engine), spread across 3 workflows.

Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
|---|---|---|---|---|
| P2 | PR Sous Chef | 3 failed runs (half of today's 6 failures) | gh-aw.run.status:failure, 3 distinct traces; e.g. b9907154... | Inspect the 3 run logs; check for a shared failure mode |
| P2 | Issue Monster | 2 failed runs | 2 distinct traces; e.g. 3c20bfad... (run-level failure, mixed sub-step outcomes) | Review run logs for the failing step |
| P3 | Daily Hippo Learn | 1 failed run | 1 distinct trace | Spot-check; likely one-off |
| P4 | PR Sous Chef | Latency outlier inside a failed run: one gen_ai span ran 391.5s (~6.5 min) vs typical 30–55s | span f6123b39a1a47df6, max 391,497 ms, trace b9907154... | Confirm whether the long model call caused or accompanied the failure (possible timeout) |
| P3 | All workflows | Instrumentation gap: span.status, gen_ai.response.finish_reasons, release null across all 17,602 spans; errors + logs datasets empty | aggregate count() per field = null; finish_reasons:length filter returns 0 rows | Fix emit-side → backend mapping (see Recommendations) |

Note: "failed runs" counts distinct traces (count_unique(trace)), not raw failure spans — a single failed run propagates the failure status onto several child spans (24 failure spans collapse to 6 runs).
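The distinction in the note above can be sketched in Python. This is an illustration only: it assumes spans are available as plain dicts carrying the gh-aw.run.status, gh-aw.workflow.name, and trace fields named in this report, and the trace IDs in the sample are hypothetical placeholders.

```python
from collections import defaultdict


def failed_runs_by_workflow(spans):
    """Count distinct failed traces per workflow, not raw failure spans."""
    traces = defaultdict(set)
    for span in spans:
        if span.get("gh-aw.run.status") == "failure":
            traces[span["gh-aw.workflow.name"]].add(span["trace"])
    return {workflow: len(ids) for workflow, ids in traces.items()}


# Hypothetical sample: five failure spans that belong to only two runs.
sample_spans = [
    {"gh-aw.workflow.name": "PR Sous Chef", "gh-aw.run.status": "failure", "trace": "trace-a"},
    {"gh-aw.workflow.name": "PR Sous Chef", "gh-aw.run.status": "failure", "trace": "trace-a"},
    {"gh-aw.workflow.name": "PR Sous Chef", "gh-aw.run.status": "failure", "trace": "trace-a"},
    {"gh-aw.workflow.name": "Issue Monster", "gh-aw.run.status": "failure", "trace": "trace-b"},
    {"gh-aw.workflow.name": "Issue Monster", "gh-aw.run.status": "failure", "trace": "trace-b"},
    {"gh-aw.workflow.name": "Daily Hippo Learn", "gh-aw.run.status": "success", "trace": "trace-c"},
]

print(failed_runs_by_workflow(sample_spans))
```

Counting raw spans on the same sample would report five failures; grouping by trace reports two failed runs, which is the figure used throughout this review.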

Representative Traces

PR Sous Chef — failed run with latency outlier

  • Trace: b9907154a0c80b7659146dff67d3dcca
  • Continuity verified: trace contains a coherent span tree (gen_ai, multiple http.client ~10s, http.server). Run-level gh-aw.run.status = failure.
  • Outlier: gen_ai span f6123b39a1a47df6 ran 391,497 ms (~6.5 min) — the largest single span observed in the window. Surrounding gen_ai spans in the same run were 36–55s.
  • Second PR Sous Chef failure trace: b862e649d88f8cceecec5593fafe9313

Issue Monster — failed run, mixed sub-step outcomes

  • Trace: 3c20bfadb4a90cdcfbd56da0d4accfa7
  • Continuity verified: 75 spans (36 http.server, 20 default, 13 http.client, 10 gen_ai). Within the run, gen_ai sub-steps were mixed (4 failure, 2 success, 4 unset) but the run-level outcome is failure.

Truncation / finish_reasons:length

  • Query gen_ai.response.finish_reasons:length over 24h returned 0 rows — see Notes; this is an instrumentation/indexing gap, not evidence of "no truncation".

Recommendations

Smallest useful fixes first:

  1. Map runtime outcome to a queryable status field. OTLP status.code (ERROR=2) is emitted on the conclusion span (send_otlp_span.cjs builds status: { code }), but Sentry's span.status is null for all 17,602 spans. Until the backend maps OTLP status → span.status, dashboards/alerts should key off gh-aw.run.status (works: success/failure), not span.status.
  2. Make gen_ai.response.finish_reasons queryable. It is emitted as an OTLP array attribute (buildArrayAttr(... [effectiveStopReason]), send_otlp_span.cjs:2092). Array-valued span attributes aren't indexed for filter/group-by here, so truncation/length can't be detected. Consider also emitting a scalar mirror (e.g. gen_ai.response.finish_reason single string) for query/alerting.
  3. Restore release correlation. release is null for all spans. Identity is emitted as resource attribute service.version, and only when the version is known (scopeVersion && scopeVersion !== "unknown", send_otlp_span.cjs:323-325). Verify the agent job actually sets a real version and confirm the Sentry service.version → release mapping; without it, regression-by-release analysis is impossible.
  4. Confirm errors/logs export path. The errors and logs datasets returned 0 events for 24h. If those signals are expected to flow to this project, the exporter/routing should be checked; if they're intentionally not exported, document that so empty datasets aren't mistaken for "no problems."
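To illustrate recommendation 2, the sketch below shows the shape of an array attribute next to a scalar mirror in OTLP JSON encoding. It is a Python illustration only: the emit-side code lives in send_otlp_span.cjs, the helper names here are hypothetical (they are not the signatures of buildArrayAttr or any other existing helper), and joining multiple reasons with a comma is an assumption rather than an agreed convention.

```python
def string_attr(key, value):
    """Build a scalar string attribute in OTLP JSON form."""
    return {"key": key, "value": {"stringValue": value}}


def array_attr(key, values):
    """Build an array-of-strings attribute in OTLP JSON form."""
    return {
        "key": key,
        "value": {"arrayValue": {"values": [{"stringValue": v} for v in values]}},
    }


def finish_reason_attrs(reasons):
    """Return the array attribute plus a scalar mirror for querying.

    Returns no attributes when the stop reason is unknown, so the
    backend does not index empty strings.
    """
    if not reasons:
        return []
    return [
        array_attr("gen_ai.response.finish_reasons", reasons),
        # Assumption: multiple reasons are comma-joined into one string.
        string_attr("gen_ai.response.finish_reason", ",".join(reasons)),
    ]


print(finish_reason_attrs(["length"]))
```

With a scalar mirror in place, a filter such as gen_ai.response.finish_reason:length would not depend on array indexing in the backend.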

Notes

Tooling / environment

  • The referenced skill file skills/otel-queries/SKILL.md does not exist in the repo — followed the embedded query loop instead.
  • This Sentry MCP build does not expose search_events or get_trace_details. Fell back to list_events (spans/errors/logs datasets) and verified trace continuity via trace:<id> filtered queries, as instructed.
  • max()/p95() aggregations on span.duration returned empty in this build; latency evidence was taken from sorted raw spans (sort=-span.duration) instead — the 391.5s value is a directly observed span duration.
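When server-side max()/p95() aggregations are unavailable, the same statistics can be derived client-side from the raw span list. A minimal sketch, assuming durations are collected in milliseconds; only the 391,497 ms value comes from this report, the other sample durations are hypothetical, and nearest-rank is one of several percentile definitions:

```python
import math


def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical durations in ms, plus the observed 391,497 ms outlier.
durations_ms = [36_000, 41_000, 48_000, 55_000, 391_497]

print("max:", max(durations_ms))
print("p50:", nearest_rank_percentile(durations_ms, 50))
print("p95:", nearest_rank_percentile(durations_ms, 95))
```

On a sample this small the p95 equals the maximum, so percentile values are only meaningful once the full span list for the window has been paged in.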

Confirmed instrumentation gaps (separate from runtime failures)

  • errors dataset: 0 events (24h).
  • logs dataset: 0 events (24h).
  • span.status: null for all 17,602 spans.
  • gen_ai.response.finish_reasons: null for all 3,496 gen_ai spans; finish_reasons:length filter → 0 rows (array attribute, not indexed).
  • release: null for all spans.
  • Attribute-name caveat: workflow identity is emitted as gh-aw.workflow.name (dotted). The variant gh_aw.workflow_name is null for all 17,616 spans — queries must use the dotted form.
  • gh-aw.otlp.export_* / gh-aw.run.error attributes were null → no OTLP export failures surfaced this window (good).

Inconclusive (do not over-claim)

  • Truncation / runaway token usage: inconclusive. It cannot be confirmed or ruled out because gen_ai.response.finish_reasons is unindexed and token-usage aggregations were unavailable. It is reported as a confirmed instrumentation gap, not a confirmed truncation event.
  • The 391.5s gen_ai span co-occurs with a failed run, but causation (timeout vs. unrelated failure) cannot be confirmed without span.status / finish-reason data.

Engine distribution (24h, by span volume): copilot 2,230 · claude 717 · codex 338 · gemini 64 · pi 40 · antigravity 40. All 6 failing runs used copilot — consistent with it being the highest-volume engine, so not necessarily an engine-specific defect.
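As a rough check on the volume argument above, the sketch below computes copilot's share of span volume and the chance that six independent failures would all land on copilot if failures were spread in proportion to that share. Both modeling choices are assumptions: span volume is only a proxy for run volume, and failures may not be independent.

```python
# Span counts per engine, as listed in the engine distribution above.
engine_spans = {
    "copilot": 2230,
    "claude": 717,
    "codex": 338,
    "gemini": 64,
    "pi": 40,
    "antigravity": 40,
}

total_spans = sum(engine_spans.values())
copilot_share = engine_spans["copilot"] / total_spans

# Assumption: each failure independently hits an engine in proportion to volume.
failed_runs = 6
p_all_copilot = copilot_share ** failed_runs

print(f"copilot share of span volume: {copilot_share:.1%}")
print(f"P(all {failed_runs} failures on copilot): {p_all_copilot:.1%}")
```

Under these assumptions the share is about 65% and the six-of-six outcome has a probability of roughly 7.6%. With only 6 failures, that neither confirms nor rules out an engine-specific defect.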

Scope: spans dataset, project github/gh-aw, statsPeriod=24h, org github (us region). Totals: 17,602 spans; 390 runs with a recorded outcome.

Generated by 🚨 Daily Reliability Review · agent 173.9 AIC · threat-detection 12.2 AIC

  • expires on Jun 7, 2026, 11:22 PM UTC
