docs(devnet-debug): harden log/severity triage, Dora access, and runbook flow #124
parithosh wants to merge 5 commits into
…ook flow

Improvements to the devnet debugging runbooks and the query skill, driven by running them against live devnets (glamsterdam-devnet-4, bal-devnet-7):

- Sandbox session reuse: create one session and reuse it across steps so /workspace (and the debug report) persists and the 10-session limit isn't hit.
- sync_distance interpretation: head_slot + sync_distance ~= wall-clock slot, to distinguish a healthy propagation spread from a real split / stalled finality.
- Dora-tolerance: keep Dora as the primary starting point but wrap calls so a 500/panic (e.g. integer-divide-by-zero on degraded networks) falls through to the RPC baseline instead of aborting.
- Severity anchoring: match the UPPERCASE level token (or logfmt level=error) case-sensitively and exclude DEBUG/TRACE, instead of substring-matching "error" (which returned ~28k benign DEBUG hits on a healthy network).
- ANSI stripping: strip terminal colour codes in a bounded `clean` CTE before severity matching (colour-wrapped tokens otherwise hid ~49% of real errors); warn against stripped-Body regex over wide/historical windows (bypasses the idx_body skip-index, S3 beyond 7d). Use raw strings so \b/\x1b survive.
- Bootnode exclusion: default `host.name != 'bootnode-1'` for cross-host triage.
- Dora /forks split detection: the 403 is Cloudflare Error 1010 (browser integrity), not auth; direct HTTP from the sandbox needs a browser User-Agent. Fixed the query skill note, the runbook step, and the three UA-less httpx examples in modules/dora/examples.yaml.
- Loki references reframed as deployment-specific (ethpandaops devnets ship logs to ClickHouse external.otel_logs; a loki module still exists for deployments that enable it) in the query skill and the sandbox package docstring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
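The sync_distance rule in the commit message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 12-second slot time, the 2-slot tolerance, and the three labels are hypothetical and are not taken from the runbooks.

```python
import time
from typing import Optional


def wall_clock_slot(genesis_time: int, now: Optional[float] = None,
                    seconds_per_slot: int = 12) -> int:
    """Slot the network should be at according to the local clock.

    seconds_per_slot=12 is an assumption; devnets may use another value.
    """
    if now is None:
        now = time.time()
    return max(0, int(now - genesis_time) // seconds_per_slot)


def classify_sync(head_slot: int, sync_distance: int, wall_slot: int,
                  tolerance: int = 2) -> str:
    """Apply head_slot + sync_distance ~= wall-clock slot to one node.

    Labels are hypothetical:
      clock_mismatch - the node's implied current slot disagrees with ours
      at_head        - node is within `tolerance` slots of the wall clock
      lagging        - node agrees on the wall clock but its head is behind
    """
    implied = head_slot + sync_distance
    if abs(implied - wall_slot) > tolerance:
        return "clock_mismatch"
    if sync_distance <= tolerance:
        return "at_head"
    return "lagging"
```

Under this reading, several nodes reporting `at_head` with slightly different head slots would look like ordinary propagation spread, while many nodes reporting `lagging` at the same head slot would be worth checking for a split or stalled finality.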
Condense the added runbook/skill prose (session reuse, Dora-tolerance, sync_distance, ANSI/severity, bootnode, Dora UA) without dropping any technical content — these files load into LLM context, so fewer tokens. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tro logs example

The intro 'Logs' example hardcoded a stale devnet (fusaka-devnet-0) and a made-up host, and was overloaded into a bounded ANSI-strip severity query. Make it a plain recent-lines fetch (network + Timestamp + LIMIT, no host) with a <network> placeholder; the severity + bounded-host pattern stays in the 'Severity triage' section where it belongs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… cmd form

Audit of the debugging-path examples found several stale points that would derail an investigation (all verified against the live API):

- dora/examples.yaml: get_network_overview has no active_validator_count or finalized_epoch (actual keys: current_epoch, current_slot, finalized [bool], participation_rate). Fixed the network-overview KeyError, rewrote the 'check finality' example (finalized is a bool, not an epoch; scan get_epoch for the last finalized epoch), and corrected get_epoch participation (globalparticipationrate) and get_validator activationepoch.
- ethnode/examples.yaml: replaced the long-dead dencun-devnet-12 network and its instance names with <network>/<instance> placeholders + a discovery hint.
- runbooks: normalize 'panda datasources list' -> 'panda datasources' (list was silently ignored).

Note: the get_validators 'by status' example is also broken (invalid status value + empty results); left as-is pending a closer look rather than guessing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
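The "scan get_epoch for the last finalized epoch" step described in this commit could look roughly like the sketch below. It assumes a caller-supplied `get_epoch` callable returning a dict with a boolean `finalized` key; that key name, the function name, and the 64-epoch lookback are assumptions made for illustration, not the Dora module's confirmed interface. The PR description further down says the server now exposes `finalized_epoch` directly, which would make such a scan unnecessary.

```python
from typing import Callable, Optional


def last_finalized_epoch(get_epoch: Callable[[int], dict],
                         current_epoch: int,
                         max_lookback: int = 64) -> Optional[int]:
    """Walk backwards from current_epoch and return the newest finalized epoch.

    get_epoch is injected so the sketch stays independent of any real client;
    the 'finalized' key is an assumed field name.
    """
    stop = max(current_epoch - max_lookback, -1)  # exclusive lower bound
    for epoch in range(current_epoch, stop, -1):
        if get_epoch(epoch).get("finalized") is True:
            return epoch
    return None  # nothing finalized inside the lookback window
```

Returning `None` rather than 0 keeps "no finality in the window" distinguishable from "finalized at genesis".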
…unbook-improvements

# Conflicts:
# .claude/skills/query/SKILL.md
🤖 qu0b-reviewer

The bug is confirmed. Let me write it up precisely.

Summary

The PR hardens log/severity triage in runbooks, adds Dora User-Agent headers for Cloudflare compatibility, refreshes ethnode examples, and tightens prose. A subsequent commit in this PR introduced a breaking field-name typo in the Dora validator example that will cause the LLM to reason from

Issues
Thanks for the review — but this 🔴 finding is a false positive, and applying the suggested change would reintroduce the bug it's trying to fix. I verified against the live Dora API (not the source) via So:
The original line was Dora's JSON uses lowercased, no-underscore keys throughout — the same convention behind the other field corrections in this PR, all verified live on
Reasoning from
```sql
AND match(clean, '(^|[][ |])(CRIT|ERRO|ERROR|FATAL|PANIC)($|[][ |:])|^(ERR|FAT)\b|\blevel=(crit|error|fatal|panic)\b')
AND NOT match(clean, '(^|[][ |])(DEBUG|DBG|TRACE|TRC)($|[][ |:])|\blevel=(debug|trace)\b')
... LIMIT 200
```
seems odd to have something like this in a runbook
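For readers weighing the two `match` conditions in the diff fragment above, the following Python sketch replays the same patterns on a few sample lines. The sample log lines, the `ANSI_RE` pattern, and the `is_severe` helper are made up for illustration, and Python's `re` stands in for ClickHouse's RE2 engine, so edge-case behaviour may differ from the real query.

```python
import re

# Assumed shape of colour codes: ESC [ <digits and semicolons> m
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

# Patterns copied from the diff; raw strings keep \b and \x1b intact.
SEVERE_RE = re.compile(
    r"(^|[][ |])(CRIT|ERRO|ERROR|FATAL|PANIC)($|[][ |:])"
    r"|^(ERR|FAT)\b|\blevel=(crit|error|fatal|panic)\b"
)
NOISE_RE = re.compile(
    r"(^|[][ |])(DEBUG|DBG|TRACE|TRC)($|[][ |:])|\blevel=(debug|trace)\b"
)


def is_severe(body: str) -> bool:
    """Strip colour codes, then require a severe token and no DEBUG/TRACE token."""
    clean = ANSI_RE.sub("", body)
    return bool(SEVERE_RE.search(clean)) and not NOISE_RE.search(clean)


samples = [
    "2024-01-01 ERROR failed to import block",   # plain uppercase token
    "DEBUG peer returned error code 3",          # lowercase 'error' inside DEBUG
    "\x1b[31mERROR\x1b[0m db corrupted",         # colour-wrapped token
    'level=error msg="boom"',                    # logfmt form
]
for line in samples:
    print(is_severe(line), repr(line))
```

The colour-wrapped sample is the case the ANSI strip exists for: without it, `ERROR` is preceded by the `m` of the escape sequence and the anchored pattern does not fire.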
Summary
Improves the devnet debugging runbooks and related examples so they match how the live ethpandaops devnet data is actually exposed: Dora for chain state, ClickHouse for hosted/local OTel logs, and ethnode/RPC as the fallback when Dora is unavailable or unhealthy.
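The "Dora first, ethnode/RPC as fallback" flow summarised above might be wrapped along these lines. This is a minimal sketch: `chain_state_with_fallback` and both callables are hypothetical placeholders supplied by the caller, not functions from the sandbox package.

```python
from typing import Any, Callable, Tuple


def chain_state_with_fallback(dora_call: Callable[[], Any],
                              rpc_call: Callable[[], Any]) -> Tuple[str, Any]:
    """Try Dora first; on any failure fall through to the RPC baseline.

    Returns (source, result) so the report can record which path answered.
    """
    try:
        return "dora", dora_call()
    except Exception as exc:  # e.g. an HTTP 500 or a server-side panic
        print(f"Dora unavailable ({exc!r}); falling back to RPC baseline")
        return "rpc", rpc_call()
```

Tagging the result with its source is a deliberate choice here: a later step can then tell whether a number came from Dora or from the fallback path.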
Rationale
- … `external.otel_logs`, not Loki, so LogQL is not the query path for these runbooks.
- `SeverityNumber`/`SeverityText` are populated for many rows, but some real non-bootnode CL error lines still have empty structured severity and only carry `ERROR` in `Body`.
- … `Body` parsing only as a fallback.
- … `finalized` as the current epoch's boolean, which led examples to hardcode network-specific workarounds. The server now exposes `finalized_epoch` directly.

What changed

- … `Body` fallback only when needed.
- … `LIMIT`; bootnode is excluded from hosted cross-host triage by default.
- … `match(Body, '(?i)error')` patterns.
- … `finalized_epoch`, `finalized_slot`, and `epochs_since_finality`; simplified Dora examples so they use advertised networks instead of hardcoded `sepolia`.
- … `finalized` boolean is not confused with the finalized epoch.

Validation

- … `clickhouse-raw.external.otel_logs` severity coverage for recent hosted logs; coverage is mixed across networks and includes real empty-severity CL error rows.
- … `panda datasources`.
- `go test ./runbooks ./modules/clickhouse ./pkg/server ./pkg/cli ./modules/dora`