docs(devnet-debug): harden log/severity triage, Dora access, and runbook flow#124

Open
parithosh wants to merge 5 commits into master from docs/devnet-debug-runbook-improvements

@parithosh parithosh commented Jun 4, 2026

Member

Summary

Improves the devnet debugging runbooks and related examples so they match how the live ethpandaops devnet data is actually exposed: Dora for chain state, ClickHouse for hosted/local OTel logs, and ethnode/RPC as the fallback when Dora is unavailable or unhealthy.
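
A minimal sketch of the "Dora first, RPC as fallback" flow described above. `fetch_from_dora` and `fetch_from_rpc` are hypothetical stand-ins, not real module functions; the only idea illustrated is wrapping the primary call so a failure falls through instead of aborting the investigation.

```python
from typing import Any, Callable, Tuple


def with_fallback(
    primary: Callable[[], Any],
    fallback: Callable[[], Any],
) -> Tuple[str, Any]:
    """Try the primary source; on any exception, use the fallback.

    Returns (source_label, value) so a report can record which path was used.
    """
    try:
        return "primary", primary()
    except Exception as exc:  # e.g. an HTTP 500 surfaced as an exception
        print(f"primary source failed ({exc!r}); using fallback")
        return "fallback", fallback()


# Hypothetical stand-ins: a Dora call that fails and an RPC call that works.
def fetch_from_dora() -> dict:
    raise RuntimeError("500 Internal Server Error")


def fetch_from_rpc() -> dict:
    return {"head_slot": 1234}


source, overview = with_fallback(fetch_from_dora, fetch_from_rpc)
```

Returning the source label alongside the value keeps it visible in the debug report that the baseline came from RPC rather than Dora.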

Rationale

  • Hosted devnet container logs are available through ClickHouse external.otel_logs, not Loki, so LogQL is not the query path for these runbooks.
  • Live OTel severity coverage is mixed. SeverityNumber / SeverityText are populated for many rows, but some real non-bootnode CL error lines still have empty structured severity and only carry ERROR in Body.
  • The docs now prefer structured severity first, require a quick coverage check for the target slice, and use bounded Body parsing only as a fallback.
  • Dora overview previously exposed finalized as the current epoch's boolean, which led examples to hardcode network-specific workarounds. The server now exposes finalized_epoch directly.
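
The "structured severity first, Body only as a fallback" rule can be sketched on rows that have already been fetched. This is an illustration under stated assumptions: rows are plain dicts using the `SeverityText`, `SeverityNumber` and `Body` column names mentioned above, the sample rows are made up, and the fallback regex is a simplified stand-in rather than the runbook's actual pattern.

```python
import re
from typing import Iterable, Mapping, Optional

# Simplified fallback: an uppercase level token at the start of Body or after
# whitespace. Case-sensitive on purpose, so lowercase "error" does not match.
_BODY_LEVEL = re.compile(r"(?:^|\s)(ERROR|FATAL|CRIT|PANIC)(?:\s|:|$)")


def severity_coverage(rows: Iterable[Mapping[str, object]]) -> float:
    """Fraction of rows with a non-empty SeverityText or non-zero SeverityNumber."""
    rows = list(rows)
    if not rows:
        return 0.0
    covered = sum(
        1 for r in rows if r.get("SeverityText") or r.get("SeverityNumber")
    )
    return covered / len(rows)


def effective_severity(row: Mapping[str, object]) -> Optional[str]:
    """Prefer the structured field; parse Body only when it is empty."""
    text = str(row.get("SeverityText") or "").upper()
    if text:
        return text
    match = _BODY_LEVEL.search(str(row.get("Body") or ""))
    return match.group(1) if match else None


# Made-up sample rows for illustration only.
rows = [
    {"SeverityText": "ERROR", "SeverityNumber": 17, "Body": "failed to import block"},
    {"SeverityText": "", "SeverityNumber": 0, "Body": "ERROR beacon sync stalled"},
    {"SeverityText": "", "SeverityNumber": 0, "Body": "DEBUG error count is zero"},
]
coverage = severity_coverage(rows)
levels = [effective_severity(r) for r in rows]
```

With these rows only one in three carries structured severity, so the fallback is needed for the second row, while the third (a DEBUG line mentioning "error") is correctly left unclassified.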

What changed

  • Reworked devnet/local-devnet log triage to use structured OTel severity fields first, with a scoped Body fallback only when needed.
  • Added severity coverage checks to the runbooks so agents can prove whether fallback parsing is necessary for the target host/service.
  • Kept broad cross-host log scans bounded by network/enclave, host/service, time window, and LIMIT; bootnode is excluded from hosted cross-host triage by default.
  • Updated ClickHouse examples so semantic search does not keep suggesting broad match(Body, '(?i)error') patterns.
  • Updated Dora overview to include finalized_epoch, finalized_slot, and epochs_since_finality; simplified Dora examples so they use advertised networks instead of hardcoded sepolia.
  • Updated CLI overview labels so the raw finalized boolean is not confused with the finalized epoch.
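
To show why a broad `(?i)error` match is discouraged, here is a rough Python port of an anchored level match, modelled on the pattern quoted in this PR's review thread. It is a sketch, not the runbook's query: ClickHouse `match()` and Python `re` are different engines, and the sample log lines are invented.

```python
import re

# ANSI colour sequences such as ESC[31m; the raw string keeps \x1b intact
# for the regex engine.
ANSI_CSI = re.compile(r"\x1b\[[0-9;]*m")

# Uppercase level tokens delimited by brackets, spaces or pipes, or logfmt level=...
LEVEL = re.compile(
    r"(?:^|[\[\] |])(?:CRIT|ERRO|ERROR|FATAL|PANIC)(?:$|[\[\] |:])"
    r"|^(?:ERR|FAT)\b"
    r"|\blevel=(?:crit|error|fatal|panic)\b"
)
NOISE = re.compile(
    r"(?:^|[\[\] |])(?:DEBUG|DBG|TRACE|TRC)(?:$|[\[\] |:])"
    r"|\blevel=(?:debug|trace)\b"
)


def is_error_line(body: str) -> bool:
    """Strip colour codes, then require an error token and no debug/trace token."""
    clean = ANSI_CSI.sub("", body)
    return bool(LEVEL.search(clean)) and not NOISE.search(clean)


# Invented sample lines for illustration.
samples = [
    "\x1b[31mERROR\x1b[0m failed to connect",   # colour-wrapped token
    "DEBUG no error found",                      # benign substring hit
    "ts=1 level=error msg=boom",                 # logfmt
    "[ERROR] peer dropped",                      # bracketed token
    "DEBUG upstream replied ERROR: retrying",    # debug line quoting an error
]
results = [is_error_line(s) for s in samples]
naive_hit = bool(re.search(r"(?i)error", samples[1]))
```

The naive substring search flags the benign DEBUG line, while the anchored match skips it and still catches the colour-wrapped token once the escape codes are stripped.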

Validation

  • Queried live clickhouse-raw.external.otel_logs severity coverage for recent hosted logs; coverage is mixed across networks and includes real empty-severity CL error rows.
  • Verified LogQL is not available for the hosted devnet container-log path via panda datasources.
  • Added server coverage for Dora overview finalized epoch handling.
  • go test ./runbooks ./modules/clickhouse ./pkg/server ./pkg/cli ./modules/dora

parithosh and others added 5 commits June 4, 2026 12:54
…ook flow

Improvements to the devnet debugging runbooks and the query skill, driven by
running them against live devnets (glamsterdam-devnet-4, bal-devnet-7):

- Sandbox session reuse: create one session and reuse it across steps so
  /workspace (and the debug report) persists and the 10-session limit isn't hit.
- sync_distance interpretation: head_slot + sync_distance ~= wall-clock slot, to
  distinguish a healthy propagation spread from a real split / stalled finality.
- Dora-tolerance: keep Dora as the primary starting point but wrap calls so a
  500/panic (e.g. integer-divide-by-zero on degraded networks) falls through to
  the RPC baseline instead of aborting.
- Severity anchoring: match the UPPERCASE level token (or logfmt level=error)
  case-sensitively and exclude DEBUG/TRACE, instead of substring-matching
  "error" (which returned ~28k benign DEBUG hits on a healthy network).
- ANSI stripping: strip terminal colour codes in a bounded `clean` CTE before
  severity matching (colour-wrapped tokens otherwise hid ~49% of real errors);
  warn against stripped-Body regex over wide/historical windows (bypasses the
  idx_body skip-index, S3 beyond 7d). Use raw strings so \b/\x1b survive.
- Bootnode exclusion: default `host.name != 'bootnode-1'` for cross-host triage.
- Dora /forks split detection: the 403 is Cloudflare Error 1010 (browser
  integrity), not auth — direct HTTP from the sandbox needs a browser
  User-Agent. Fixed the query skill note, the runbook step, and the three
  UA-less httpx examples in modules/dora/examples.yaml.
- Loki references reframed as deployment-specific (ethpandaops devnets ship logs
  to ClickHouse external.otel_logs; a loki module still exists for deployments
  that enable it) in the query skill and the sandbox package docstring.
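
The sync_distance relation above (head_slot + sync_distance ~= wall-clock slot) can be checked with a few lines of arithmetic. A minimal sketch, assuming 12-second slots (devnets may be configured differently) and a hypothetical tolerance of two slots; the node values are made up.

```python
def wall_clock_slot(now: int, genesis_time: int, seconds_per_slot: int = 12) -> int:
    """Slot the network should be at according to the clock (never negative)."""
    return max(0, (now - genesis_time) // seconds_per_slot)


def sync_gap(head_slot: int, sync_distance: int, expected_slot: int) -> int:
    """How far head_slot + sync_distance falls short of the wall-clock slot."""
    return expected_slot - (head_slot + sync_distance)


def looks_consistent(
    head_slot: int, sync_distance: int, expected_slot: int, tolerance: int = 2
) -> bool:
    """True when the node's own view adds up to roughly the wall-clock slot."""
    return abs(sync_gap(head_slot, sync_distance, expected_slot)) <= tolerance


genesis = 1_700_000_000            # made-up genesis time
now = genesis + 12 * 1000 + 5      # five seconds into slot 1000
expected = wall_clock_slot(now, genesis)

node_a_ok = looks_consistent(head_slot=998, sync_distance=2, expected_slot=expected)
node_b_ok = looks_consistent(head_slot=900, sync_distance=0, expected_slot=expected)
```

Node A is two slots behind and says so, which is what ordinary propagation spread looks like. Node B reports sync_distance 0 while sitting 100 slots behind the clock, which is the kind of mismatch that warrants a closer look.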

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Condense the added runbook/skill prose (session reuse, Dora-tolerance,
sync_distance, ANSI/severity, bootnode, Dora UA) without dropping any
technical content — these files load into LLM context, so fewer tokens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tro logs example

The intro 'Logs' example hardcoded a stale devnet (fusaka-devnet-0) and a
made-up host, and was overloaded into a bounded ANSI-strip severity query.
Make it a plain recent-lines fetch (network + Timestamp + LIMIT, no host) with
a <network> placeholder; the severity + bounded-host pattern stays in the
'Severity triage' section where it belongs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… cmd form

Audit of the debugging-path examples found several stale points that would
derail an investigation (all verified against the live API):

- dora/examples.yaml: get_network_overview has no active_validator_count or
  finalized_epoch (actual keys: current_epoch, current_slot, finalized [bool],
  participation_rate). Fixed the network-overview KeyError, rewrote the
  'check finality' example (finalized is a bool, not an epoch — scan get_epoch
  for the last finalized epoch), and corrected get_epoch participation
  (globalparticipationrate) and get_validator activationepoch.
- ethnode/examples.yaml: replaced the long-dead dencun-devnet-12 network and
  its instance names with <network>/<instance> placeholders + a discovery hint.
- runbooks: normalize 'panda datasources list' -> 'panda datasources' (list was
  silently ignored).
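
The "scan get_epoch for the last finalized epoch" approach mentioned in the list above can be sketched as a bounded backwards walk. `get_epoch` here is a hypothetical callable returning a dict with a boolean `finalized` key; the stub and the `max_lookback` bound are assumptions for illustration, not the example file's actual code.

```python
from typing import Callable, Mapping, Optional


def last_finalized_epoch(
    current_epoch: int,
    get_epoch: Callable[[int], Mapping[str, object]],
    max_lookback: int = 64,
) -> Optional[int]:
    """Walk back from current_epoch until an epoch reports finalized=True.

    Returns None if nothing within max_lookback epochs is finalized, so the
    caller can tell "not found" apart from "epoch 0".
    """
    lowest = max(0, current_epoch - max_lookback)
    for epoch in range(current_epoch, lowest - 1, -1):
        if get_epoch(epoch).get("finalized"):
            return epoch
    return None


# Stub standing in for a real lookup: epochs up to 97 are finalized.
def fake_get_epoch(epoch: int) -> dict:
    return {"epoch": epoch, "finalized": epoch <= 97}


current = 100
finalized = last_finalized_epoch(current, fake_get_epoch)
epochs_since_finality = None if finalized is None else current - finalized
```

Bounding the walk keeps the number of calls predictable on a network where finality has been stalled for a long time.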

Note: the get_validators 'by status' example is also broken (invalid status
value + empty results); left as-is pending a closer look rather than guessing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…unbook-improvements

# Conflicts:
#	.claude/skills/query/SKILL.md

qu0b-reviewer Bot commented Jun 4, 2026


🤖 qu0b-reviewer

The bug is confirmed. Let me write it up precisely.


Summary

The PR hardens log/severity triage in runbooks, adds Dora User-Agent headers for Cloudflare compatibility, refreshes ethnode examples, and tightens prose. A subsequent commit in this PR introduced a breaking field-name typo in the Dora validator example that will cause the LLM to reason from N/A values.

Issues

  • 🔴 modules/dora/examples.yaml:92: validator.get('activationepoch', 'N/A') reads a key that doesn't exist in Dora's /api/v1/validator/%s response. dora.get_validator is a transparent passthrough (handleDoraDataGetPassthrough → raw Dora JSON, verified at pkg/server/operations_dora.go:30,172–201). Dora's validator endpoint returns activation_epoch in snake_case, so the lookup silently falls through to the default and every activation-epoch report reads "N/A" with no indication of failure. The 'N/A' default is likely also masking genuinely empty/null responses; even if that is intentional, the key is wrong.

    Correct line (matching Dora's actual API surface):

    print(f"  Activation epoch: {validator.get('activation_epoch', 'N/A')}")

Reviewed @ 86858328
"Beware of bugs in the above code; I have only proved it correct, not tried it." — Donald Knuth

@parithosh

Member Author

Thanks for the review — but this 🔴 finding is a false positive, and applying the suggested change would reintroduce the bug it's trying to fix.

I verified against the live Dora API (not the source) via dora.get_validator("hoodi", "1"):

has 'activationepoch'  (no underscore): True  -> 0
has 'activation_epoch' (snake_case)   : False -> None

So:

  • activationepoch (what this PR changed the line to) is the real key — present.
  • activation_epoch (the snake_case suggested above) does not exist in the response → it is the one that silently falls through to N/A.

The original line was activation_epoch; this PR's change to activationepoch is the fix. The suggested "correct line" would revert it.

Dora's JSON uses lowercased, no-underscore keys throughout — the same convention behind the other field corrections in this PR, all verified live on hoodi:

| call | key used | present |
| --- | --- | --- |
| get_validator | activationepoch | ✅ (activation_epoch ❌) |
| get_epoch | globalparticipationrate | ✅ (validator_participation ❌) |
| get_network_overview | current_epoch, current_slot, finalized, participation_rate | ✅ (active_validator_count, finalized_epoch ❌) |

Reasoning from handleDoraDataGetPassthrough is right that it's a transparent passthrough — but the upstream Dora JSON keys are lowercased-concatenated, not snake_case, so the passthrough returns activationepoch.
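
The disagreement above comes down to a `.get(key, 'N/A')` default hiding a wrong key. A small sketch of a louder alternative: the helper names are hypothetical, and the sample payload is illustrative, with only `activationepoch` taken from the live check quoted in this comment.

```python
from typing import Any, Iterable, Mapping, Tuple


def require_key(payload: Mapping[str, Any], key: str) -> Any:
    """Return payload[key], or fail loudly and list the keys that do exist."""
    if key not in payload:
        raise KeyError(f"{key!r} not in response; available keys: {sorted(payload)}")
    return payload[key]


def first_present(
    payload: Mapping[str, Any], candidates: Iterable[str]
) -> Tuple[str, Any]:
    """Return (key, value) for the first candidate key present in payload."""
    candidates = list(candidates)
    for key in candidates:
        if key in payload:  # membership test, so a value of 0 still counts
            return key, payload[key]
    raise KeyError(f"none of {candidates} in response; available keys: {sorted(payload)}")


# Illustrative payload; real responses carry many more fields.
validator = {"index": 1, "activationepoch": 0}

key, value = first_present(validator, ["activation_epoch", "activationepoch"])
print(f"  Activation epoch: {value} (key: {key})")
```

Testing membership rather than truthiness matters here because the activation epoch in the check above is 0, which an `or 'N/A'` style default would also swallow.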

Comment thread modules/dora/examples.yaml
```
AND match(clean, '(^|[][ |])(CRIT|ERRO|ERROR|FATAL|PANIC)($|[][ |:])|^(ERR|FAT)\b|\blevel=(crit|error|fatal|panic)\b')
AND NOT match(clean, '(^|[][ |])(DEBUG|DBG|TRACE|TRC)($|[][ |:])|\blevel=(debug|trace)\b')
... LIMIT 200
```

Member

seems odd to have something like this in a runbook

Member Author

should be cleaner now

3 participants