From 96e24c14794390cb4ace1577fd022058b6c0d0f5 Mon Sep 17 00:00:00 2001
From: Thanatat Tamtan
Date: Sun, 21 Jun 2026 22:32:17 +0700
Subject: [PATCH 1/2] error: document the error.* module + error.create reporting

Rename the error-detection actions to the new error.* module across the
docs (deployment.errors -> error.list, deployment.errorGet -> error.get,
deployment.errorUpdate -> error.update) and update the permission gating:
reads by error.list / error.get, triage by error.update, all previously
deployment.logs.

Document the new error.create capability: a running deployment or SDK
reports its own application errors directly; a reported error and a
log-mined trace with the same stack signature merge into one issue (shared
fingerprint). Covers the request shape (events[] with required type, kind
defaulting to generic, frames[].func/file/line, batch up to 100), auth
(service-account key OR a me.generateToken scoped token attenuated to
error.create), the secret-safety note (notifications carry only the
exception type, never title/sample), and a concrete curl example.

Adds the error report CLI command and error.create MCP tool to the
reference pages.

Co-Authored-By: Claude Opus 4.8
Claude-Session: https://claude.ai/code/session_011d4bVuGLnCbcJD9ZvastPH
---
 content/api/conventions.md             |   3 +-
 content/automation/mcp.md              |   8 +-
 content/deployments/error-detection.md | 133 ++++++++++++++++++++-----
 3 files changed, 112 insertions(+), 32 deletions(-)

diff --git a/content/api/conventions.md b/content/api/conventions.md
index af42326..33e75e3 100644
--- a/content/api/conventions.md
+++ b/content/api/conventions.md
@@ -85,7 +85,8 @@ The big picture. Each row is a fully-qualified API function.
 | `deployment.rollback` | Re-apply an older revision as a new revision |
 | `deployment.revisions` | History of revisions |
 | `deployment.metrics` | CPU / mem / requests / egress time-series |
-| `deployment.errors` / `.errorGet` / `.errorUpdate` | [Application error issues](/deployments/error-detection/) — list, fetch, and triage |
+| `error.list` / `.get` / `.update` | [Application error issues](/deployments/error-detection/) — list, fetch, and triage |
+| `error.create` | [Report an application error](/deployments/error-detection/#reporting-errors-from-your-app--errorcreate) directly from a running deployment |

 ### Routing

diff --git a/content/automation/mcp.md b/content/automation/mcp.md
index 30349eb..f7a83c3 100644
--- a/content/automation/mcp.md
+++ b/content/automation/mcp.md
@@ -175,10 +175,10 @@ All three are read-only and return once (no streaming). See
 [Monitoring & debugging](/deployments/monitoring/#reading-logs-and-status-programmatically)
 for the contract and the `deployment.logs` permission split.

-For triaging recurring crashes, **`deployment.errors`** and
-**`deployment.errorGet`** pull up the deployment's grouped, deduplicated error
-**issues** — each with an occurrence count and a representative stack — so the
-assistant can read what's been throwing without scrolling raw logs. See
+For triaging recurring crashes, **`error.list`** and **`error.get`** pull up the
+deployment's grouped, deduplicated error **issues** — each with an occurrence count
+and a representative stack — so the assistant can read what's been throwing without
+scrolling raw logs, and **`error.create`** lets it report an error directly. See
 [application error detection](/deployments/error-detection/).

 ## Permissions & safety
diff --git a/content/deployments/error-detection.md b/content/deployments/error-detection.md
index a11a388..a93e949 100644
--- a/content/deployments/error-detection.md
+++ b/content/deployments/error-detection.md
@@ -2,8 +2,8 @@
 title: 'Application error detection'
 linkTitle: 'Error detection'
 weight: 9
-description: 'Automatic Sentry-lite error tracking — deploys.app mines your logs for stack traces and groups them into deduplicated issues with counts, a triage lifecycle, and notifications.'
-lead: 'Deploys.app reads your deployment''s durable logs for application-level stack traces — Go panics, Java/Python/Node/Ruby exceptions, plus a generic fallback — and groups identical traces into deduplicated issues. Each issue carries an occurrence count, first/last-seen, a representative stack, and recent occurrences, with an open → resolved → reopened triage lifecycle. There is nothing to instrument inside your container.'
+description: 'Automatic Sentry-lite error tracking — deploys.app mines your logs for stack traces, or lets your app report errors directly, and groups them into deduplicated issues with counts, a triage lifecycle, and notifications.'
+lead: 'Deploys.app reads your deployment''s durable logs for application-level stack traces — Go panics, Java/Python/Node/Ruby exceptions, plus a generic fallback — and groups identical traces into deduplicated issues. Your app can also report its own errors directly via `error.create`, and a reported error merges with a log-mined trace of the same signature into one issue. Each issue carries an occurrence count, first/last-seen, a representative stack, and recent occurrences, with an open → resolved → reopened triage lifecycle. There is nothing you must instrument inside your container.'
 ---

 ## What it is
@@ -30,7 +30,7 @@ infrastructure and pod layers:
 |---|---|---|
 | [`deployment.health`](/automation/notification-channels/#asynchronous-failures-deploymenthealth) / auto-error | infra | no running pods, a deployer apply failure |
 | [`deployment.status`](/deployments/monitoring/#reading-logs-and-status-programmatically) | pod | crash-loops, OOM-kills, pod conditions |
-| **this** — `deployment.errors` | **application** | **stack traces in your log output** |
+| **this** — `error.*` | **application** | **stack traces in your log output, or errors your app reports itself** |

 {{< callout type="note" >}}
 **Only stack traces become issues.** A lone `ERROR` or `FATAL` log line — one with
@@ -58,7 +58,11 @@ freshest line, read the [logs](/deployments/monitoring/) directly.
 Identical traces are grouped by a fingerprint computed from the stack frames — the
 function names and files, not the jittery line numbers or the free-text message — so
 the same bug firing a thousand times across every replica is **one** issue with
-`count: 1000`, not a thousand rows.
+`count: 1000`, not a thousand rows. The same fingerprint is shared across both
+sources: an error your app [reports directly](#reporting-errors-from-your-app--errorcreate)
+with `error.create` and a trace the platform mines from your logs land in the **same
+issue** when their stack signatures match, so reporting doesn't double-count what the
+log miner would have caught anyway.

 ## Triage lifecycle

@@ -87,7 +91,7 @@ every non-static deployment.
 - The **issue detail** shows the full representative stack and the recent
   occurrences, each linking back to that moment in the deployment's log history.
   **Resolve**, **Mute**, and **Reopen** buttons drive the
-  [lifecycle](#triage-lifecycle); they're gated by the `deployment.logs` permission.
+  [lifecycle](#triage-lifecycle); they're gated by the `error.update` permission.
 - When a deployment has never thrown, the tab reads *"No application errors
   detected."*

@@ -115,19 +119,21 @@ deploys notification create --project acme --name app-errors \
 {{< callout type="note" >}}
 The notification message carries only the exception **type** (e.g. `panic`,
 `java.lang.NullPointerException`) and a `new error:` / `error regressed:` reason —
-**never** the full title or the stack. An app's error message can embed secrets it
-logged, and a notification payload must stay secret-free. The full title and the
-sample stack live behind the `deployment.logs` permission, in the issue itself.
+**never** the full title or the sample/stack. An app's error message can embed
+secrets it logged, and a notification payload must stay secret-free. The full title
+and the sample stack live behind the `error.get` permission, in the issue itself.
 {{< /callout >}}

 ## API

-Three actions back the Errors tab. All are gated by the **`deployment.logs`**
-permission — the same one that reads [logs](/deployments/monitoring/#permissions) —
-because an issue's stack carries the same secret-bearing `stdout`. They reject
-`Static` deployments, which have no logs to mine.
+The `error.*` module backs the Errors tab. Reads are gated by their own
+permissions — `error.list` to list issues and `error.get` to fetch an issue with its
+stack — because a stack carries the same secret-bearing `stdout` as the
+[logs](/deployments/monitoring/#permissions). Triage is gated by **`error.update`**,
+and direct reporting by **`error.create`**. All reject `Static` deployments, which
+have no logs to mine.

-### `deployment.errors` — list issues
+### `error.list` — list issues

 | Param | Description |
 |---|---|
@@ -144,33 +150,33 @@ firstSeen, lastSeen, samplePod }` — plus a `nextCursor` until the list is
 exhausted.

 ```bash
-curl https://api.deploys.app/deployment.errors \
+curl https://api.deploys.app/error.list \
   -H "Authorization: Bearer $DEPLOYS_TOKEN" \
   -d '{ "project": "acme", "location": "gke.cluster-rcf2", "name": "web",
         "status": "open", "sort": "count" }'
 ```

-### `deployment.errorGet` — one issue, with the stack
+### `error.get` — one issue, with the stack

 | Param | Description |
 |---|---|
 | `project` | The project id. |
 | `location` | The deployment's location. |
 | `name` | The deployment name. |
-| `id` | The issue id from `deployment.errors`. |
+| `id` | The issue id from `error.list`. |

 Returns the issue with its `sampleMessage` (the full representative stack) and
 `recentEvents[]` — each `{ pod, timestamp, object, offset }` pointing at an
 occurrence in the captured log history.

 ```bash
-curl https://api.deploys.app/deployment.errorGet \
+curl https://api.deploys.app/error.get \
   -H "Authorization: Bearer $DEPLOYS_TOKEN" \
   -d '{ "project": "acme", "location": "gke.cluster-rcf2", "name": "web",
         "id": "…issue id…" }'
 ```

-### `deployment.errorUpdate` — triage
+### `error.update` — triage

 | Param | Description |
 |---|---|
@@ -185,12 +191,83 @@ Flips an issue's [status](#triage-lifecycle). Setting `resolved` marks it fixed;

 ```bash
 # mark an issue resolved
-curl https://api.deploys.app/deployment.errorUpdate \
+curl https://api.deploys.app/error.update \
   -H "Authorization: Bearer $DEPLOYS_TOKEN" \
   -d '{ "project": "acme", "location": "gke.cluster-rcf2", "name": "web",
         "id": "…issue id…", "status": "resolved" }'
 ```

+## Reporting errors from your app — `error.create`
+
+Log mining catches what your app **prints**. But some errors you'd rather report
+explicitly — a handled exception you recover from, an error that never reaches
+`stderr`, or one you want to enrich with structured frames from your own SDK.
+`error.create` lets a running deployment (or an SDK embedded in it) report its own
+application errors directly, instead of relying only on log mining.
+
+A reported error and a log-mined trace with the **same stack signature merge into one
+issue** — they share the same [fingerprint](#how-it-works) — so reporting and mining
+reinforce each other rather than double-counting.
+
+### Request shape
+
+| Field | Description |
+|---|---|
+| `project` | The project id. |
+| `location` | The deployment's location. |
+| `name` | The deployment that's reporting (the deployment name). |
+| `events` | The batch of error events — **up to 100** per call. |
+
+Each entry in `events[]`:
+
+| Field | Description |
+|---|---|
+| `type` | **Required.** The exception type, e.g. `panic`, `java.lang.NullPointerException`. This is the only field that ever appears in a notification. |
+| `kind` | One of `go`, `java`, `python`, `node`, `ruby`, `generic`. Defaults to `generic`. |
+| `title` | A short human title for the issue. |
+| `frames` | The stack frames — each `{ func, file, line }`. The fingerprint is computed from these, so they decide which issue the event merges into. |
+| `sample` | A representative full stack/message string for the issue detail. |
+| `pod` | The reporting pod name. |
+| `ts` | The occurrence timestamp. |
+
+### Auth
+
+The reporting app authenticates as an identity that holds the `error.create`
+permission — pick whichever fits how your workload already authenticates, no new
+infrastructure either way:
+
+- a project **[service-account key](/access/service-accounts/)** — the same kind of
+  key your CI or [MCP](/automation/mcp/) local mode uses; or
+- a **scoped token** from `me.generateToken` attenuated to `error.create`. This is
+  the least-privilege option: the token can do nothing but report errors, and you can
+  keep it short-lived.
+
+{{< callout type="note" >}}
+The [notification](#notifications) for a new or regressed issue carries only the
+exception **`type`** — **never** the `title` or `sample`, which can echo application
+data your app put in the error. Titles and samples stay behind the `error.get`
+permission, in the issue itself.
+{{< /callout >}}
+
+```bash
+# a running app reports one handled exception
+curl https://api.deploys.app/error.create \
+  -H "Authorization: Bearer $DEPLOYS_TOKEN" \
+  -d '{ "project": "acme", "location": "gke.cluster-rcf2", "name": "web",
+        "events": [
+          { "kind": "go",
+            "type": "panic",
+            "title": "runtime error: invalid memory address or nil pointer dereference",
+            "frames": [
+              { "func": "main.(*Handler).Serve", "file": "handler.go", "line": 142 },
+              { "func": "net/http.(*conn).serve", "file": "server.go", "line": 2092 }
+            ],
+            "sample": "panic: runtime error: invalid memory address or nil pointer dereference\n\tmain.(*Handler).Serve(...)\n\t\thandler.go:142",
+            "pod": "web-7d9c8b6f4-abcde",
+            "ts": "2026-06-21T10:04:00Z" }
+        ] }'
+```
+
 ### Kinds

 The `kind` field tells you which runtime threw, and drives the icon in the console:
@@ -206,16 +283,18 @@ The `kind` field tells you which runtime threw, and drives the icon in the conso

 ## From the CLI and AI assistants

-The same three actions are available outside the console:
+The `error.*` actions are available outside the console too:

-- The **CLI** surfaces them under `deploys deployment errors` (list, get, and
-  resolve), so a script or CI job can read and triage issues without the console.
-- The **[MCP server](/automation/mcp/)** exposes the error-listing and
-  error-detail actions, so an AI assistant can pull up a deployment's open issues
-  and read the stack as part of a diagnose-and-fix loop.
+- The **CLI** surfaces them under `deploys error` — `list`, `get`, and `update` to
+  read and triage issues, plus `deploys error report` to send an `error.create`
+  event from a script or CI job without the console.
+- The **[MCP server](/automation/mcp/)** exposes the error-listing and error-detail
+  tools (and an `error.create` tool), so an AI assistant can pull up a deployment's
+  open issues, read the stack, and even report errors as part of a
+  diagnose-and-fix loop.

-Both wrap the same `deployment.logs`-gated API, so they can only see what your
-identity is allowed to.
+Both wrap the same `error.*` API, so they can only see and do what your identity is
+allowed to.

 ## Retention

From 96c41f3a4b584bf26b3a1e5110e5cd99a3d30e3a Mon Sep 17 00:00:00 2001
From: Thanatat Tamtan
Date: Sun, 21 Jun 2026 22:34:19 +0700
Subject: [PATCH 2/2] error-detection: notification event is error.detected, not deployment.error

The api PR renamed the new/regressed-issue change event from
deployment.error to error.detected (resource error, action detected) as
part of the error.* module. Update the Notifications section's event name,
the subscribe example, and the wildcard hint (error.* not deployment.*).

Co-Authored-By: Claude Opus 4.8
Claude-Session: https://claude.ai/code/session_011d4bVuGLnCbcJD9ZvastPH
---
 content/deployments/error-detection.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/deployments/error-detection.md b/content/deployments/error-detection.md
index a93e949..bf2b8b8 100644
--- a/content/deployments/error-detection.md
+++ b/content/deployments/error-detection.md
@@ -97,15 +97,15 @@ every non-static deployment.

 ## Notifications

-A **new** issue, or a **resolved** issue that **regresses**, fires a
-[`deployment.error`](/automation/notification-channels/) change event. Like every
+A **new** issue, or a **resolved** issue that **regresses**, fires an
+[`error.detected`](/automation/notification-channels/) change event. Like every
 change event, it's delivered to the project's configured
 [notification channels](/automation/notification-channels/) — a webhook, a Discord
 channel, or a pull queue.
 Only those two state transitions fire, so a recurring error doesn't re-notify on
 every occurrence, and a **muted** issue never fires at all.

-Subscribe to `deployment.error` (or the wildcard `deployment.*`) on a channel to
+Subscribe to `error.detected` (or the wildcard `error.*`) on a channel to
 route application errors where your team will see them:

 ```bash
@@ -113,7 +113,7 @@ route application errors where your team will see them:
 deploys notification create --project acme --name app-errors \
   --type discord \
   --url https://discord.com/api/webhooks/123/abc \
-  --event deployment.error
+  --event error.detected
 ```

 {{< callout type="note" >}}
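
The `error.create` request shape documented in this series can also be exercised from application code. The following is a minimal Python sketch, not an official SDK. It reuses only the documented field names (`events[]`, `type`, `kind`, `title`, `frames[].func/file/line`, `sample`, `pod`, `ts`), the 100-event batch limit, and the endpoint from the curl example. The helper names (`event_from_exception`, `build_requests`, `send`), the innermost-first frame order, the `Content-Type` header, and the use of the qualified class name for `type` are assumptions made for illustration.

```python
import json
import traceback
import urllib.request
from datetime import datetime, timezone

API_URL = "https://api.deploys.app/error.create"  # endpoint from the curl example
MAX_EVENTS_PER_CALL = 100  # documented batch limit per call


def event_from_exception(exc, pod=None):
    """Turn a caught exception into one events[] entry (hypothetical helper)."""
    cls = type(exc)
    # Assumption: report builtins by bare name, everything else fully qualified.
    if cls.__module__ == "builtins":
        exc_type = cls.__qualname__
    else:
        exc_type = f"{cls.__module__}.{cls.__qualname__}"
    # Assumption: innermost frame first, mirroring the Go example in the docs.
    frames = [
        {"func": f.name, "file": f.filename, "line": f.lineno}
        for f in reversed(traceback.extract_tb(exc.__traceback__))
    ]
    event = {
        "type": exc_type,  # the only required field
        "kind": "python",
        "title": str(exc),
        "frames": frames,
        "sample": "".join(traceback.format_exception(cls, exc, exc.__traceback__)),
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    if pod:
        event["pod"] = pod
    return event


def build_requests(project, location, name, events):
    """Split events into request bodies of at most 100 events each."""
    return [
        {"project": project, "location": location, "name": name,
         "events": events[i:i + MAX_EVENTS_PER_CALL]}
        for i in range(0, len(events), MAX_EVENTS_PER_CALL)
    ]


def send(body, token):
    """POST one request body. Defined only; it needs a real token to run."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},  # assumption
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def risky():
    raise ValueError("bad input")


try:
    risky()
except ValueError as exc:
    event = event_from_exception(exc, pod="web-7d9c8b6f4-abcde")

bodies = build_requests("acme", "gke.cluster-rcf2", "web", [event])
```

`send` is defined but never called here, since it needs a token that holds the `error.create` permission. Whether an event reported this way shares a fingerprint with a log-mined Python traceback depends on how the platform normalizes frames, which the series does not specify.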