Merged
3 changes: 2 additions & 1 deletion content/api/conventions.md
@@ -85,7 +85,8 @@ The big picture. Each row is a fully-qualified API function.
| `deployment.rollback` | Re-apply an older revision as a new revision |
| `deployment.revisions` | History of revisions |
| `deployment.metrics` | CPU / mem / requests / egress time-series |
| `deployment.errors` / `.errorGet` / `.errorUpdate` | [Application error issues](/deployments/error-detection/) — list, fetch, and triage |
| `error.list` / `.get` / `.update` | [Application error issues](/deployments/error-detection/) — list, fetch, and triage |
| `error.create` | [Report an application error](/deployments/error-detection/#reporting-errors-from-your-app--errorcreate) directly from a running deployment |

### Routing

8 changes: 4 additions & 4 deletions content/automation/mcp.md
@@ -175,10 +175,10 @@ All three are read-only and return once (no streaming). See
[Monitoring & debugging](/deployments/monitoring/#reading-logs-and-status-programmatically)
for the contract and the `deployment.logs` permission split.

For triaging recurring crashes, **`deployment.errors`** and
**`deployment.errorGet`** pull up the deployment's grouped, deduplicated error
**issues** — each with an occurrence count and a representative stack — so the
assistant can read what's been throwing without scrolling raw logs. See
For triaging recurring crashes, **`error.list`** and **`error.get`** pull up the
deployment's grouped, deduplicated error **issues** — each with an occurrence count
and a representative stack — so the assistant can read what's been throwing without
scrolling raw logs, and **`error.create`** lets it report an error directly. See
[application error detection](/deployments/error-detection/).

## Permissions & safety
141 changes: 110 additions & 31 deletions content/deployments/error-detection.md
@@ -2,8 +2,8 @@
title: 'Application error detection'
linkTitle: 'Error detection'
weight: 9
description: 'Automatic Sentry-lite error tracking — deploys.app mines your logs for stack traces and groups them into deduplicated issues with counts, a triage lifecycle, and notifications.'
lead: 'Deploys.app reads your deployment''s durable logs for application-level stack traces — Go panics, Java/Python/Node/Ruby exceptions, plus a generic fallback — and groups identical traces into deduplicated issues. Each issue carries an occurrence count, first/last-seen, a representative stack, and recent occurrences, with an open → resolved → reopened triage lifecycle. There is nothing to instrument inside your container.'
description: 'Automatic Sentry-lite error tracking — deploys.app mines your logs for stack traces, or lets your app report errors directly, and groups them into deduplicated issues with counts, a triage lifecycle, and notifications.'
lead: 'Deploys.app reads your deployment''s durable logs for application-level stack traces — Go panics, Java/Python/Node/Ruby exceptions, plus a generic fallback — and groups identical traces into deduplicated issues. Your app can also report its own errors directly via `error.create`, and a reported error merges with a log-mined trace of the same signature into one issue. Each issue carries an occurrence count, first/last-seen, a representative stack, and recent occurrences, with an open → resolved → reopened triage lifecycle. There is nothing you must instrument inside your container.'
---

## What it is
@@ -30,7 +30,7 @@ infrastructure and pod layers:
|---|---|---|
| [`deployment.health`](/automation/notification-channels/#asynchronous-failures-deploymenthealth) / auto-error | infra | no running pods, a deployer apply failure |
| [`deployment.status`](/deployments/monitoring/#reading-logs-and-status-programmatically) | pod | crash-loops, OOM-kills, pod conditions |
| **this** — `deployment.errors` | **application** | **stack traces in your log output** |
| **this** — `error.*` | **application** | **stack traces in your log output, or errors your app reports itself** |

{{< callout type="note" >}}
**Only stack traces become issues.** A lone `ERROR` or `FATAL` log line — one with
@@ -58,7 +58,11 @@ freshest line, read the [logs](/deployments/monitoring/) directly.
Identical traces are grouped by a fingerprint computed from the stack frames — the
function names and files, not the jittery line numbers or the free-text message — so
the same bug firing a thousand times across every replica is **one** issue with
`count: 1000`, not a thousand rows.
`count: 1000`, not a thousand rows. The same fingerprint is shared across both
sources: an error your app [reports directly](#reporting-errors-from-your-app--errorcreate)
with `error.create` and a trace the platform mines from your logs land in the **same
issue** when their stack signatures match, so reporting doesn't double-count what the
log miner would have caught anyway.
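
To make the grouping rule concrete, here is a minimal sketch of a fingerprint that ignores line numbers and messages. It is an illustration only: the SHA-256 hash, the separator bytes, and the `fingerprint` helper are assumptions, not the platform's actual algorithm. The frame shape `{ func, file, line }` follows the `error.create` request.

```python
import hashlib


def fingerprint(frames):
    """Hash only the (func, file) pair of every frame, in stack order.

    Hypothetical helper: the real fingerprint function is not documented here.
    """
    digest = hashlib.sha256()
    for frame in frames:
        # Line numbers and the free-text message are deliberately left out,
        # so they cannot split one bug into several issues.
        digest.update(frame["func"].encode("utf-8"))
        digest.update(b"\x00")
        digest.update(frame["file"].encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# A stack reported by the app.
reported = [
    {"func": "main.(*Handler).Serve", "file": "handler.go", "line": 142},
    {"func": "net/http.(*conn).serve", "file": "server.go", "line": 2092},
]
# The same stack mined from logs after an edit shifted a line number.
mined = [
    {"func": "main.(*Handler).Serve", "file": "handler.go", "line": 147},
    {"func": "net/http.(*conn).serve", "file": "server.go", "line": 2092},
]
# An unrelated stack.
other = [{"func": "main.(*Worker).Run", "file": "worker.go", "line": 10}]

print(fingerprint(reported) == fingerprint(mined))  # True
print(fingerprint(reported) == fingerprint(other))  # False
```

Under these assumptions the reported and mined stacks collapse into one issue, while the unrelated stack gets its own.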

## Triage lifecycle

@@ -87,47 +91,49 @@ every non-static deployment.
- The **issue detail** shows the full representative stack and the recent
occurrences, each linking back to that moment in the deployment's log history.
**Resolve**, **Mute**, and **Reopen** buttons drive the
[lifecycle](#triage-lifecycle); they're gated by the `deployment.logs` permission.
[lifecycle](#triage-lifecycle); they're gated by the `error.update` permission.
- When a deployment has never thrown, the tab reads *"No application errors
detected."*

## Notifications

A **new** issue, or a **resolved** issue that **regresses**, fires a
[`deployment.error`](/automation/notification-channels/) change event. Like every
A **new** issue, or a **resolved** issue that **regresses**, fires an
[`error.detected`](/automation/notification-channels/) change event. Like every
change event, it's delivered to the project's configured
[notification channels](/automation/notification-channels/) — a webhook, a Discord
channel, or a pull queue. Only those two state transitions fire, so a recurring
error doesn't re-notify on every occurrence, and a **muted** issue never fires at
all.
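
The firing rule above can be sketched as a small decision function. This is a hedged illustration, not the platform's code: `notification_reason` is a hypothetical name, and the status strings `reopened` and `muted` are assumed (only `open` and `resolved` appear in the API examples). The two reason prefixes are the ones the notification message carries.

```python
def notification_reason(previous_status):
    """Return the reason prefix for a notification, or None to stay silent.

    previous_status is the issue's status before this occurrence arrived,
    or None when no issue with this fingerprint existed yet.
    """
    if previous_status is None:
        return "new error:"        # first time this fingerprint is seen
    if previous_status == "resolved":
        return "error regressed:"  # a fixed issue came back
    # Assumed statuses: an open or reopened issue is just recurring,
    # and a muted issue never notifies.
    return None


for status in (None, "resolved", "open", "reopened", "muted"):
    print(status, "->", notification_reason(status))
```

Only the first two cases produce a notification, which is why a recurring error stays quiet after its first occurrence.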

Subscribe to `deployment.error` (or the wildcard `deployment.*`) on a channel to
Subscribe to `error.detected` (or the wildcard `error.*`) on a channel to
route application errors where your team will see them:

```bash
# a Discord channel that pings on any new or regressed application error
deploys notification create --project acme --name app-errors \
--type discord \
--url https://discord.com/api/webhooks/123/abc \
--event deployment.error
--event error.detected
```

{{< callout type="note" >}}
The notification message carries only the exception **type** (e.g. `panic`,
`java.lang.NullPointerException`) and a `new error:` / `error regressed:` reason —
**never** the full title or the stack. An app's error message can embed secrets it
logged, and a notification payload must stay secret-free. The full title and the
sample stack live behind the `deployment.logs` permission, in the issue itself.
**never** the full title or the sample stack. An app's error message can embed
secrets it logged, and a notification payload must stay secret-free. The full title
and the sample stack live behind the `error.get` permission, in the issue itself.
{{< /callout >}}

## API

Three actions back the Errors tab. All are gated by the **`deployment.logs`**
permission — the same one that reads [logs](/deployments/monitoring/#permissions) —
because an issue's stack carries the same secret-bearing `stdout`. They reject
`Static` deployments, which have no logs to mine.
The `error.*` module backs the Errors tab. Reads are gated by their own
permissions — `error.list` to list issues and `error.get` to fetch an issue with its
stack — because a stack carries the same secret-bearing `stdout` as the
[logs](/deployments/monitoring/#permissions). Triage is gated by **`error.update`**,
and direct reporting by **`error.create`**. All reject `Static` deployments, which
have no logs to mine.

### `deployment.errors` — list issues
### `error.list` — list issues

| Param | Description |
|---|---|
@@ -144,33 +150,33 @@ firstSeen, lastSeen, samplePod }` — plus a `nextCursor` until the list is
exhausted.

```bash
curl https://api.deploys.app/deployment.errors \
curl https://api.deploys.app/error.list \
-H "Authorization: Bearer $DEPLOYS_TOKEN" \
-d '{ "project": "acme", "location": "gke.cluster-rcf2",
"name": "web", "status": "open", "sort": "count" }'
```
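
Because the response carries a `nextCursor` until the list is exhausted, a client can loop until it disappears. The sketch below shows that loop under loud assumptions: the request field `cursor` and the response field `items` are hypothetical names (this page does not show them), and `fetch_page` stands in for whatever performs the authenticated `error.list` call.

```python
def iter_issues(fetch_page, params):
    """Yield every issue across pages, following nextCursor until it is absent."""
    cursor = None
    while True:
        request = dict(params)
        if cursor:
            request["cursor"] = cursor    # hypothetical request field name
        page = fetch_page(request)
        yield from page.get("items", [])  # hypothetical response field name
        cursor = page.get("nextCursor")
        if not cursor:
            return


# Stand-in for the HTTP call so the sketch runs without a network.
fake_pages = {
    None: {"items": [{"id": "a", "count": 1000}], "nextCursor": "p2"},
    "p2": {"items": [{"id": "b", "count": 3}]},
}


def fake_fetch(request):
    return fake_pages[request.get("cursor")]


query = {"project": "acme", "location": "gke.cluster-rcf2",
         "name": "web", "status": "open", "sort": "count"}
print([issue["id"] for issue in iter_issues(fake_fetch, query)])  # ['a', 'b']
```

Passing the fetch function in keeps the paging logic separate from transport and authentication, which makes it easy to test.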

### `deployment.errorGet` — one issue, with the stack
### `error.get` — one issue, with the stack

| Param | Description |
|---|---|
| `project` | The project id. |
| `location` | The deployment's location. |
| `name` | The deployment name. |
| `id` | The issue id from `deployment.errors`. |
| `id` | The issue id from `error.list`. |

Returns the issue with its `sampleMessage` (the full representative stack) and
`recentEvents[]` — each `{ pod, timestamp, object, offset }` pointing at an
occurrence in the captured log history.

```bash
curl https://api.deploys.app/deployment.errorGet \
curl https://api.deploys.app/error.get \
-H "Authorization: Bearer $DEPLOYS_TOKEN" \
-d '{ "project": "acme", "location": "gke.cluster-rcf2",
"name": "web", "id": "…issue id…" }'
```

### `deployment.errorUpdate` — triage
### `error.update` — triage

| Param | Description |
|---|---|
@@ -185,12 +191,83 @@ Flips an issue's [status](#triage-lifecycle). Setting `resolved` marks it fixed;

```bash
# mark an issue resolved
curl https://api.deploys.app/deployment.errorUpdate \
curl https://api.deploys.app/error.update \
-H "Authorization: Bearer $DEPLOYS_TOKEN" \
-d '{ "project": "acme", "location": "gke.cluster-rcf2",
"name": "web", "id": "…issue id…", "status": "resolved" }'
```

## Reporting errors from your app — `error.create`

Log mining catches what your app **prints**. But some errors you'd rather report
explicitly — a handled exception you recover from, an error that never reaches
`stderr`, or one you want to enrich with structured frames from your own SDK.
`error.create` lets a running deployment (or an SDK embedded in it) report its own
application errors directly, instead of relying only on log mining.

A reported error and a log-mined trace with the **same stack signature merge into one
issue** — they share the same [fingerprint](#how-it-works) — so reporting and mining
reinforce each other rather than double-counting.

### Request shape

| Field | Description |
|---|---|
| `project` | The project id. |
| `location` | The deployment's location. |
| `name` | The name of the reporting deployment. |
| `events` | The batch of error events — **up to 100** per call. |

Each entry in `events[]`:

| Field | Description |
|---|---|
| `type` | **Required.** The exception type, e.g. `panic`, `java.lang.NullPointerException`. This is the only field that ever appears in a notification. |
| `kind` | One of `go`, `java`, `python`, `node`, `ruby`, `generic`. Defaults to `generic`. |
| `title` | A short human title for the issue. |
| `frames` | The stack frames — each `{ func, file, line }`. The fingerprint is computed from these, so they decide which issue the event merges into. |
| `sample` | A representative full stack/message string for the issue detail. |
| `pod` | The reporting pod name. |
| `ts` | The occurrence timestamp, e.g. `2026-06-21T10:04:00Z`. |

### Auth

The reporting app authenticates as an identity that holds the `error.create`
permission. Pick whichever option fits how your workload already authenticates;
neither needs new infrastructure:

- a project **[service-account key](/access/service-accounts/)** — the same kind of
key your CI or [MCP](/automation/mcp/) local mode uses; or
- a **scoped token** from `me.generateToken` attenuated to `error.create`. This is
the least-privilege option: the token can do nothing but report errors, and you can
keep it short-lived.

{{< callout type="note" >}}
The [notification](#notifications) for a new or regressed issue carries only the
exception **`type`** — **never** the `title` or `sample`, which can echo application
data your app put in the error. Titles and samples stay behind the `error.get`
permission, in the issue itself.
{{< /callout >}}

```bash
# a running app reports one handled exception
curl https://api.deploys.app/error.create \
-H "Authorization: Bearer $DEPLOYS_TOKEN" \
-d '{ "project": "acme", "location": "gke.cluster-rcf2", "name": "web",
"events": [
{ "kind": "go",
"type": "panic",
"title": "runtime error: invalid memory address or nil pointer dereference",
"frames": [
{ "func": "main.(*Handler).Serve", "file": "handler.go", "line": 142 },
{ "func": "net/http.(*conn).serve", "file": "server.go", "line": 2092 }
],
"sample": "panic: runtime error: invalid memory address or nil pointer dereference\n\tmain.(*Handler).Serve(...)\n\t\thandler.go:142",
"pod": "web-7d9c8b6f4-abcde",
"ts": "2026-06-21T10:04:00Z" }
] }'
```
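
An SDK that buffers errors has to respect the 100-event cap per call. The following is a rough sketch of client-side batching, not an official SDK: `build_requests` and `MAX_EVENTS_PER_CALL` are hypothetical names, and the validation only mirrors the request shape described above (a required `type`, and a `kind` from the listed set). It builds request bodies and leaves sending them to the caller.

```python
import json

MAX_EVENTS_PER_CALL = 100  # the documented cap on events[] per call
KINDS = {"go", "java", "python", "node", "ruby", "generic"}


def build_requests(project, location, name, events):
    """Validate events and split them into JSON bodies of at most 100 events."""
    for event in events:
        if not event.get("type"):
            raise ValueError("every event needs a 'type'")
        # A missing kind is allowed; the API defaults it to generic.
        if "kind" in event and event["kind"] not in KINDS:
            raise ValueError(f"unknown kind: {event['kind']!r}")
    bodies = []
    for start in range(0, len(events), MAX_EVENTS_PER_CALL):
        bodies.append(json.dumps({
            "project": project,
            "location": location,
            "name": name,
            "events": events[start:start + MAX_EVENTS_PER_CALL],
        }))
    return bodies


buffered = [{"type": "panic", "kind": "go"} for _ in range(250)]
bodies = build_requests("acme", "gke.cluster-rcf2", "web", buffered)
print([len(json.loads(body)["events"]) for body in bodies])  # [100, 100, 50]
```

Each body can then be sent as the payload of one `error.create` call, with the bearer token used in the curl example above.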

### Kinds

The `kind` field tells you which runtime threw, and drives the icon in the console:
@@ -206,16 +283,18 @@ The `kind` field tells you which runtime threw, and drives the icon in the conso

## From the CLI and AI assistants

The same three actions are available outside the console:
The `error.*` actions are available outside the console too:

- The **CLI** surfaces them under `deploys deployment errors` (list, get, and
resolve), so a script or CI job can read and triage issues without the console.
- The **[MCP server](/automation/mcp/)** exposes the error-listing and
error-detail actions, so an AI assistant can pull up a deployment's open issues
and read the stack as part of a diagnose-and-fix loop.
- The **CLI** surfaces them under `deploys error` — `list`, `get`, and `update` to
read and triage issues, plus `deploys error report` to send an `error.create`
event from a script or CI job without the console.
- The **[MCP server](/automation/mcp/)** exposes the error-listing and error-detail
tools (and an `error.create` tool), so an AI assistant can pull up a deployment's
open issues, read the stack, and even report errors as part of a
diagnose-and-fix loop.

Both wrap the same `deployment.logs`-gated API, so they can only see what your
identity is allowed to.
Both wrap the same `error.*` API, so they can only see and do what your identity is
allowed to.

## Retention
