From 34bc9cb1d7c2bddbcfba59b37d07436aa3ece8ce Mon Sep 17 00:00:00 2001
From: Manas Srivastava
Date: Sat, 6 Jun 2026 09:31:10 +0530
Subject: [PATCH] docs: agent-facing deploy-failure auto-debug guide + llms.txt reference (Task #69)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Make the deploy-failure auto-debug path discoverable by AI agents from
llms.txt. A failed deploy classifies the cause and serves it back over
HTTP — agents can self-recover without cluster access.

- New docs page docs/troubleshooting-deploys.md (served at
  /docs#troubleshooting-deploys + /docs/troubleshooting-deploys.md): the
  GET /api/v1/deployments/:id/events autopsy loop (reason + last_lines +
  hint -> fix -> POST /deploy/:id/redeploy -> re-poll), the live SSE build
  log (GET /deploy/:id/logs), the thinner anonymous-stacks path
  (GET /stacks/:slug + /stacks/:slug/logs/:svc, no /events autopsy), and
  honest caveats (email delivery blocked, diagnostics-pending window,
  thinner runtime crash-loop diagnostics).
- llms.txt: new "Debugging a failed deploy" section pointing agents at
  /events (reason, last_lines, hint) with a link to the full guide, plus
  an md-mirror entry under the text-only routes list.
- Renumbered later docs orders (claim 6, auth 7, limits 8,
  machine-readable 9) to slot the troubleshooting page after deploy/stacks.

All endpoint paths verified against the live api router (deploy.go /
stack.go / router.go): /deployments/:id, /deployments/:id/events,
/deploy/:id/logs, /deploy/:id/redeploy, /stacks/:slug,
/stacks/:slug/logs/:svc.

Source: InstaNode-dev/docs ci/02-FAILURE-DIAGNOSIS-AND-AUTODEBUG.md.
Co-Authored-By: Claude Opus 4.8
---
 docs/authentication.md          |   2 +-
 docs/claim.md                   |   2 +-
 docs/limits.md                  |   2 +-
 docs/machine-readable.md        |   2 +-
 docs/troubleshooting-deploys.md | 139 ++++++++++++++++++++++++++++++++
 llms.txt                        |  13 ++-
 6 files changed, 155 insertions(+), 5 deletions(-)
 create mode 100644 docs/troubleshooting-deploys.md

diff --git a/docs/authentication.md b/docs/authentication.md
index dc8f73d..6b9f0ea 100644
--- a/docs/authentication.md
+++ b/docs/authentication.md
@@ -1,6 +1,6 @@
 ---
 title: Authentication
-order: 6
+order: 7
 ---
 
 Resource provisioning is anonymous. Everything else (deploy, vault, billing,
diff --git a/docs/claim.md b/docs/claim.md
index 3147e58..1bc681d 100644
--- a/docs/claim.md
+++ b/docs/claim.md
@@ -1,6 +1,6 @@
 ---
 title: Claim flow (anonymous → paid)
-order: 5
+order: 6
 ---
 
 Anonymous resources expire in 24 hours. To keep them, claim them.
diff --git a/docs/limits.md b/docs/limits.md
index 42bf088..70009cd 100644
--- a/docs/limits.md
+++ b/docs/limits.md
@@ -1,6 +1,6 @@
 ---
 title: Tiers and limits
-order: 7
+order: 8
 ---
 
 | Tier | Postgres | Redis | MongoDB | TTL | Price |
diff --git a/docs/machine-readable.md b/docs/machine-readable.md
index 7198b30..2054007 100644
--- a/docs/machine-readable.md
+++ b/docs/machine-readable.md
@@ -1,6 +1,6 @@
 ---
 title: Machine-readable API
-order: 8
+order: 9
 ---
 
 The full API surface is described in OpenAPI 3.1 at:
diff --git a/docs/troubleshooting-deploys.md b/docs/troubleshooting-deploys.md
new file mode 100644
index 0000000..492011b
--- /dev/null
+++ b/docs/troubleshooting-deploys.md
@@ -0,0 +1,139 @@
+---
+title: Debugging a failed deploy (for AI agents)
+order: 5
+---
+
+A deploy can fail at build (bad Dockerfile, missing file in the tarball,
+dependency error) or roll out but crash at runtime. You do **not** need
+cluster access to diagnose it — the platform classifies the failure and
+serves the real error back to you over HTTP.
+This page is written for an
+AI agent running a deploy → fix → redeploy loop.
+
+## The auto-debug loop (authenticated deploys via `POST /deploy/new`)
+
+When a deploy you started with `POST /deploy/new` ends up `failed`, run
+this loop. The reliable machine surface is `GET /api/v1/deployments/:id/events` —
+it is self-contained (reason + last_lines + hint) and needs only your
+session token.
+
+1. **Watch the build live (optional).** Stream the build log over SSE while
+   it builds:
+
+   ```
+   curl -N https://api.instanode.dev/deploy//logs \
+     -H "Authorization: Bearer $INSTANODE_TOKEN"
+   ```
+
+2. **Get the one-line status + summary.**
+
+   ```
+   curl https://api.instanode.dev/api/v1/deployments/ \
+     -H "Authorization: Bearer $INSTANODE_TOKEN"
+   ```
+
+   Returns `status` (`building` / `failed` / `running` / `expired`) and,
+   on failure, `error_message` — a `: ` summary
+   (Kaniko error / ImagePullBackOff / BackoffLimitExceeded / DeadlineExceeded).
+
+3. **Read the classified cause — the real error.** This is the surface to
+   act on:
+
+   ```
+   curl https://api.instanode.dev/api/v1/deployments//events \
+     -H "Authorization: Bearer $INSTANODE_TOKEN"
+   ```
+
+   Returns `{events, count}` where each event is:
+
+   ```json
+   {
+     "kind": "failure_autopsy",
+     "reason": "BackoffLimitExceeded",
+     "exit_code": 1,
+     "event": "...",
+     "last_lines": ["...", "the tail of the build-pod log — the real error output"],
+     "hint": "plain-language remedy",
+     "created_at": "2026-06-06T..."
+   }
+   ```
+
+   - `reason` — the classified failure class.
+   - `last_lines` — the **tail of the build-pod logs**, the actual compiler /
+     installer / Kaniko output that explains the failure. Read this first.
+   - `hint` — a plain-language remedy for that reason.
+
+4. **Fix it.** Edit the Dockerfile, the tarball contents, the `port`, or
+   the `env_vars` per `hint` + `last_lines`.
+   Common cases: a missing file
+   that needed to be in the tar, a build step that needs a dependency, a
+   wrong base image, an app that listens on a port other than the one you
+   passed.
+
+5. **Redeploy in place** (same `app_id`, same URL, slot count unchanged):
+
+   ```
+   curl -X POST https://api.instanode.dev/deploy//redeploy \
+     -H "Authorization: Bearer $INSTANODE_TOKEN"
+   ```
+
+   Or pass `redeploy=true` on `POST /deploy/new` with the **same** `name`
+   you used originally — the platform rebuilds the existing deployment in
+   place and the response carries `"redeployed": true`. (Without
+   `redeploy=true` a fresh `POST /deploy/new` mints a NEW app and a NEW
+   URL, even when `name` collides.)
+
+6. **Re-verify.** Poll `GET /api/v1/deployments/` until `status` is
+   `running` — or loop back to step 3 if it failed again.
+
+## Anonymous deploys (via `POST /stacks/new`)
+
+Anonymous (no-Bearer) callers cannot use `/deploy/new` — they deploy via
+`POST /stacks/new` (anonymous stacks carry no team and expire after a 6h
+TTL). The failure-diagnosis path for an anonymous stack is **thinner**:
+
+1. **Status + raw error.** Read the stack by its slug (no auth needed —
+   the slug is the bearer):
+
+   ```
+   curl https://api.instanode.dev/api/v1/stacks/
+   ```
+
+   On failure this returns `status="failed"` plus the raw error string.
+
+2. **Per-service build logs.**
+
+   ```
+   curl https://api.instanode.dev/stacks//logs/
+   ```
+
+Anonymous stacks do **not** have the classified `/events` autopsy
+(`reason` / `last_lines` / `hint`) — there is no `/stacks/:slug/events`
+endpoint. That is a known thinner path: anonymous deploys get **status +
+raw error + logs**, not the classified autopsy. Claim/upgrade to deploy
+via `/deploy/new` for the full debug surface.
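A minimal Python sketch of the decision step in the authenticated loop described in this page, included for illustration and not part of the patch itself. It assumes the status values and event fields the page documents (`status`, `kind`, `reason`, `created_at`) and that `status` and `events` are top-level keys in the two responses. The helper names `fetch_json`, `latest_autopsy`, `next_action`, and `diagnose` are hypothetical, as is the assumption that `created_at` is an ISO-8601 string in one timezone, so that plain string comparison orders events by time.

```python
import json
import urllib.request

API_BASE = "https://api.instanode.dev"  # host used by the curl examples in the page


def fetch_json(path, token):
    """GET an API path with the session token and decode the JSON body."""
    req = urllib.request.Request(
        API_BASE + path,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def latest_autopsy(events):
    """Return the newest failure_autopsy event, or None if there is none."""
    autopsies = [e for e in events if e.get("kind") == "failure_autopsy"]
    if not autopsies:
        return None
    # Assumption: created_at strings sort chronologically when compared as text.
    return max(autopsies, key=lambda e: e.get("created_at", ""))


def next_action(status, events):
    """Map a deployment status plus its events to the next step of the loop."""
    if status == "running":
        return "done"
    if status == "building":
        return "wait"
    if status == "failed":
        autopsy = latest_autopsy(events)
        # Empty events or reason "Unknown" matches the documented
        # "diagnostics pending" window, so poll again before acting.
        if autopsy is None or autopsy.get("reason") == "Unknown":
            return "repoll"
        return "fix_and_redeploy"
    return "stop"  # any other status, for example "expired"


def diagnose(deploy_id, token):
    """Fetch status and events for one deployment and pick the next step."""
    status = fetch_json(f"/api/v1/deployments/{deploy_id}", token)["status"]
    events = fetch_json(f"/api/v1/deployments/{deploy_id}/events", token)["events"]
    return next_action(status, events)
```

Keeping `next_action` free of network calls means the branching can be checked offline. An agent would call `diagnose` in a loop, sleeping briefly on `wait` or `repoll`, and would read `hint` and `last_lines` from `latest_autopsy` before redeploying.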
+
+## Caveats (read these — they affect how you diagnose)
+
+- **Don't rely on the failure email.** A `failed` deploy records a failure
+  notification, but transactional email delivery is currently blocked (the
+  sender domain isn't validated in prod), so the email may not reach a real
+  inbox. Use `GET /api/v1/deployments/:id/events` and the dashboard
+  failure-autopsy panel as the source of truth, not email.
+- **"Diagnostics pending" window.** For a few seconds right after a
+  failure the autopsy is still capturing the build-pod logs — `/events`
+  may be empty or carry `reason="Unknown"`. Wait a moment and re-poll.
+- **Runtime crash-loops are thinner than build failures.** A deploy that
+  *builds* fine but crash-loops at runtime (CrashLoopBackOff, OOMKilled,
+  readiness-probe failure) has fewer customer-facing diagnostics today than
+  the build-failure autopsy. Build-failure diagnosis is the
+  well-instrumented path; deeper runtime crash-loop visibility is a known
+  follow-up.
+
+## Surfaces at a glance
+
+| Surface | What it gives you |
+| --- | --- |
+| `GET /api/v1/deployments/:id` | `status` + one-line `error_message` |
+| `GET /api/v1/deployments/:id/events` | classified `reason` + `last_lines` + `hint` (the real error — use this) |
+| `GET /deploy/:id/logs` | live build log stream (SSE) |
+| `GET /api/v1/stacks/:slug` | anonymous-stack `status` + raw error string |
+| `GET /stacks/:slug/logs/:svc` | anonymous-stack per-service build logs |
diff --git a/llms.txt b/llms.txt
index e55bf6b..0419658 100644
--- a/llms.txt
+++ b/llms.txt
@@ -123,7 +123,8 @@ Every page has a `.md` mirror at the same path.
Examples:
 - [/pricing.md](https://instanode.dev/pricing.md) — tier comparison
 - [/for-agents.md](https://instanode.dev/for-agents.md) — agent-specific guidance
 - [/status.md](https://instanode.dev/status.md) — current status
-- [/docs.md](https://instanode.dev/docs.md) — quickstart + service reference (all 9 sections concatenated)
+- [/docs.md](https://instanode.dev/docs.md) — quickstart + service reference (all sections concatenated)
+- [/docs/troubleshooting-deploys.md](https://instanode.dev/docs/troubleshooting-deploys.md) — agent self-recovery guide for a failed deploy (the `/events` autopsy loop)
 - [/blog.md](https://instanode.dev/blog.md) — blog index (post titles + dates + links)
 - [/blog/.md](https://instanode.dev/blog/) — every post
 - [/use-cases.md](https://instanode.dev/use-cases.md) — catalogue index (all 104 grouped by category)
@@ -162,6 +163,16 @@ Anonymous tier (24h TTL): no email on file, `DELETE` returns 200 immediately. Se
 - [Deployments](https://instanode.dev/docs#deploy): POST /deploy/new + private deploys + custom domains + two-step email-confirmed deletion + 24h-TTL semantics (default; three routes to keep a deploy permanent)
 
+## Debugging a failed deploy (agent self-recovery)
+
+Deploy failed? You don't need cluster access — the platform classifies the failure and serves the real error back over HTTP. Read the classified cause at `GET /api/v1/deployments/:id/events` → `events[]` with `reason`, `last_lines` (the tail of the build-pod log = the actual error output), and `hint` (a plain-language remedy). For the one-line summary use `GET /api/v1/deployments/:id` → `status` + `error_message`. The loop: read `reason` + `last_lines` + `hint` → fix the Dockerfile / tarball / `port` / `env_vars` → redeploy in place (`POST /deploy/:id/redeploy`, or `redeploy=true` on `POST /deploy/new` with the same `name`) → re-poll status until `running`. Watch the build live with the SSE stream `GET /deploy/:id/logs`.
+
+**Anonymous deploys** go through `POST /stacks/new` and have a thinner failure path: `GET /api/v1/stacks/:slug` returns `status="failed"` + the raw error string, and `GET /stacks/:slug/logs/:svc` returns per-service build logs — but there is **no** classified `/events` autopsy for anonymous stacks (a known thinner path). Claim/upgrade and deploy via `/deploy/new` for the full `reason`/`last_lines`/`hint` surface.
+
+**Caveats:** don't rely on the failure email (transactional email delivery is currently blocked — use `/events` and the dashboard failure-autopsy panel instead); there is a brief "diagnostics pending" window right after a failure where `/events` may be empty or `reason="Unknown"` (re-poll); and runtime crash-loops have thinner diagnostics today than build failures.
+
+Full guide: [https://instanode.dev/docs#troubleshooting-deploys](https://instanode.dev/docs#troubleshooting-deploys) (markdown mirror: [/docs/troubleshooting-deploys.md](https://instanode.dev/docs/troubleshooting-deploys.md)).
+
 ## How to use this file
 
 If you're an LLM helping a user build something, you can: