From 34bc9cb1d7c2bddbcfba59b37d07436aa3ece8ce Mon Sep 17 00:00:00 2001
From: Manas Srivastava <mastermanas805@gmail.com>
Date: Sat, 6 Jun 2026 09:31:10 +0530
Subject: [PATCH] docs: agent-facing deploy-failure auto-debug guide + llms.txt
 reference (Task #69)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Make the deploy-failure auto-debug path discoverable by AI agents from
llms.txt. When a deploy fails, the platform classifies the cause and
serves it back over HTTP, so agents can self-recover without cluster
access.

- New docs page docs/troubleshooting-deploys.md (served at
  /docs#troubleshooting-deploys + /docs/troubleshooting-deploys.md):
  the GET /api/v1/deployments/:id/events autopsy loop
  (reason + last_lines + hint -> fix -> POST /deploy/:id/redeploy ->
  re-poll), the live SSE build log (GET /deploy/:id/logs), the thinner
  anonymous-stacks path (GET /stacks/:slug + /stacks/:slug/logs/:svc,
  no /events autopsy), and honest caveats (email delivery blocked,
  diagnostics-pending window, thinner runtime crash-loop diagnostics).
- llms.txt: new "Debugging a failed deploy" section pointing agents at
  /events (reason, last_lines, hint) with a link to the full guide, plus
  an md-mirror entry under the text-only routes list.
- Renumbered later docs orders (claim 6, auth 7, limits 8,
  machine-readable 9) to slot the troubleshooting page after deploy/stacks.

All endpoint paths verified against the live api router (deploy.go /
stack.go / router.go): /deployments/:id, /deployments/:id/events,
/deploy/:id/logs, /deploy/:id/redeploy, /stacks/:slug,
/stacks/:slug/logs/:svc.

Source: InstaNode-dev/docs ci/02-FAILURE-DIAGNOSIS-AND-AUTODEBUG.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/authentication.md          |   2 +-
 docs/claim.md                   |   2 +-
 docs/limits.md                  |   2 +-
 docs/machine-readable.md        |   2 +-
 docs/troubleshooting-deploys.md | 139 ++++++++++++++++++++++++++++++++
 llms.txt                        |  13 ++-
 6 files changed, 155 insertions(+), 5 deletions(-)
 create mode 100644 docs/troubleshooting-deploys.md
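
For reviewers, the agent-side decision that the new guide describes could be
sketched roughly as below. This is a minimal Python sketch under stated
assumptions, not code from this repository: the names `next_action` and
`settle`, the returned action labels, and choosing the newest event by
`created_at` are all hypothetical. Only the field names (`status`, `events`,
`kind`, `reason`, `last_lines`, `hint`) are taken from the guide. The HTTP
calls are left to caller-supplied functions.

```python
import time
from typing import Any, Callable, Dict


def next_action(status: str, events_payload: Dict[str, Any]) -> Dict[str, Any]:
    """Map a deployment status plus an /events payload to the agent's next step."""
    if status == "running":
        return {"action": "done"}
    if status == "building":
        return {"action": "wait"}
    if status != "failed":
        # Any other status (e.g. "expired") is outside the loop the guide covers.
        return {"action": "stop", "status": status}

    autopsies = [
        e for e in events_payload.get("events", [])
        if e.get("kind") == "failure_autopsy"
    ]
    if not autopsies:
        # "Diagnostics pending" window: the autopsy has not been captured yet.
        return {"action": "repoll_events"}

    # Assumption: created_at values are ISO-8601 strings that sort chronologically.
    latest = max(autopsies, key=lambda e: e.get("created_at", ""))
    if latest.get("reason", "Unknown") == "Unknown":
        return {"action": "repoll_events"}

    return {
        "action": "fix_and_redeploy",
        "reason": latest["reason"],
        "hint": latest.get("hint", ""),
        "last_lines": latest.get("last_lines", []),
    }


def settle(
    fetch_status: Callable[[], str],
    fetch_events: Callable[[], Dict[str, Any]],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 30,
    delay_s: float = 2.0,
) -> Dict[str, Any]:
    """Poll until the deploy is running, needs a fix, or the poll budget runs out."""
    for _ in range(max_polls):
        status = fetch_status()
        # Only ask for events once the deploy has actually failed.
        payload = fetch_events() if status == "failed" else {}
        step = next_action(status, payload)
        if step["action"] not in ("wait", "repoll_events"):
            return step
        sleep(delay_s)
    return {"action": "timeout"}
```

A caller would pass functions that GET `/api/v1/deployments/<id>` and
`/api/v1/deployments/<id>/events` with the Bearer token, then respond to
`fix_and_redeploy` by editing the build inputs and calling
`POST /deploy/<id>/redeploy`.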

diff --git a/docs/authentication.md b/docs/authentication.md
index dc8f73d..6b9f0ea 100644
--- a/docs/authentication.md
+++ b/docs/authentication.md
@@ -1,6 +1,6 @@
 ---
 title: Authentication
-order: 6
+order: 7
 ---
 
 Resource provisioning is anonymous. Everything else (deploy, vault, billing,
diff --git a/docs/claim.md b/docs/claim.md
index 3147e58..1bc681d 100644
--- a/docs/claim.md
+++ b/docs/claim.md
@@ -1,6 +1,6 @@
 ---
 title: Claim flow (anonymous → paid)
-order: 5
+order: 6
 ---
 
 Anonymous resources expire in 24 hours. To keep them, claim them.
diff --git a/docs/limits.md b/docs/limits.md
index 42bf088..70009cd 100644
--- a/docs/limits.md
+++ b/docs/limits.md
@@ -1,6 +1,6 @@
 ---
 title: Tiers and limits
-order: 7
+order: 8
 ---
 
 | Tier       | Postgres    | Redis     | MongoDB      | TTL  | Price       |
diff --git a/docs/machine-readable.md b/docs/machine-readable.md
index 7198b30..2054007 100644
--- a/docs/machine-readable.md
+++ b/docs/machine-readable.md
@@ -1,6 +1,6 @@
 ---
 title: Machine-readable API
-order: 8
+order: 9
 ---
 
 The full API surface is described in OpenAPI 3.1 at:
diff --git a/docs/troubleshooting-deploys.md b/docs/troubleshooting-deploys.md
new file mode 100644
index 0000000..492011b
--- /dev/null
+++ b/docs/troubleshooting-deploys.md
@@ -0,0 +1,139 @@
+---
+title: Debugging a failed deploy (for AI agents)
+order: 5
+---
+
+A deploy can fail at build (bad Dockerfile, missing file in the tarball,
+dependency error) or roll out but crash at runtime. You do **not** need
+cluster access to diagnose it — the platform classifies the failure and
+serves the real error back to you over HTTP. This page is written for an
+AI agent running a deploy → fix → redeploy loop.
+
+## The auto-debug loop (authenticated deploys via `POST /deploy/new`)
+
+When a deploy you started with `POST /deploy/new` ends up `failed`, run
+this loop. The reliable machine surface is `GET /api/v1/deployments/:id/events` —
+it is self-contained (reason + last_lines + hint) and needs only your
+session token.
+
+1. **Watch the build live (optional).** Stream the build log over SSE while
+   it builds:
+
+   ```
+   curl -N https://api.instanode.dev/deploy/<id>/logs \
+     -H "Authorization: Bearer $INSTANODE_TOKEN"
+   ```
+
+2. **Get the one-line status + summary.**
+
+   ```
+   curl https://api.instanode.dev/api/v1/deployments/<id> \
+     -H "Authorization: Bearer $INSTANODE_TOKEN"
+   ```
+
+   Returns `status` (`building` / `failed` / `running` / `expired`) and,
+   on failure, `error_message` — a `<reason>: <hint snippet>` summary
+   (reasons include Kaniko error, ImagePullBackOff, BackoffLimitExceeded, DeadlineExceeded).
+
+3. **Read the classified cause — the real error.** This is the surface to
+   act on:
+
+   ```
+   curl https://api.instanode.dev/api/v1/deployments/<id>/events \
+     -H "Authorization: Bearer $INSTANODE_TOKEN"
+   ```
+
+   Returns `{events, count}` where each event is:
+
+   ```json
+   {
+     "kind": "failure_autopsy",
+     "reason": "BackoffLimitExceeded",
+     "exit_code": 1,
+     "event": "...",
+     "last_lines": ["...", "the tail of the build-pod log — the real error output"],
+     "hint": "plain-language remedy",
+     "created_at": "2026-06-06T..."
+   }
+   ```
+
+   - `reason` — the classified failure class.
+   - `last_lines` — the **tail of the build-pod logs**, the actual compiler /
+     installer / Kaniko output that explains the failure. Read this first.
+   - `hint` — a plain-language remedy for that reason.
+
+4. **Fix it.** Edit the Dockerfile, the tarball contents, the `port`, or
+   the `env_vars` per `hint` + `last_lines`. Common cases: a missing file
+   that needed to be in the tar, a build step that needs a dependency, a
+   wrong base image, an app that listens on a port other than the one you
+   passed.
+
+5. **Redeploy in place** (same `app_id`, same URL, slot count unchanged):
+
+   ```
+   curl -X POST https://api.instanode.dev/deploy/<id>/redeploy \
+     -H "Authorization: Bearer $INSTANODE_TOKEN"
+   ```
+
+   Or pass `redeploy=true` on `POST /deploy/new` with the **same** `name`
+   you used originally — the platform rebuilds the existing deployment in
+   place and the response carries `"redeployed": true`. (Without
+   `redeploy=true` a fresh `POST /deploy/new` mints a NEW app and a NEW
+   URL, even when `name` collides.)
+
+6. **Re-verify.** Poll `GET /api/v1/deployments/<id>` until `status` is
+   `running` — or loop back to step 3 if it failed again.
+
+## Anonymous deploys (via `POST /stacks/new`)
+
+Anonymous (no-Bearer) callers cannot use `/deploy/new` — they deploy via
+`POST /stacks/new` (anonymous stacks carry no team and expire after a 6h
+TTL). The failure-diagnosis path for an anonymous stack is **thinner**:
+
+1. **Status + raw error.** Read the stack by its slug (no auth header
+   needed; the slug itself acts as the credential):
+
+   ```
+   curl https://api.instanode.dev/api/v1/stacks/<slug>
+   ```
+
+   On failure this returns `status="failed"` plus the raw error string.
+
+2. **Per-service build logs.**
+
+   ```
+   curl https://api.instanode.dev/stacks/<slug>/logs/<service>
+   ```
+
+Anonymous stacks do **not** have the classified `/events` autopsy
+(`reason` / `last_lines` / `hint`): there is no `/stacks/:slug/events`
+endpoint. This is a known limitation: anonymous deploys get **status +
+raw error + logs** only. Claim/upgrade to deploy
+via `/deploy/new` for the full debug surface.
+
+## Caveats (read these — they affect how you diagnose)
+
+- **Don't rely on the failure email.** A `failed` deploy records a failure
+  notification, but transactional email delivery is currently blocked (the
+  sender domain isn't validated in prod), so the email may not reach a real
+  inbox. Use `GET /api/v1/deployments/:id/events` and the dashboard
+  failure-autopsy panel as the source of truth, not email.
+- **"Diagnostics pending" window.** For a few seconds right after a
+  failure the autopsy is still capturing the build-pod logs — `/events`
+  may be empty or carry `reason="Unknown"`. Wait a moment and re-poll.
+- **Runtime crash-loop diagnostics are thinner.** A deploy that *builds*
+  fine but crash-loops at runtime (CrashLoopBackOff, OOMKilled,
+  readiness-probe failure) gets fewer customer-facing diagnostics today
+  than a build failure does. Build-failure diagnosis is the
+  well-instrumented path; deeper runtime crash-loop visibility is a known
+  follow-up.
+
+## Surfaces at a glance
+
+| Surface | What it gives you |
+| --- | --- |
+| `GET /api/v1/deployments/:id` | `status` + one-line `error_message` |
+| `GET /api/v1/deployments/:id/events` | classified `reason` + `last_lines` + `hint` (the real error — use this) |
+| `GET /deploy/:id/logs` | live build log stream (SSE) |
+| `GET /api/v1/stacks/:slug` | anonymous-stack `status` + raw error string |
+| `GET /stacks/:slug/logs/:svc` | anonymous-stack per-service build logs |
diff --git a/llms.txt b/llms.txt
index e55bf6b..0419658 100644
--- a/llms.txt
+++ b/llms.txt
@@ -123,7 +123,8 @@ Every page has a `.md` mirror at the same path. Examples:
 - [/pricing.md](https://instanode.dev/pricing.md) — tier comparison
 - [/for-agents.md](https://instanode.dev/for-agents.md) — agent-specific guidance
 - [/status.md](https://instanode.dev/status.md) — current status
-- [/docs.md](https://instanode.dev/docs.md) — quickstart + service reference (all 9 sections concatenated)
+- [/docs.md](https://instanode.dev/docs.md) — quickstart + service reference (all sections concatenated)
+- [/docs/troubleshooting-deploys.md](https://instanode.dev/docs/troubleshooting-deploys.md) — agent self-recovery guide for a failed deploy (the `/events` autopsy loop)
 - [/blog.md](https://instanode.dev/blog.md) — blog index (post titles + dates + links)
 - [/blog/<slug>.md](https://instanode.dev/blog/) — every post
 - [/use-cases.md](https://instanode.dev/use-cases.md) — catalogue index (all 104 grouped by category)
@@ -162,6 +163,16 @@ Anonymous tier (24h TTL): no email on file, `DELETE` returns 200 immediately. Se
 
 - [Deployments](https://instanode.dev/docs#deploy): POST /deploy/new + private deploys + custom domains + two-step email-confirmed deletion + 24h-TTL semantics (default; three routes to keep a deploy permanent)
 
+## Debugging a failed deploy (agent self-recovery)
+
+Deploy failed? You don't need cluster access — the platform classifies the failure and serves the real error back over HTTP. Read the classified cause at `GET /api/v1/deployments/:id/events` → `events[]` with `reason`, `last_lines` (the tail of the build-pod log = the actual error output), and `hint` (a plain-language remedy). For the one-line summary use `GET /api/v1/deployments/:id` → `status` + `error_message`. The loop: read `reason` + `last_lines` + `hint` → fix the Dockerfile / tarball / `port` / `env_vars` → redeploy in place (`POST /deploy/:id/redeploy`, or `redeploy=true` on `POST /deploy/new` with the same `name`) → re-poll status until `running`. Watch the build live with the SSE stream `GET /deploy/:id/logs`.
+
+**Anonymous deploys** go through `POST /stacks/new` and have a thinner failure path: `GET /api/v1/stacks/:slug` returns `status="failed"` + the raw error string, and `GET /stacks/:slug/logs/:svc` returns per-service build logs — but there is **no** classified `/events` autopsy for anonymous stacks (a known thinner path). Claim/upgrade and deploy via `/deploy/new` for the full `reason`/`last_lines`/`hint` surface.
+
+**Caveats:** don't rely on the failure email (transactional email delivery is currently blocked — use `/events` and the dashboard failure-autopsy panel instead); there is a brief "diagnostics pending" window right after a failure where `/events` may be empty or `reason="Unknown"` (re-poll); and runtime crash-loops have thinner diagnostics today than build failures.
+
+Full guide: [https://instanode.dev/docs#troubleshooting-deploys](https://instanode.dev/docs#troubleshooting-deploys) (markdown mirror: [/docs/troubleshooting-deploys.md](https://instanode.dev/docs/troubleshooting-deploys.md)).
+
 ## How to use this file
 
 If you're an LLM helping a user build something, you can: