PRD: sanddune v1 — orchestrate AI coding agents in isolated sandboxes

## Problem Statement

I want to run AI coding agents on many issues at once, without trashing my working tree, without supervising each one, and without writing the same orchestration glue in every project. Today I either:

- Run an agent live in my repo and babysit it (no parallelism, no isolation).
- Hand-roll a Docker setup per project, copy-paste prompt loops, and discover at 2am that one agent stomped another's branch.
- Ship the agent into a remote VM and lose the ability to ergonomically inspect commits on my host.

I need a primitive that takes "a prompt, an agent, a sandbox" and reliably returns "a commit on a branch" — so I can build review pipelines, parallel planners, and AFK loops on top.

## Solution

A TypeScript library, **sanddune**, that orchestrates agents inside sandboxes. The user writes a prompt, calls `sanddune.run()`, and gets a commit on a branch. The library is provider-agnostic on both axes:

- **Sandbox providers** — Docker, Podman, Vercel Firecracker microVMs, no-sandbox, or a custom provider implementing a single interface. Two flavors: bind-mount (host directory mounted in) and isolated (sandbox has its own filesystem; sanddune syncs in/out).
- **Agent providers** — Claude Code, Codex, opencode, pi, or a custom provider.

A run goes through three phases: setup (worktree + sandbox + hooks), agent loop (invoke, stream, repeat up to `maxIterations` or until a `completionSignal` fires), and teardown (capture commits, tear down sandbox).

Four programmatic entry points expose increasing levels of control: `run()` (one-shot), `createSandbox()` (reusable sandbox on one branch), `createWorktree()` (worktree as independent lifecycle), `interactive()` (TUI session, sync). A CLI (`sanddune init`, `sanddune docker build-image`, etc.) handles scaffolding and image lifecycle.

## User Stories

1. As a developer, I want to install sanddune as a dev dependency and scaffold a sandbox config with one command, so that I can start running agents in the same afternoon I learned the library exists.
2. As a developer, I want `sanddune init` to refuse to overwrite an existing `.sanddune/` directory, so that I never lose customizations to my Dockerfile or prompt.
3. As a developer, I want to pick my agent (Claude Code / Codex / opencode / pi) and template (`blank`, `simple-loop`, `sequential-reviewer`, `parallel-planner`, `parallel-planner-with-review`) interactively during init, so that I can scaffold the right starter without remembering flag syntax.
4. As a developer, I want `sanddune init` to scaffold `main.mts` when my `package.json` lacks `"type": "module"` and `main.ts` when it declares it, so that the scaffold runs without ESM/CJS friction.
5. As a developer, I want a `prompt.md` file scaffolded into `.sanddune/` as a *convention* (not a magic path sanddune reads on its own), so that I can rename it, ignore it, or reference it explicitly via `promptFile`.
6. As a developer, I want to call `run()` with `agent`, `sandbox`, and `prompt` and have it commit to my repo and tear the sandbox down, so that I can run an agent in three lines of code.
7. As a developer, I want bind-mount providers to default to the **head** branch strategy, so that the fastest path during local development is also the default.
8. As an automation engineer, I want **merge-to-head** as the default for isolated providers, so that AFK runs can never trash my working tree on failure.
9. As a developer, I want to opt into the **branch** strategy with an explicit branch name, so that I can wire sanddune into a PR-creation pipeline.
10. As a developer, I want **head** to be a compile-time error on isolated providers, so that I can't ship an impossible config.
11. As a developer, I want **head** to be a compile-time error on `createWorktree()`, so that the type system rules out worktree-less worktrees.
12. As a developer, I want to import sandbox providers from explicit subpaths (e.g. `@missingstudio/sanddune/sandboxes/docker`), so that my bundle doesn't pull in every provider's dependencies.
13. As a developer, I want `noSandbox()` to be accepted only by `interactive()` and `wt.interactive()`, so that I can't accidentally launch an unsupervised AFK run on the host.
14. As a developer, I want `noSandbox()` to *not* pass `--dangerously-skip-permissions` to the agent, so that Claude Code's normal permission prompts stay active during my interactive sessions.
15. As a developer, I want to write a custom sandbox provider by implementing one factory (`createBindMountSandboxProvider` or `createIsolatedSandboxProvider`), so that I can target any container runtime, VM, or sandbox service my org standardizes on.
16. As a developer, I want every `exec` call to return `{ stdout, stderr, exitCode }`, so that custom providers have no ambiguity about their contract.
17. As a developer, I want streaming output via an optional `onLine` callback on `exec`, so that my custom provider can plug into sanddune's logging without buffering the entire stream.
18. As an automation engineer, I want to call `run()` once and get back `result.iterations`, `result.commits`, `result.branch`, and `result.completionSignal`, so that my downstream code can decide what to do without re-parsing logs.
19. As a developer, I want a default `completionSignal` of `<promise>COMPLETE</promise>`, so that my prompts and the library agree on a convention out of the box.
20. As a developer, I want to override `completionSignal` per run with a string or a list of strings, so that I can model multiple terminal states (e.g. `TASK_COMPLETE` vs `TASK_ABORTED`).
21. As a developer, I want substring-based detection of completion signals against the agent's output stream, so that I don't need a structured protocol — any unique sentinel works.
22. As a developer, I want `result.completionSignal` to tell me *which* signal matched (or `undefined` if none), so that I can branch on the agent's terminal state.
23. As a developer, I want sanddune to never inject the completion signal into my prompt, so that the convention stays a convention I write into my prompt deliberately.
24. As an automation engineer, I want `maxIterations` to bound any agent loop, so that runaway agents don't burn budget unbounded.
25. As a developer, I want `idleTimeoutSeconds` (default 600) to abort the run if the agent goes silent, so that hung subprocesses don't keep the sandbox alive forever.
26. As a developer, I want `idleTimeoutSeconds` to reset on every agent output event, so that long-but-active iterations aren't killed prematurely.
27. As a developer, I want to pass an `AbortSignal` to `run()` and `interactive()`, so that I can cancel from a CLI signal handler or a parent supervisor.
28. As a developer, I want aborting to kill the agent subprocess and any in-flight hooks immediately, so that cancellation is responsive.
29. As a developer, I want aborted runs to preserve the worktree on disk and reject with `signal.reason` verbatim, so that I can inspect partial work and propagate the cancellation cause.
30. As a developer, I want `createSandbox()` to give me a reusable sandbox on a single branch, so that multi-step pipelines (implement → review → revise) don't pay container startup cost per step.
31. As a developer, I want `createSandbox()` to preserve installed dependencies and build artifacts between calls, so that an `npm install` in step 1 doesn't repeat in step 2.
32. As a developer, I want `await using sandbox = await createSandbox(...)` to auto-tear-down via `Symbol.asyncDispose`, so that I never leak containers when an exception escapes the block.
33. As a developer, I want manual `sandbox.close()` to return `{ preservedWorktreePath }`, with the path set only when the worktree was dirty, so that I know where to look on disk after a failure.
34. As a developer, I want clean worktrees auto-removed and dirty worktrees preserved, so that successful runs don't leave clutter and failed runs don't lose work.
35. As a developer, I want `createWorktree()` to give me a worktree as a first-class lifecycle, so that I can run an interactive session, then a sandboxed AFK agent, against the same worktree.
36. As a developer, I want `createWorktree()` to reject `head` at compile time, so that "no worktree" can't sneak past type checks.
37. As a developer, I want `wt.createSandbox()` to be the conceptual primitive (with top-level `createSandbox()` as a bundled convenience), so that the underlying ownership model is explicit when I need it.
38. As a developer, I want split ownership: `sandbox.close()` from `wt.createSandbox()` tears down only the container; `wt.close()` cleans the worktree, so that I can keep the worktree alive across multiple sandbox instances.
39. As a developer, I want top-level `createSandbox()`'s `sandbox.close()` to tear down both container and worktree, so that the convenience case has matching convenience cleanup.
40. As a developer, I want `interactive()` to launch the agent's TUI inside a sandbox or on the host, so that I can drop into a fresh shell with my repo already in place.
41. As a developer, I want `interactive()` to return control synchronously when the user exits the TUI, so that orchestration code can resume cleanly.
42. As a developer, I want top-level `interactive()` to always use the provider's default branch strategy, so that the simple case is unambiguous; for non-default strategies I'll route through `createWorktree() + wt.interactive()`.
43. As a prompt author, I want `prompt: "..."` to pass my string through to the agent literally, so that I never have an unexpected `{{` or `` !` `` interpretation.
44. As a prompt author, I want passing both `prompt` and `promptFile` (or `promptArgs` with `prompt`) to error, so that the contract is unambiguous.
45. As a prompt author, I want `promptFile: "./path.md"` to enable substitution and shell expansion, so that I can write reusable templates with embedded context.
46. As a prompt author, I want `{{KEY}}` placeholders substituted from `promptArgs` *before* shell expansion, so that I can embed args inside `` !`gh issue view {{ISSUE_NUMBER}}` ``.
47. As a prompt author, I want a missing `{{KEY}}` in `run()` (AFK mode) to error with the missing key name, so that typos surface immediately.
48. As a prompt author, I want a missing `{{KEY}}` in `interactive()` to prompt me at the terminal, so that I can recover without restarting the session.
49. As a prompt author, I want unused `promptArgs` keys to log a warning, not fail, so that scripts that pass shared arg maps still work.
50. As a prompt author, I want `{{SOURCE_BRANCH}}` and `{{TARGET_BRANCH}}` injected automatically into every prompt, so that branching info is one substitution away in any template.
51. As a prompt author, I want passing `SOURCE_BRANCH` or `TARGET_BRANCH` in `promptArgs` to error, so that built-in arguments are unambiguously authoritative.
52. As a prompt author, I want `` !`command` `` expressions evaluated in parallel, so that fetching multiple bits of context (issues, commits, status) doesn't serialize.
53. As a prompt author, I want shell expressions evaluated *inside* the sandbox after `sandbox.onSandboxReady` hooks, so that they see the same repo state the agent will see.
54. As a prompt author, I want a non-zero exit from any shell expression to fail the run immediately, so that I never hand a half-rendered prompt to the agent.
55. As a prompt author, I want `` !`...` `` patterns inside `promptArgs` *values* treated as inert text, so that I can pass user-authored content (issue titles, PR descriptions) without command-injection risk.
56. As a Claude Code user, I want sanddune to capture the per-iteration session JSONL to my host at `~/.claude/projects/<encoded-path>/sessions/<id>.jsonl`, so that `claude --resume` works natively after the run.
57. As a Claude Code user, I want `cwd` fields inside captured session files rewritten to my host repo root, so that resume reopens in the right directory.
58. As a Claude Code user, I want `captureSessions: false` to opt out of capture, so that one-off runs don't pollute my sessions directory.
59. As a Claude Code user, I want capture failure to log a warning and leave `sessionFilePath` undefined, but **not** fail the run, so that disk-full or permissions issues don't kill an otherwise-successful agent.
60. As a Claude Code user, I want `resumeSession: "<id>"` to validate the session file exists, transfer it into the sandbox with `cwd` rewritten, and pass `--resume <id>` on iteration 1 only, so that I can continue prior work without ceremony.
61. As a Claude Code user, I want `resumeSession` with `maxIterations > 1` to throw before sandbox creation, so that the contract (resume only iteration 1) is enforced loudly.
62. As a Claude Code user, I want `resumeSession` rejected on `sandbox.run()`, so that long-lived sandboxes — which don't chain Claude session state across calls — can't pretend to.
63. As a developer using a non-Claude agent, I want `captureSessions` and `resumeSession` to be no-ops, so that my agent provider doesn't fail on Claude-specific options.
64. As a developer, I want `host.onWorktreeReady` (sequential) to run after `copyToWorktree` and before sandbox creation, so that I can prep files (e.g. `cp .env.example .env`) before the container starts.
65. As a developer, I want `sandbox.onSandboxReady` (parallel with `host.onSandboxReady`) to run after the sandbox is up, so that I can `npm install` inside the container without blocking host-side observability hooks.
66. As a developer, I want host hooks to be `{ command, timeoutMs? }` only — no `sudo`, no `cwd` — so that the surface is intentionally minimal (use `cd` or inline env in the command).
67. As a developer, I want sandbox hooks to support `sudo: true`, so that I can `apt-get install` build deps inside the container.
68. As a developer, I want hooks to default to a 60s timeout and accept `timeoutMs` per-hook, so that long installs (e.g. 300s for `npm install`) aren't cut off by the default.
69. As a developer, I want any non-zero hook exit to fail setup immediately, so that broken setup never reaches the agent.
70. As a developer, I want the run's `signal` threaded into all hooks, so that aborting cancels in-flight installs.
71. As a developer, I want both **agent provider** and **sandbox provider** to accept an optional `env: Record<string, string>`, so that providers can declare their own credential needs at the type level.
72. As a developer, I want overlap between agent-provider env and sandbox-provider env to throw at launch, so that ambiguous "who owns this key" cases surface loudly.
73. As a developer, I want a 4-source env precedence (lowest → highest: `process.env`, `.sanddune/.env`, provider env, `RunOptions.env`), so that call-site overrides always win and provider declarations always beat ambient env.
74. As a developer, I want `RunOptions.env` to overlap freely with provider env, so that I have a no-friction call-site escape hatch.
75. As a developer, I want `cwd` and `promptFile` to resolve relative paths against `process.cwd()` (caller's perspective), so that scripts moved between directories behave predictably.
76. As a developer, I want `copyToWorktree` to resolve relative paths against `cwd` (target repo's perspective), so that `node_modules` and `.env.example` are conceptually attached to the repo being worked on.
77. As a developer, I want `copyToWorktree` rejected with `branchStrategy: { type: "head" }`, so that I can't accidentally try to copy into a worktree that doesn't exist.
78. As a developer, I want `name` to be prefixed in log output, so that parallel runs don't visually braid in a shared log stream.
79. As a developer, I want **log-to-file mode** as the default for programmatic use, so that `run()` doesn't try to repaint a TTY my orchestrator doesn't have.
80. As a developer, I want **terminal mode** opt-in via `logging: { type: "stdout" }`, so that interactive use gets spinners and styled summaries.
81. As a developer, I want sanddune to print a `tail -f` command for the run log, so that I can follow output without guessing the path.
82. As a developer, I want an `onAgentStreamEvent` callback on the `logging` option to receive each text/toolCall event with `iteration` and `timestamp`, so that I can forward agent output to my observability system without re-implementing parsing.
83. As a developer, I want errors thrown by `onAgentStreamEvent` swallowed, so that a broken forwarder cannot kill my run.
84. As a developer, I want `result.logFilePath` populated only in file mode, so that my code can clearly check `if (result.logFilePath)` before referencing it.
85. As a developer, I want `IterationResult.usage` (input/output/cache tokens) parsed from captured Claude sessions, so that I can budget runs and surface cost without parsing JSONL myself.
86. As a developer, I want `usage` to be `undefined` when capture is off or the agent provider doesn't parse it, so that the absence is unambiguous.
87. As a developer, I want `iterations.length` to give me the iteration count, so that I don't need a separate counter.
88. As a developer, I want commits returned as `{ sha }[]`, so that I can build a PR description, run a review pipeline, or stage further work.
89. As a developer, I want `sandbox.run()` to remain usable after an abort fires mid-iteration, so that I can call `.run()` again with a fresh signal or `.close()` to tear down — partial work is left for me to inspect with `git status`.
90. As a developer, I want `sanddune docker build-image` (and `podman build-image`) to rebuild from an existing `.sanddune/`, so that I can iterate on my Dockerfile without re-scaffolding.
91. As a developer, I want a `--dockerfile` / `--containerfile` flag to point at a custom file with build context = cwd, so that I can prototype image variants without touching `.sanddune/`.
92. As a developer, I want `sanddune docker remove-image` (and `podman remove-image`) to tear down the image cleanly, so that I can free disk space when done.
93. As a developer, I want `sanddune init --image-name ...`, `--agent`, `--model`, `--template` flags to skip interactive prompts, so that I can script init in CI or onboarding tooling.
94. As a developer picking Podman during init, I want a `Containerfile` written instead of `Dockerfile` and Podman-namespaced CLI commands, so that the scaffold matches the runtime I picked.
95. As a developer, I want the default Dockerfile to install Node 22, git, curl, jq, GitHub CLI, Claude Code CLI, and a non-root `agent` user, so that the basic loop works without further config.
96. As a developer customizing the Dockerfile, I want explicit guidance to keep a non-root user, git, gh, and the Claude Code CLI on PATH, so that I don't accidentally break the contract.
97. As a developer, I want Docker/Podman providers to accept `mounts` (with absolute, `~`, and cwd-relative paths), so that I can mount caches like `~/.npm` read-only or share `data/` directories.
98. As a developer, I want Docker/Podman providers to accept `network` (single name or array), so that my container can reach internal services on a private Docker network.
99. As a developer, I want Podman support to handle SELinux labels correctly, so that bind mounts work on Fedora/RHEL hosts without manual `chcon`.
100. As a Vercel user, I want `vercel()` to provision a Firecracker microVM via `@vercel/sandbox`, so that I can fan out cloud isolated runs without managing infra.
101. As a developer, I want a documented `name` field on every provider for telemetry/error messages, so that "the docker provider failed" reads cleanly in logs.
102. As a custom-provider author, I want a reference implementation list (`docker.ts`, `podman.ts`, `vercel.ts`, `test-isolated.ts`) called out in the README, so that I can copy the closest match.
103. As a maintainer, I want isolated providers to live in the type system from day one (even if Vercel lands first), so that custom isolated providers compile against a stable shape.
104. As a developer, I want `claudeCode(model, { effort })` to accept `"low" | "medium" | "high" | "max"`, with `"max"` Opus-only, so that reasoning effort is a one-line config.
105. As a developer, I want `codex(model, { effort })` to accept `"low" | "medium" | "high" | "xhigh"`, mapped to `model_reasoning_effort`, so that Codex tuning matches its native API.
106. As an automation engineer, I want `timeouts: { copyToWorktreeMs }` (default 60s) to override built-in lifecycle timeouts, so that large repos don't fail on the copy step.
107. As a developer running an interactive session, I want `cwd: "/path/to/other-repo"` accepted on `interactive()`, so that I can drop into a TUI in a repo other than `process.cwd()`.
108. As a maintainer, I want **agent invoker** to be an Effect `Context.Tag` service that wraps the raw agent call, so that tests can substitute a recording or scripted fake without running a real agent.
109. As a maintainer, I want each iteration to produce at most one commit by convention, with sanddune still capturing every commit if the agent makes several, so that callers reasoning per-iteration can rely on a stable shape.
110. As a developer, I want sanddune to validate the prerequisites (git installed, sandbox provider available) on first use, so that misconfiguration produces an actionable error before the agent starts.
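
Stories 19 to 24 together describe a bounded loop with substring-based completion detection. The following is a minimal sketch of that behavior, not sanddune's actual implementation: `runLoop`, `ScriptedAgent`, and `matchCompletionSignal` are hypothetical names, the agent is a synchronous stand-in, and the tie-break between several matching signals (list order) is an assumption.

```typescript
// Hypothetical sketch of the bounded agent loop; not sanddune's actual internals.
const DEFAULT_COMPLETION_SIGNAL = "<promise>COMPLETE</promise>";

// Stand-in for one agent invocation: receives the iteration number, returns its output.
type ScriptedAgent = (iteration: number) => string;

interface LoopResult {
  iterations: number; // how many iterations actually ran
  completionSignal: string | undefined; // which signal matched, if any
}

// Return the first configured signal that occurs as a substring of the output.
// Assumption: when several signals appear, list order decides the winner.
function matchCompletionSignal(output: string, signals: string[]): string | undefined {
  return signals.find((signal) => output.includes(signal));
}

function runLoop(
  agent: ScriptedAgent,
  maxIterations: number,
  completionSignal: string | string[] = DEFAULT_COMPLETION_SIGNAL,
): LoopResult {
  const signals = Array.isArray(completionSignal) ? completionSignal : [completionSignal];
  let iterations = 0;
  while (iterations < maxIterations) {
    iterations += 1;
    const matched = matchCompletionSignal(agent(iterations), signals);
    if (matched !== undefined) {
      // Exit early on the first match and report which signal fired.
      return { iterations, completionSignal: matched };
    }
  }
  // The bound was reached without any signal appearing.
  return { iterations, completionSignal: undefined };
}

// Usage: the scripted agent announces completion on its second iteration.
const loopResult = runLoop(
  (iteration) => (iteration === 2 ? "all done TASK_COMPLETE" : "still working"),
  5,
  ["TASK_COMPLETE", "TASK_ABORTED"],
);
console.log(loopResult);
```

Nothing in the loop injects the signal into the prompt; the prompt author still has to ask the agent to emit it, as story 23 requires.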

## Implementation Decisions

### Modules

The library is decomposed into the following modules; each has a clear in/out and is documented in `CONTEXT.md` vocabulary.

- **Branch strategy resolver** — Pure function: `(branchStrategy, providerType, hostBranch) → worktree plan`. No I/O. Encodes the compatibility matrix (e.g. isolated + head is rejected).
- **Worktree manager** — Owns `.sanddune/worktrees/` lifecycle: create, lock, detect dirty state via `git status --porcelain`, preserve-or-remove on close, perform `copyToWorktree`. The only module that calls `git worktree`.
- **Sandbox provider abstraction** — `createBindMountSandboxProvider` / `createIsolatedSandboxProvider` factories returning the sandbox handle contract: `exec` (with optional `onLine` streaming), `copyFileIn` (bind only) / `copyIn` (isolated), `copyFileOut`, `close`, `worktreePath`. Every `exec` returns `{ stdout, stderr, exitCode }`.
- **Agent provider abstraction** — `claudeCode`, `codex`, `opencode`, `pi`. Each declares its required env keys at the type level, builds the per-iteration command, parses streamed stdout into `text` / `toolCall` events, and (for Claude) extracts `sessionId` and `usage` from the session record.
- **Agent invoker** — Effect `Context.Tag` that wraps the agent call for one iteration. The seam tests substitute with scripted fakes — production code never reaches the real subprocess in unit tests.
- **Iteration loop** — Drives up to `maxIterations` calls through the agent invoker, accumulating `IterationResult[]`. Substring-matches `completionSignal` (string or string[]) against the merged stream and exits early on first match. Threads idle timeout as a synthesized abort with sanddune-defined reason; the same handle stays usable after timeout.
- **Prompt pipeline** — Three stages: (1) resolution (inline string vs `promptFile`); (2) host-side `{{KEY}}` substitution against `promptArgs` ∪ built-ins; (3) sandbox-side `` !`command` `` expansion (parallel) inside the sandbox after `sandbox.onSandboxReady`. Inline prompts skip stages 2 and 3 entirely. `promptArgs` with an inline prompt errors. Built-ins (`SOURCE_BRANCH`, `TARGET_BRANCH`) cannot be overridden. Missing keys error in `run()` and prompt the user in `interactive()`. Unused keys warn.
- **Hook runner** — Runs `host.onWorktreeReady` sequentially after `copyToWorktree`, then runs `host.onSandboxReady` ∥ `sandbox.onSandboxReady` in parallel after sandbox creation. Threads abort signal; non-zero exit fails fast; per-hook `timeoutMs` (default 60_000).
- **Env var resolver** — Layers four sources in order: `process.env` → `.sanddune/.env` → agent provider `env` ∪ sandbox provider `env` (must be disjoint, throws on overlap) → `RunOptions.env` (free to overlap, last-write-wins).
- **Session capture** — Claude-only. After each iteration, transfers session JSONL from sandbox to host at `~/.claude/projects/<encoded-path>/sessions/<id>.jsonl`, rewriting `cwd` fields to the host repo root. For `resumeSession`, the reverse: validates host file exists, transfers in with `cwd` rewritten to the sandbox path, passes `--resume <id>` on iteration 1 only. Failure is logged but does not fail the run; `IterationResult.sessionFilePath` is left `undefined`.
- **Logging engine** — Two display modes: log-to-file (default, writes to `.sanddune/logs/`, prints `tail -f`) and terminal (spinners, styled summaries). Both invoke `onAgentStreamEvent` callback on each agent stream event with `{ iteration, timestamp, ... }`. Callback is sync, fire-and-forget, errors swallowed.
- **Public API surface** — `run()`, `createSandbox()`, `createWorktree()`, `interactive()`. Layered ownership: `createSandboxFromWorktree` is an internal helper shared between top-level `createSandbox()` (which owns worktree + sandbox) and `wt.createSandbox()` (sandbox only — worktree owned by parent `Worktree`). Both return the same `Sandbox` type; the ownership contract is documented, not type-encoded.
- **`init` CLI** — Interactive prompts for agent / backlog manager / template, performs **template argument substitution** on Dockerfile and scaffold `.md` files (e.g. `{{BACKLOG_MANAGER_TOOLS}}`), and builds the image. Refuses to run if `.sanddune/` already exists.
- **`build-image` / `remove-image` CLI** — Provider-namespaced (`sanddune docker build-image`, `sanddune podman build-image`, etc.). `--image-name` defaults to `sanddune:<repo-dir-name>`.
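
Stage 2 of the prompt pipeline (host-side `{{KEY}}` substitution) is a pure transformation, so it can be sketched in isolation. The function below is an illustration under assumptions, not the library's code: `substitutePromptArgs` and `SubstitutionResult` are hypothetical names, keys are assumed to be plain word characters, and the sketch covers only the AFK (`run()`) behavior where a missing key is an error. Stage 3 shell expansion, and keeping `` !`...` `` inside argument values inert, are out of scope here.

```typescript
// Hypothetical sketch of stage 2 (host-side `{{KEY}}` substitution); names are illustrative.
const BUILT_IN_KEYS = ["SOURCE_BRANCH", "TARGET_BRANCH"];

interface SubstitutionResult {
  text: string;
  unusedKeys: string[]; // the caller would log a warning for these
}

function substitutePromptArgs(
  template: string,
  promptArgs: Record<string, string>,
  builtIns: Record<string, string>,
): SubstitutionResult {
  // Built-ins are authoritative, so a caller-supplied duplicate is an error.
  for (const key of BUILT_IN_KEYS) {
    if (Object.prototype.hasOwnProperty.call(promptArgs, key)) {
      throw new Error(`promptArgs must not override built-in key: ${key}`);
    }
  }
  const values: Record<string, string> = { ...promptArgs, ...builtIns };
  const used = new Set<string>();
  // Assumption: keys are word characters only, with no whitespace inside the braces.
  const text = template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(values, key) ? values[key] : undefined;
    if (value === undefined) {
      // AFK behavior: name the missing key so typos surface immediately.
      throw new Error(`Missing prompt argument: ${key}`);
    }
    used.add(key);
    return value;
  });
  const unusedKeys = Object.keys(promptArgs).filter((key) => !used.has(key));
  return { text, unusedKeys };
}

// Usage: one caller-supplied key is used, one is not, and a built-in is injected.
const rendered = substitutePromptArgs(
  "Fix issue {{ISSUE_NUMBER}} on {{SOURCE_BRANCH}}",
  { ISSUE_NUMBER: "42", EXTRA: "unused" },
  { SOURCE_BRANCH: "feature/login", TARGET_BRANCH: "main" },
);
console.log(rendered.text, rendered.unusedKeys);
```

Because `String.prototype.replace` does not rescan inserted text, a value that itself contains `{{...}}` is not substituted a second time.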

### Public API surface

| Entry point        | Returns                  | Branch strategies allowed                        | Sandbox providers allowed                |
| ------------------ | ------------------------ | ------------------------------------------------ | ---------------------------------------- |
| `run()`            | `RunResult`              | per provider default + explicit                  | bind-mount, isolated (no `noSandbox()`)  |
| `createSandbox()`  | `Sandbox`                | implicit `branch` (single-branch by construction) | bind-mount, isolated (no `noSandbox()`) |
| `createWorktree()` | `Worktree`               | `branch`, `merge-to-head` (no `head`)            | n/a (sandbox passed to `wt.run()`)       |
| `interactive()`    | `InteractiveResult`      | provider default only (no per-call override)     | all three including `noSandbox()`        |

### Branch strategy compatibility matrix

| Strategy        | Bind-mount | Isolated  | No-sandbox |
| --------------- | ---------- | --------- | ---------- |
| `head`          | Default    | Rejected  | Default    |
| `merge-to-head` | Allowed    | Default   | Allowed    |
| `branch`        | Allowed    | Allowed   | Allowed    |

Rejection is at the *type level* where possible (isolated + head; `createWorktree()` + head; `noSandbox()` + `run()`).
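
The matrix can also be read as a small pure function. The sketch below is a runtime illustration only: `ProviderType`, `BranchStrategy`, and `resolveBranchStrategy` are hypothetical names, the `branch` field on the branch variant is an assumption, and the function returns the chosen strategy instead of the full worktree plan the resolver module produces.

```typescript
// Hypothetical sketch of the compatibility matrix as a pure function.
type ProviderType = "bind-mount" | "isolated" | "no-sandbox";

type BranchStrategy =
  | { type: "head" }
  | { type: "merge-to-head" }
  | { type: "branch"; branch: string }; // field name `branch` is an assumption

// The "Default" cells of the matrix, one per provider flavor.
const DEFAULT_STRATEGY: Record<ProviderType, BranchStrategy> = {
  "bind-mount": { type: "head" },
  isolated: { type: "merge-to-head" },
  "no-sandbox": { type: "head" },
};

function resolveBranchStrategy(
  providerType: ProviderType,
  requested?: BranchStrategy,
): BranchStrategy {
  const strategy = requested ?? DEFAULT_STRATEGY[providerType];
  // The only "Rejected" cell: an isolated sandbox cannot work directly on the host's HEAD.
  if (strategy.type === "head" && providerType === "isolated") {
    throw new Error("branch strategy 'head' is not supported on isolated providers");
  }
  return strategy;
}

// Usage: an isolated provider with no explicit strategy falls back to merge-to-head.
console.log(resolveBranchStrategy("isolated").type);
```

A runtime check like this complements the type-level rejection; it catches configurations that reach the resolver through untyped callers.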

### Result contracts

`RunResult` — `iterations: IterationResult[]`, `completionSignal?: string`, `stdout: string`, `commits: { sha }[]`, `branch: string`, `logFilePath?: string` (file mode only).

`IterationResult` — `sessionId?: string`, `sessionFilePath?: string`, `usage?: IterationUsage` (`inputTokens`, `cacheCreationInputTokens`, `cacheReadInputTokens`, `outputTokens`).

`CloseResult` — `preservedWorktreePath?: string` (set only when worktree was dirty).
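
Written out as TypeScript, the contracts above might look like the following sketch. The field names come from this section; the concrete types (`number` for token counts, `string` for `sha`) are assumptions, and `totalUsage` is a hypothetical helper showing how a caller could aggregate per-iteration usage while respecting that `usage` may be `undefined`.

```typescript
// Sketch of the result contracts; concrete field types are assumptions.
interface IterationUsage {
  inputTokens: number;
  cacheCreationInputTokens: number;
  cacheReadInputTokens: number;
  outputTokens: number;
}

interface IterationResult {
  sessionId?: string;
  sessionFilePath?: string;
  usage?: IterationUsage;
}

interface RunResult {
  iterations: IterationResult[];
  completionSignal?: string;
  stdout: string;
  commits: { sha: string }[];
  branch: string;
  logFilePath?: string; // file mode only
}

// Hypothetical helper: sum usage across iterations that reported it.
// Returns undefined when no iteration carried usage, keeping the absence unambiguous.
function totalUsage(result: RunResult): IterationUsage | undefined {
  let total: IterationUsage | undefined;
  for (const iteration of result.iterations) {
    const usage = iteration.usage;
    if (usage === undefined) continue;
    total = {
      inputTokens: (total?.inputTokens ?? 0) + usage.inputTokens,
      cacheCreationInputTokens:
        (total?.cacheCreationInputTokens ?? 0) + usage.cacheCreationInputTokens,
      cacheReadInputTokens: (total?.cacheReadInputTokens ?? 0) + usage.cacheReadInputTokens,
      outputTokens: (total?.outputTokens ?? 0) + usage.outputTokens,
    };
  }
  return total;
}
```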

### Path resolution rule

- `cwd`, `promptFile` → resolve relative to `process.cwd()` (caller's perspective).
- `copyToWorktree` items → resolve relative to `cwd` (target repo's perspective).

This split is non-obvious and is documented in `CONTEXT.md`.
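
A short sketch makes the two resolution bases concrete. It assumes `copyToWorktree` is a list of path strings and takes the process working directory as a parameter so the logic stays pure; `resolveRunPaths` and `PathOptions` are hypothetical names, not part of the documented API.

```typescript
import * as path from "node:path";

// Hypothetical option shape; `copyToWorktree` as a string list is an assumption.
interface PathOptions {
  cwd?: string;
  promptFile?: string;
  copyToWorktree?: string[];
}

interface ResolvedPaths {
  cwd: string;
  promptFile: string | undefined;
  copyToWorktree: string[];
}

function resolveRunPaths(options: PathOptions, processCwd: string): ResolvedPaths {
  // `cwd` and `promptFile` are resolved from the caller's perspective.
  const cwd = path.resolve(processCwd, options.cwd ?? ".");
  const promptFile =
    options.promptFile === undefined ? undefined : path.resolve(processCwd, options.promptFile);
  // `copyToWorktree` items are resolved from the target repo's perspective.
  const copyToWorktree = (options.copyToWorktree ?? []).map((item) => path.resolve(cwd, item));
  return { cwd, promptFile, copyToWorktree };
}

// Usage: a script in /work/scripts drives an agent against the sibling repo /work/repo.
console.log(
  resolveRunPaths(
    { cwd: "../repo", promptFile: "./prompt.md", copyToWorktree: ["node_modules"] },
    "/work/scripts",
  ),
);
```

In this example the prompt file is looked up next to the script, while `node_modules` is looked up inside the target repo.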

### Aborted runs and reusability

When `signal` fires mid-iteration: the agent subprocess is killed, the call rejects with `signal.reason` verbatim, the worktree is left in whatever state the killed agent produced (no rollback), and the `Sandbox` handle remains usable. Idle timeout uses the same mechanism with a sanddune-defined reason.
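
The "reject with `signal.reason` verbatim" rule can be illustrated with a small wrapper around any pending operation. This is a sketch under assumptions: `runWithAbort` is a hypothetical helper, and it only models the rejection; killing the agent subprocess and keeping the `Sandbox` handle usable are not shown.

```typescript
// Hypothetical sketch: settle with the work's outcome, or reject with `signal.reason`
// verbatim if the signal fires first. Subprocess cleanup is intentionally omitted.
function runWithAbort<T>(work: Promise<T>, signal: AbortSignal): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    if (signal.aborted) {
      // Already cancelled before starting: propagate the original reason untouched.
      reject(signal.reason);
      return;
    }
    const onAbort = (): void => reject(signal.reason);
    signal.addEventListener("abort", onAbort, { once: true });
    work.then(
      (value) => {
        signal.removeEventListener("abort", onAbort);
        resolve(value);
      },
      (error) => {
        signal.removeEventListener("abort", onAbort);
        reject(error);
      },
    );
  });
}

// Usage: the caller sees exactly the object it passed to `abort()`.
const abortController = new AbortController();
const cancelReason = new Error("cancelled by supervisor");
runWithAbort(new Promise<string>(() => {}), abortController.signal).catch((error) => {
  console.log(error === cancelReason);
});
abortController.abort(cancelReason);
```

An idle timeout could reuse the same path by aborting an internal controller with a library-defined reason, which matches the "same mechanism" wording above.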

### Resume semantics

`resumeSession` is a top-level `run()` concern only — it's about starting a fresh sandbox from a prior session. Long-lived sandboxes don't chain Claude session state across `sandbox.run()` calls; `sandbox.run()` rejects `resumeSession`. Resume + `maxIterations > 1` throws before sandbox creation.
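
These rules amount to a guard that runs before any sandbox is created. The sketch below is illustrative: `validateResume`, `ResumeOptions`, and the injected `sessionFileExists` check are hypothetical, and treating an unset `maxIterations` as a single iteration is an assumption.

```typescript
// Hypothetical pre-flight guard for `resumeSession`; names are illustrative.
interface ResumeOptions {
  resumeSession?: string;
  maxIterations?: number;
}

type EntryPoint = "run" | "sandbox.run";

function validateResume(
  entryPoint: EntryPoint,
  options: ResumeOptions,
  sessionFileExists: (sessionId: string) => boolean,
): void {
  const sessionId = options.resumeSession;
  if (sessionId === undefined) return; // nothing to validate

  // Long-lived sandboxes do not chain session state across calls.
  if (entryPoint === "sandbox.run") {
    throw new Error("resumeSession is only supported on top-level run()");
  }
  // Resume applies to iteration 1 only.
  // Assumption: an unset maxIterations means a single iteration.
  if ((options.maxIterations ?? 1) > 1) {
    throw new Error("resumeSession cannot be combined with maxIterations > 1");
  }
  // The host-side session file must exist before it can be transferred in.
  if (!sessionFileExists(sessionId)) {
    throw new Error(`session file not found for session: ${sessionId}`);
  }
}
```

Injecting the existence check keeps the guard testable without touching the real sessions directory.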

### Capture is best-effort

Capture failure logs a warning, leaves `sessionFilePath` undefined, and does **not** fail the run. Callers requiring a captured session must check `sessionFilePath` themselves.
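
A minimal sketch of the two behaviors involved in capture: rewriting `cwd` in a session transcript, and swallowing capture errors. Both functions are hypothetical. The rewrite assumes each non-empty JSONL line is a JSON object with `cwd` as a top-level string field, which may not match the real session format.

```typescript
// Hypothetical sketch: rewrite top-level `cwd` fields in a JSONL session transcript.
// Assumption: each non-empty line is a JSON object and `cwd` is a top-level string.
function rewriteSessionCwd(jsonl: string, newCwd: string): string {
  return jsonl
    .split("\n")
    .map((line) => {
      if (line.trim() === "") return line; // keep blank lines and the trailing newline
      const record = JSON.parse(line) as Record<string, unknown>;
      if (typeof record["cwd"] === "string") {
        record["cwd"] = newCwd;
      }
      return JSON.stringify(record);
    })
    .join("\n");
}

// Best-effort wrapper: a failing capture yields `undefined` plus a warning, never a throw.
function captureBestEffort(
  capture: () => string, // returns the host path of the captured session file
  warn: (message: string) => void,
): string | undefined {
  try {
    return capture();
  } catch (error) {
    const detail = error instanceof Error ? error.message : String(error);
    warn(`session capture failed: ${detail}`);
    return undefined;
  }
}

// Usage: a disk-full error becomes a warning and an undefined sessionFilePath.
const sessionFilePath = captureBestEffort(
  () => {
    throw new Error("ENOSPC: no space left on device");
  },
  (message) => console.warn(message),
);
console.log(sessionFilePath);
```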

### Provider env disjoint rule

Agent and sandbox provider `env` maps must be disjoint — overlap throws at launch because neither provider has authority over a shared key. `RunOptions.env` is the call-site escape hatch and is allowed to overlap.
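
The four-source layering and the disjointness check fit in one pure function. The sketch below is an illustration with hypothetical names (`resolveEnv`, `EnvSources`); it takes `process.env` and the parsed `.sanddune/.env` as plain maps and does not model which keys are ultimately forwarded into the sandbox.

```typescript
// Hypothetical sketch of the env var resolver; names are illustrative.
type Env = Record<string, string>;

interface EnvSources {
  processEnv: Env; // snapshot of process.env
  dotEnvFile: Env; // parsed contents of `.sanddune/.env`
  agentProviderEnv: Env;
  sandboxProviderEnv: Env;
  runOptionsEnv: Env; // `RunOptions.env`, the call-site escape hatch
}

function resolveEnv(sources: EnvSources): Env {
  // Provider env maps must be disjoint: neither provider owns a shared key.
  const overlap = Object.keys(sources.agentProviderEnv).filter((key) =>
    Object.prototype.hasOwnProperty.call(sources.sandboxProviderEnv, key),
  );
  if (overlap.length > 0) {
    throw new Error(`agent and sandbox provider env overlap on: ${overlap.join(", ")}`);
  }
  // Later spreads win, so the order below is lowest to highest precedence.
  return {
    ...sources.processEnv,
    ...sources.dotEnvFile,
    ...sources.agentProviderEnv,
    ...sources.sandboxProviderEnv,
    ...sources.runOptionsEnv,
  };
}

// Usage: the call-site value beats the provider value, which beats ambient env.
console.log(
  resolveEnv({
    processEnv: { API_KEY: "ambient", HOME: "/home/dev" },
    dotEnvFile: { API_KEY: "from-dotenv" },
    agentProviderEnv: { API_KEY: "from-provider" },
    sandboxProviderEnv: { REGISTRY_TOKEN: "sandbox-token" },
    runOptionsEnv: { API_KEY: "from-call-site" },
  }),
);
```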

### Custom-provider DX

Custom sandbox providers implement one of two factories. The bind-mount path is the simpler one (sanddune handles worktrees and commit extraction). Isolated providers implement `copyIn` (file-or-dir) and `copyFileOut`. `name` is required for telemetry. Reference implementations: `src/sandboxes/docker.ts`, `src/sandboxes/podman.ts`, `src/sandboxes/vercel.ts`, `src/sandboxes/test-isolated.ts`.

### CLI surface

| Command                                                | Purpose                                                    |
| ------------------------------------------------------ | ---------------------------------------------------------- |
| `sanddune init`                                        | Scaffold `.sanddune/`, build image, refuse if dir exists   |
| `sanddune docker build-image` / `podman build-image`   | Rebuild image; supports `--dockerfile` / `--containerfile` |
| `sanddune docker remove-image` / `podman remove-image` | Remove image                                               |

### ADRs already in scope

The library inherits 12 ratified architectural decisions in `docs/adr/`: per-step timeouts (0001), `cwd` option (0002), reuse-worktree-by-default (0003), abort-signal on `run()`/`interactive()` (0004), remove chown UID alignment (0005a), usage as raw tokens (0005b), git worktree mounts on Windows (0006), worktree locking (0007), inline prompts skip processing (0008), branch-strategy per call (0009), layered sandbox creation (0010), sandbox-survives-abort (0011). Implementation must respect these.

## Testing Decisions

### What makes a good test

- **Test external behavior, not implementation details.** A test that breaks when an internal helper is renamed is over-fitted; a test that breaks when the public contract changes is exactly right.
- **Prefer the agent invoker seam.** Unit tests that exercise the iteration loop, completion-signal matching, idle timeout, and stream forwarding should swap the real agent provider for a scripted fake via the `Context.Tag` seam — never spawn a real agent subprocess in unit tests.
- **Prefer the isolated sandbox seam.** Tests that exercise prompt pipelines, hooks, and capture should use the `test-isolated.ts` provider (a temp-directory-backed isolated provider) so the same code path is exercised without Docker or Podman.
- **Spawn real Docker/Podman only in integration tests.** Mark these so they can be opted out in CI when the runtime is missing.
- **Separate pure-logic tests from I/O tests.** The branch strategy resolver, prompt pipeline (substitution stage), and env var resolver are pure functions and should have unit tests with table-driven cases.
- **Test error paths as deliberately as success paths.** Missing `{{KEY}}`, overlapping provider env, `head` on isolated, `resumeSession` with `maxIterations > 1`, hook timeout, capture failure during a successful run — each is a distinct contract and gets a test.

### Modules under unit test

All seven deep modules get focused unit tests:

1. **Branch strategy resolver** — Table-driven cases: each (strategy × provider type × host branch) combination, including the rejected ones (isolated + head).
2. **Worktree manager** — Real git in a temp repo: create/lock/dirty-detect/preserve-or-remove paths; `copyToWorktree` with relative-to-`cwd` resolution; concurrent-creation lock contention (per ADR 0007).
3. **Prompt pipeline** — Inline bypass, `{{KEY}}` substitution (host-side, before expansion), built-in injection, missing-key error, unused-key warning, `promptArgs` + inline error, `` !` `` inside `promptArgs` value treated as inert text. Stage 3 (shell expansion) tested against a fake sandbox handle.
4. **Iteration loop** — Scripted-stream fakes via the agent invoker: completion signal substring match (single + array, first-match-wins), `maxIterations` bound, idle timeout fires after silence and resets on output, abort threading.
5. **Env var resolver** — Layering precedence across the four sources; disjoint enforcement throws; `RunOptions.env` overlap allowed.
6. **Session capture** — Fake isolated sandbox + temp host directory: JSONL transfer + `cwd` rewrite (host-side and sandbox-side), `--resume` validation (`maxIterations > 1` throws, missing host file errors), best-effort failure (capture error → warning + run still succeeds).
7. **Hook runner** — Sequential host hooks (ordering preserved), parallel host/sandbox `onSandboxReady` (both started, neither blocks the other), per-hook timeout, signal threading cancels in-flight commands, non-zero exit fails fast.
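To make the prompt pipeline's substitution contract (item 3) concrete, the sketch below shows one way the `{{KEY}}` stage could behave as a pure function: a missing key throws, and an unused key yields a warning. The function name `substitutePromptArgs`, the result shape, and the assumption that keys consist of word characters are all hypothetical. Inline bypass, built-in injection, and shell expansion are not covered here.

```typescript
// Hypothetical function and result shape; only the substitution stage is sketched.
interface SubstitutionResult {
  prompt: string;
  warnings: string[];
}

function substitutePromptArgs(
  template: string,
  promptArgs: Record<string, string>,
): SubstitutionResult {
  const used = new Set<string>();
  // Assumption: a key is a run of word characters between double braces.
  const prompt = template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(promptArgs, key)
      ? promptArgs[key]
      : undefined;
    if (value === undefined) {
      throw new Error(`Missing prompt argument: {{${key}}}`);
    }
    used.add(key);
    // Returned from a replacer function, the value is inserted verbatim and
    // is not rescanned for further placeholders.
    return value;
  });
  // Any supplied key that never appeared in the template becomes a warning.
  const warnings = Object.keys(promptArgs)
    .filter((key) => !used.has(key))
    .map((key) => `Unused prompt argument: ${key}`);
  return { prompt, warnings };
}
```

Because the function touches neither the filesystem nor a sandbox, each contract (missing key, unused key, verbatim insertion) can be covered by a single-line table row.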

### Integration tests

- **End-to-end against `test-isolated.ts`** — `run()`, `createSandbox()` reuse, `createWorktree()` + `wt.run()` + `wt.createSandbox()`, `interactive()` (with a scripted "TUI" fake). Validates the public API surface and the layered ownership contract for `close()`.
- **Real Docker** — A small smoke suite that spawns the actual Docker provider on a fixture repo and confirms a commit lands; gated behind a CI flag so contributors without Docker can skip.

### Prior art

This is a green-field repo, so there is no prior art inside it yet. Reference patterns from the broader ecosystem:

- **Effect `Context.Tag` seams** — the canonical Effect pattern for swapping a service in tests; the agent invoker uses it.
- **Vitest + temp dirs for git** — standard Node test pattern; the worktree manager uses it.
- **Scripted-stream fakes** — generator-based fakes that yield text/toolCall events deterministically; the iteration loop uses these.
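A scripted-stream fake of the kind described in the last bullet might look like the following sketch. It uses synchronous generators for simplicity, whereas a real agent stream would likely be asynchronous. The event shape, `scriptedInvoker`, and `runIterationLoop` are hypothetical names. The loop makes three further assumptions: output is accumulated per iteration, the loop stops as soon as a signal appears, and "first-match-wins" means the earliest signal in array order that is present in the output.

```typescript
// Hypothetical event and invoker shapes; sanddune's real types may differ.
type AgentEvent =
  | { type: "text"; text: string }
  | { type: "toolCall"; name: string };

type AgentInvoker = (iteration: number) => Iterable<AgentEvent>;

// Builds a fake invoker that replays one scripted event list per iteration.
function scriptedInvoker(scripts: AgentEvent[][]): AgentInvoker {
  return function* (iteration: number) {
    for (const event of scripts[iteration] ?? []) {
      yield event;
    }
  };
}

interface LoopResult {
  iterations: number;
  matchedSignal: string | undefined;
}

// Simplified loop: stops on the first completion signal found as a substring
// of the current iteration's text output, or after maxIterations.
function runIterationLoop(
  invoke: AgentInvoker,
  maxIterations: number,
  completionSignal: string | string[],
): LoopResult {
  const signals = Array.isArray(completionSignal) ? completionSignal : [completionSignal];
  for (let i = 0; i < maxIterations; i++) {
    let output = "";
    for (const event of invoke(i)) {
      if (event.type !== "text") continue;
      output += event.text;
      const matched = signals.find((signal) => output.includes(signal));
      if (matched !== undefined) {
        return { iterations: i + 1, matchedSignal: matched };
      }
    }
  }
  return { iterations: maxIterations, matchedSignal: undefined };
}
```

Since the fake is deterministic, a test can assert the exact iteration on which the loop stops without spawning an agent subprocess.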

## Out of Scope

- **Implementing isolated sandbox providers beyond `test-isolated.ts` and `vercel.ts`.** The type system supports them from day one, but other isolated providers (e.g. Fly.io, gVisor, Kata) are user-built or follow-on work.
- **Bundle/patch sync for isolated providers.** Mentioned in `CONTEXT.md` as a future option but not part of v1.
- **Built-in observability backends.** sanddune ships the `onAgentStreamEvent` hook; integrations with Datadog, OpenTelemetry, etc. are downstream.
- **Built-in retry / failure-recovery policies.** `sandbox.run()` is reusable after abort, but sanddune does not roll back partial edits or commits — callers retry from a clean slate themselves.
- **Multi-agent coordination within a single iteration.** One iteration = one agent invocation. Pipelines (implement → review → revise) compose at the `run()` / `sandbox.run()` level.
- **Web UI / dashboard.** sanddune is a library + CLI. Dashboards are downstream products.
- **Non-git VCS.** Mercurial, jj, etc. are out of scope. Worktrees, branches, commits, and `git status --porcelain` are assumed throughout.
- **Windows support beyond what ADR 0006 specifies.** WSL is the assumed Windows path; native Windows is best-effort.
- **Built-in templates beyond the five listed.** `blank`, `simple-loop`, `sequential-reviewer`, `parallel-planner`, `parallel-planner-with-review` ship; everything else is user-authored.
- **Backlog managers beyond GitHub Issues and Beads.** Other tracker integrations are out of scope for v1; the **backlog manager** abstraction is open for extension but only two implementations ship.
- **Server-side / hosted sanddune.** Library and CLI only; no daemon, no SaaS.

## Further Notes

- **Domain vocabulary is authoritative.** All implementation, code review, docs, and PRs use the terms defined in `CONTEXT.md` (sanddune, sandbox, host, agent, sandbox provider, branch strategy, worktree, source/target branch, agent invoker, iteration, task, completion signal, prompt template, prompt argument, prompt expansion, shell expression, etc.). Avoid retired terms ("workspace", "worktree mode", "the tool").
- **ADRs guard prior decisions.** When implementing, re-read the ADRs in `docs/adr/` rather than re-debating settled questions. New architectural choices that don't fit an existing ADR get a new ADR before implementation.
- **Effect is in the stack.** The agent invoker is an Effect `Context.Tag`. The codebase will follow Effect conventions for service registration, error channels, and resource management; introduce new services as `Context.Tag`s where the swap-in-tests benefit applies.
- **`tsgo` for builds, `vitest` for tests, `npm run typecheck` for type-checking.** Per `Claude.md`. CI mirrors these.
- **Changesets for user-facing changes.** Pre-1.0, all changesets are `patch`. Use `package.json#name` as the changeset name.
- **README and CONTEXT.md drift is a real risk.** The brief, the README, and `CONTEXT.md` overlap heavily. When changing public-facing behavior, update the README and `CONTEXT.md` together; don't let one outpace the other.
- **This PRD covers v1 surface area, not the full implementation order.** The decomposition into deep modules is the seam for sequencing follow-on issues; expect this PRD to be sliced into many issues (one per module + integration tests + CLI commands) by the `to-issues` skill.

