PRD: sanddune v1 — orchestrate AI coding agents in isolated sandboxes #1

@pyadav

Problem Statement

I want to run AI coding agents on many issues at once, without trashing my working tree, without supervising each one, and without writing the same orchestration glue in every project. Today I either:

  • Run an agent live in my repo and babysit it (no parallelism, no isolation).
  • Hand-roll a Docker setup per project, copy-paste prompt loops, and discover at 2am that one agent stomped another's branch.
  • Ship the agent into a remote VM and lose the ability to ergonomically inspect commits on my host.

I need a primitive that takes "a prompt, an agent, a sandbox" and reliably returns "a commit on a branch" — so I can build review pipelines, parallel planners, and AFK loops on top.

Solution

A TypeScript library, sanddune, that orchestrates agents inside sandboxes. The user writes a prompt, calls sanddune.run(), and gets a commit on a branch. The library is provider-agnostic on both axes:

  • Sandbox providers — Docker, Podman, Vercel Firecracker microVMs, no-sandbox, or a custom provider implementing a single interface. Two flavors: bind-mount (host directory mounted in) and isolated (sandbox has its own filesystem; sanddune syncs in/out).
  • Agent providers — Claude Code, Codex, opencode, pi, or a custom provider.

A run goes through three phases: setup (worktree + sandbox + hooks), agent loop (invoke, stream, repeat up to maxIterations or until a completionSignal fires), teardown (capture commits, tear down sandbox).

Four programmatic entry points expose increasing levels of control: run() (one-shot), createSandbox() (reusable sandbox on one branch), createWorktree() (worktree as independent lifecycle), interactive() (TUI session, sync). A CLI (sanddune init, sanddune docker build-image, etc.) handles scaffolding and image lifecycle.

User Stories

  1. As a developer, I want to install sanddune as a dev dependency and scaffold a sandbox config with one command, so that I can start running agents in the same afternoon I learned the library exists.
  2. As a developer, I want sanddune init to refuse to overwrite an existing .sanddune/ directory, so that I never lose customizations to my Dockerfile or prompt.
  3. As a developer, I want to pick my agent (Claude Code / Codex / opencode / pi) and template (blank, simple-loop, sequential-reviewer, parallel-planner, parallel-planner-with-review) interactively during init, so that I can scaffold the right starter without remembering flag syntax.
  4. As a developer, I want sanddune init to scaffold main.mts when my package.json lacks "type": "module" and main.ts when it has it, so that the scaffold runs without ESM/CJS friction.
  5. As a developer, I want a prompt.md file scaffolded into .sanddune/ as a convention (not a magic path sanddune reads on its own), so that I can rename it, ignore it, or reference it explicitly via promptFile.
  6. As a developer, I want to call run() with agent, sandbox, and prompt and have it commit to my repo and tear the sandbox down, so that I can run an agent in three lines of code.
  7. As a developer, I want bind-mount providers to default to the head branch strategy, so that the fastest path during local development is also the default.
  8. As an automation engineer, I want merge-to-head as the default for isolated providers, so that AFK runs can never trash my working tree on failure.
  9. As a developer, I want to opt into the branch strategy with an explicit branch name, so that I can wire sanddune into a PR-creation pipeline.
  10. As a developer, I want head to be a compile-time error on isolated providers, so that I can't ship an impossible config.
  11. As a developer, I want head to be a compile-time error on createWorktree(), so that the type system rules out worktree-less worktrees.
  12. As a developer, I want to import sandbox providers from explicit subpaths (e.g. @missingstudio/sanddune/sandboxes/docker), so that my bundle doesn't pull in every provider's dependencies.
  13. As a developer, I want noSandbox() to be accepted only by interactive() and wt.interactive(), so that I can't accidentally launch an unsupervised AFK run on the host.
  14. As a developer, I want noSandbox() to not pass --dangerously-skip-permissions to the agent, so that Claude Code's normal permission prompts stay active during my interactive sessions.
  15. As a developer, I want to write a custom sandbox provider by implementing one factory (createBindMountSandboxProvider or createIsolatedSandboxProvider), so that I can target any container runtime, VM, or sandbox service my org standardizes on.
  16. As a developer, I want every exec call to return { stdout, stderr, exitCode }, so that custom providers have no ambiguity about their contract.
  17. As a developer, I want streaming output via an optional onLine callback on exec, so that my custom provider can plug into sanddune's logging without buffering the entire stream.
  18. As an automation engineer, I want to call run() once and get back result.iterations, result.commits, result.branch, and result.completionSignal, so that my downstream code can decide what to do without re-parsing logs.
  19. As a developer, I want a default completionSignal of <promise>COMPLETE</promise>, so that my prompts and the library agree on a convention out of the box.
  20. As a developer, I want to override completionSignal per run with a string or a list of strings, so that I can model multiple terminal states (e.g. TASK_COMPLETE vs TASK_ABORTED).
  21. As a developer, I want substring-based detection of completion signals against the agent's output stream, so that I don't need a structured protocol — any unique sentinel works.
  22. As a developer, I want result.completionSignal to tell me which signal matched (or undefined if none), so that I can branch on the agent's terminal state.
  23. As a developer, I want sanddune to never inject the completion signal into my prompt, so that the convention stays a convention I write into my prompt deliberately.
  24. As an automation engineer, I want maxIterations to bound any agent loop, so that runaway agents don't burn budget unbounded.
  25. As a developer, I want idleTimeoutSeconds (default 600) to abort the run if the agent goes silent, so that hung subprocesses don't keep the sandbox alive forever.
  26. As a developer, I want idleTimeoutSeconds to reset on every agent output event, so that long-but-active iterations aren't killed prematurely.
  27. As a developer, I want to pass an AbortSignal to run() and interactive(), so that I can cancel from a CLI signal handler or a parent supervisor.
  28. As a developer, I want aborting to kill the agent subprocess and any in-flight hooks immediately, so that cancellation is responsive.
  29. As a developer, I want aborted runs to preserve the worktree on disk and reject with signal.reason verbatim, so that I can inspect partial work and propagate the cancellation cause.
  30. As a developer, I want createSandbox() to give me a reusable sandbox on a single branch, so that multi-step pipelines (implement → review → revise) don't pay container startup cost per step.
  31. As a developer, I want createSandbox() to preserve installed dependencies and build artifacts between calls, so that an npm install in step 1 doesn't repeat in step 2.
  32. As a developer, I want await using sandbox = await createSandbox(...) to auto-tear-down via Symbol.asyncDispose, so that I never leak containers when an exception escapes the block.
  33. As a developer, I want manual sandbox.close() to return { preservedWorktreePath } set when the worktree was dirty, so that I know where to look on disk after a failure.
  34. As a developer, I want clean worktrees auto-removed and dirty worktrees preserved, so that successful runs don't leave clutter and failed runs don't lose work.
  35. As a developer, I want createWorktree() to give me a worktree as a first-class lifecycle, so that I can run an interactive session, then a sandboxed AFK agent, against the same worktree.
  36. As a developer, I want createWorktree() to reject head at compile time, so that "no worktree" can't sneak past type checks.
  37. As a developer, I want wt.createSandbox() to be the conceptual primitive (with top-level createSandbox() as a bundled convenience), so that the underlying ownership model is explicit when I need it.
  38. As a developer, I want split ownership: sandbox.close() from wt.createSandbox() tears down only the container; wt.close() cleans the worktree, so that I can keep the worktree alive across multiple sandbox instances.
  39. As a developer, I want top-level createSandbox()'s sandbox.close() to tear down both container and worktree, so that the convenience case has matching convenience cleanup.
  40. As a developer, I want interactive() to launch the agent's TUI inside a sandbox or on the host, so that I can drop into a fresh shell with my repo already in place.
  41. As a developer, I want interactive() to return control synchronously when the user exits the TUI, so that orchestration code can resume cleanly.
  42. As a developer, I want top-level interactive() to always use the provider's default branch strategy, so that the simple case is unambiguous; for non-default strategies I'll route through createWorktree() + wt.interactive().
  43. As a prompt author, I want prompt: "..." to pass my string through to the agent literally, so that I never have an unexpected {{ or !` interpretation.
  44. As a prompt author, I want passing both prompt and promptFile (or promptArgs with prompt) to error, so that the contract is unambiguous.
  45. As a prompt author, I want promptFile: "./path.md" to enable substitution and shell expansion, so that I can write reusable templates with embedded context.
  46. As a prompt author, I want {{KEY}} placeholders substituted from promptArgs before shell expansion, so that I can embed args inside !`gh issue view {{ISSUE_NUMBER}}`.
  47. As a prompt author, I want a missing {{KEY}} in run() (AFK mode) to error with the missing key name, so that typos surface immediately.
  48. As a prompt author, I want a missing {{KEY}} in interactive() to prompt me at the terminal, so that I can recover without restarting the session.
  49. As a prompt author, I want unused promptArgs keys to log a warning, not fail, so that scripts that pass shared arg maps still work.
  50. As a prompt author, I want {{SOURCE_BRANCH}} and {{TARGET_BRANCH}} injected automatically into every prompt, so that branching info is one substitution away in any template.
  51. As a prompt author, I want passing SOURCE_BRANCH or TARGET_BRANCH in promptArgs to error, so that built-in arguments are unambiguously authoritative.
  52. As a prompt author, I want !`command` expressions evaluated in parallel, so that fetching multiple bits of context (issues, commits, status) doesn't serialize.
  53. As a prompt author, I want shell expressions evaluated inside the sandbox after sandbox.onSandboxReady hooks, so that they see the same repo state the agent will see.
  54. As a prompt author, I want a non-zero exit from any shell expression to fail the run immediately, so that I never hand a half-rendered prompt to the agent.
  55. As a prompt author, I want !`...` patterns inside promptArgs values treated as inert text, so that I can pass user-authored content (issue titles, PR descriptions) without command-injection risk.
  56. As a Claude Code user, I want sanddune to capture the per-iteration session JSONL to my host at ~/.claude/projects/<encoded-path>/sessions/<id>.jsonl, so that claude --resume works natively after the run.
  57. As a Claude Code user, I want cwd fields inside captured session files rewritten to my host repo root, so that resume reopens in the right directory.
  58. As a Claude Code user, I want captureSessions: false to opt out of capture, so that one-off runs don't pollute my sessions directory.
  59. As a Claude Code user, I want capture failure to log a warning and leave sessionFilePath undefined, but not fail the run, so that disk-full or permissions issues don't kill an otherwise-successful agent.
  60. As a Claude Code user, I want resumeSession: "<id>" to validate the session file exists, transfer it into the sandbox with cwd rewritten, and pass --resume <id> on iteration 1 only, so that I can continue prior work without ceremony.
  61. As a Claude Code user, I want resumeSession with maxIterations > 1 to throw before sandbox creation, so that the contract (resume only iteration 1) is enforced loudly.
  62. As a Claude Code user, I want resumeSession rejected on sandbox.run(), so that long-lived sandboxes — which don't chain Claude session state across calls — can't pretend to.
  63. As a developer using a non-Claude agent, I want captureSessions and resumeSession to be no-ops, so that my agent provider doesn't fail on Claude-specific options.
  64. As a developer, I want host.onWorktreeReady (sequential) to run after copyToWorktree and before sandbox creation, so that I can prep files (e.g. cp .env.example .env) before the container starts.
  65. As a developer, I want sandbox.onSandboxReady (parallel with host.onSandboxReady) to run after the sandbox is up, so that I can npm install inside the container without blocking host-side observability hooks.
  66. As a developer, I want host hooks to be { command, timeoutMs? } only — no sudo, no cwd — so that the surface is intentionally minimal (use cd or inline env in the command).
  67. As a developer, I want sandbox hooks to support sudo: true, so that I can apt-get install build deps inside the container.
  68. As a developer, I want hooks to default to a 60s timeout and accept timeoutMs per-hook, so that long installs (e.g. 300s for npm install) aren't cut off by the default.
  69. As a developer, I want any non-zero hook exit to fail setup immediately, so that broken setup never reaches the agent.
  70. As a developer, I want the run's signal threaded into all hooks, so that aborting cancels in-flight installs.
  71. As a developer, I want both agent provider and sandbox provider to accept an optional env: Record<string, string>, so that providers can declare their own credential needs at the type level.
  72. As a developer, I want overlap between agent-provider env and sandbox-provider env to throw at launch, so that ambiguous "who owns this key" cases surface loudly.
  73. As a developer, I want a 4-source env precedence (lowest → highest: process.env, .sanddune/.env, provider env, RunOptions.env), so that call-site overrides always win and provider declarations always beat ambient env.
  74. As a developer, I want RunOptions.env to overlap freely with provider env, so that I have a no-friction call-site escape hatch.
  75. As a developer, I want cwd and promptFile to resolve relative paths against process.cwd() (caller's perspective), so that scripts moved between directories behave predictably.
  76. As a developer, I want copyToWorktree to resolve relative paths against cwd (target repo's perspective), so that node_modules and .env.example are conceptually attached to the repo being worked on.
  77. As a developer, I want copyToWorktree rejected with branchStrategy: { type: "head" }, so that I can't accidentally try to copy into a worktree that doesn't exist.
  78. As a developer, I want name to be prefixed in log output, so that parallel runs don't visually braid in a shared log stream.
  79. As a developer, I want log-to-file mode as the default for programmatic use, so that run() doesn't try to repaint a TTY my orchestrator doesn't have.
  80. As a developer, I want terminal mode opt-in via logging: { type: "stdout" }, so that interactive use gets spinners and styled summaries.
  81. As a developer, I want sanddune to print a tail -f command for the run log, so that I can follow output without guessing the path.
  82. As a developer, I want an onAgentStreamEvent callback on the logging option to receive each text/toolCall event with iteration and timestamp, so that I can forward agent output to my observability system without re-implementing parsing.
  83. As a developer, I want errors thrown by onAgentStreamEvent swallowed, so that a broken forwarder cannot kill my run.
  84. As a developer, I want result.logFilePath populated only in file mode, so that my code can clearly check if (result.logFilePath) before referencing it.
  85. As a developer, I want IterationResult.usage (input/output/cache tokens) parsed from captured Claude sessions, so that I can budget runs and surface cost without parsing JSONL myself.
  86. As a developer, I want usage to be undefined when capture is off or the agent provider doesn't parse it, so that the absence is unambiguous.
  87. As a developer, I want iterations.length to give me the iteration count, so that I don't need a separate counter.
  88. As a developer, I want commits returned as { sha }[], so that I can build a PR description, run a review pipeline, or stage further work.
  89. As a developer, I want sandbox.run() to remain usable after an abort fires mid-iteration, so that I can call .run() again with a fresh signal or .close() to tear down — partial work is left for me to inspect with git status.
  90. As a developer, I want sanddune docker build-image (and podman build-image) to rebuild from an existing .sanddune/, so that I can iterate on my Dockerfile without re-scaffolding.
  91. As a developer, I want a --dockerfile / --containerfile flag to point at a custom file with build context = cwd, so that I can prototype image variants without touching .sanddune/.
  92. As a developer, I want sanddune docker remove-image (and podman remove-image) to tear down the image cleanly, so that I can free disk space when done.
  93. As a developer, I want sanddune init --image-name ..., --agent, --model, --template flags to skip interactive prompts, so that I can script init in CI or onboarding tooling.
  94. As a developer picking Podman during init, I want a Containerfile written instead of Dockerfile and Podman-namespaced CLI commands, so that the scaffold matches the runtime I picked.
  95. As a developer, I want the default Dockerfile to install Node 22, git, curl, jq, GitHub CLI, Claude Code CLI, and a non-root agent user, so that the basic loop works without further config.
  96. As a developer customizing the Dockerfile, I want explicit guidance to keep a non-root user, git, gh, and the Claude Code CLI on PATH, so that I don't accidentally break the contract.
  97. As a developer, I want Docker/Podman providers to accept mounts (with absolute, ~, and cwd-relative paths), so that I can mount caches like ~/.npm read-only or share data/ directories.
  98. As a developer, I want Docker/Podman providers to accept network (single name or array), so that my container can reach internal services on a private Docker network.
  99. As a developer, I want Podman support to handle SELinux labels correctly, so that bind mounts work on Fedora/RHEL hosts without manual chcon.
  100. As a Vercel user, I want vercel() to provision a Firecracker microVM via @vercel/sandbox, so that I can fan out cloud isolated runs without managing infra.
  101. As a developer, I want a documented name field on every provider for telemetry/error messages, so that "the docker provider failed" reads cleanly in logs.
  102. As a custom-provider author, I want a reference implementation list (docker.ts, podman.ts, vercel.ts, test-isolated.ts) called out in the README, so that I can copy the closest match.
  103. As a maintainer, I want isolated providers to live in the type system from day one (even if Vercel lands first), so that custom isolated providers compile against a stable shape.
  104. As a developer, I want claudeCode(model, { effort }) to accept "low" | "medium" | "high" | "max", with "max" Opus-only, so that reasoning effort is a one-line config.
  105. As a developer, I want codex(model, { effort }) to accept "low" | "medium" | "high" | "xhigh", mapped to model_reasoning_effort, so that Codex tuning matches its native API.
  106. As an automation engineer, I want timeouts: { copyToWorktreeMs } (default 60s) to override built-in lifecycle timeouts, so that large repos don't fail on the copy step.
  107. As a developer running an interactive session, I want cwd: "/path/to/other-repo" accepted on interactive(), so that I can drop into a TUI in a repo other than process.cwd().
  108. As a maintainer, I want agent invoker to be an Effect Context.Tag service that wraps the raw agent call, so that tests can substitute a recording or scripted fake without running a real agent.
  109. As a maintainer, I want every iteration to produce at most one commit by convention (the agent may emit multiple, sanddune captures all), so that callers reasoning per-iteration can rely on a stable shape.
  110. As a developer, I want sanddune to validate the prerequisites (git installed, sandbox provider available) on first use, so that misconfiguration produces an actionable error before the agent starts.

Implementation Decisions

Modules

The library is decomposed into the following modules; each has a clear in/out and is documented in CONTEXT.md vocabulary.

  • Branch strategy resolver — Pure function: (branchStrategy, providerType, hostBranch) → worktree plan. No I/O. Encodes the compatibility matrix (e.g. isolated + head is rejected).
  • Worktree manager — Owns .sanddune/worktrees/ lifecycle: create, lock, detect dirty state via git status --porcelain, preserve-or-remove on close, perform copyToWorktree. The only module that calls git worktree.
  • Sandbox provider abstraction — createBindMountSandboxProvider / createIsolatedSandboxProvider factories returning the sandbox handle contract: exec (with optional onLine streaming), copyFileIn (bind only) / copyIn (isolated), copyFileOut, close, worktreePath. Every exec returns { stdout, stderr, exitCode }.
  • Agent provider abstraction — claudeCode, codex, opencode, pi. Each declares its required env keys at the type level, builds the per-iteration command, parses streamed stdout into text / toolCall events, and (for Claude) extracts sessionId and usage from the session record.
  • Agent invoker — Effect Context.Tag that wraps the agent call for one iteration. The seam tests substitute with scripted fakes — production code never reaches the real subprocess in unit tests.
  • Iteration loop — Drives up to maxIterations calls through the agent invoker, accumulating IterationResult[]. Substring-matches completionSignal (string or string[]) against the merged stream and exits early on first match. Threads idle timeout as a synthesized abort with sanddune-defined reason; the same handle stays usable after timeout.
  • Prompt pipeline — Three stages: (1) resolution (inline string vs promptFile); (2) host-side {{KEY}} substitution against promptArgs ∪ built-ins; (3) sandbox-side !`command` expansion (parallel) inside the sandbox after sandbox.onSandboxReady. Inline prompts skip stages 2 and 3 entirely. promptArgs with an inline prompt errors. Built-ins (SOURCE_BRANCH, TARGET_BRANCH) cannot be overridden. Missing keys error in run() and prompt the user in interactive(). Unused keys warn.
  • Hook runner — Runs host.onWorktreeReady sequentially after copyToWorktree, then runs host.onSandboxReady and sandbox.onSandboxReady in parallel after sandbox creation. Threads abort signal; non-zero exit fails fast; per-hook timeoutMs (default 60_000).
  • Env var resolver — Layers four sources in order: process.env → .sanddune/.env → agent provider env ∪ sandbox provider env (must be disjoint, throws on overlap) → RunOptions.env (free to overlap, last-write-wins).
  • Session capture — Claude-only. After each iteration, transfers session JSONL from sandbox to host at ~/.claude/projects/<encoded-path>/sessions/<id>.jsonl, rewriting cwd fields to the host repo root. For resumeSession, the reverse: validates host file exists, transfers in with cwd rewritten to the sandbox path, passes --resume <id> on iteration 1 only. Failure is logged but does not fail the run; IterationResult.sessionFilePath is left undefined.
  • Logging engine — Two display modes: log-to-file (default, writes to .sanddune/logs/, prints tail -f) and terminal (spinners, styled summaries). Both invoke onAgentStreamEvent callback on each agent stream event with { iteration, timestamp, ... }. Callback is sync, fire-and-forget, errors swallowed.
  • Public API surface — run(), createSandbox(), createWorktree(), interactive(). Layered ownership: createSandboxFromWorktree is an internal helper shared between top-level createSandbox() (which owns worktree + sandbox) and wt.createSandbox() (sandbox only — worktree owned by parent Worktree). Both return the same Sandbox type; the ownership contract is documented, not type-encoded.
  • init CLI — Interactive prompts for agent / backlog manager / template, performs template argument substitution on Dockerfile and scaffold .md files (e.g. {{BACKLOG_MANAGER_TOOLS}}), and builds the image. Refuses to run if .sanddune/ already exists.
  • build-image / remove-image CLI — Provider-namespaced (sanddune docker build-image, sanddune podman build-image, etc.). --image-name defaults to sanddune:<repo-dir-name>.
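
As an illustration of stage 2 of the prompt pipeline described above, the following is a minimal sketch of host-side {{KEY}} substitution. The function name substitutePromptArgs and the SubstitutionResult shape are hypothetical, not sanddune's actual API; the sketch only assumes the rules stated in this document (built-ins cannot be overridden, missing keys error in AFK mode, unused keys are reported so the caller can warn).

```typescript
// Hypothetical sketch of host-side {{KEY}} substitution; names are illustrative only.
const BUILT_IN_KEYS = ["SOURCE_BRANCH", "TARGET_BRANCH"] as const;

interface SubstitutionResult {
  text: string;
  unusedKeys: string[]; // the caller would log these as warnings, not fail
}

function hasOwn(record: Record<string, string>, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(record, key);
}

function substitutePromptArgs(
  template: string,
  promptArgs: Record<string, string>,
  builtIns: Record<string, string>,
): SubstitutionResult {
  // Built-ins are authoritative: supplying them in promptArgs is an error.
  for (const key of BUILT_IN_KEYS) {
    if (hasOwn(promptArgs, key)) {
      throw new Error(`promptArgs must not override built-in ${key}`);
    }
  }
  const values: Record<string, string> = { ...promptArgs, ...builtIns };
  const used = new Set<string>();
  // Single pass: substituted values are not re-scanned for placeholders.
  const text = template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    if (!hasOwn(values, key)) {
      throw new Error(`Missing prompt argument: ${key}`);
    }
    used.add(key);
    return values[key] as string;
  });
  const unusedKeys = Object.keys(promptArgs).filter((key) => !used.has(key));
  return { text, unusedKeys };
}
```

In interactive() the missing-key branch would prompt the user instead of throwing; that path is omitted here.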

Public API surface

| Entry point | Returns | Branch strategies allowed | Sandbox providers allowed |
| --- | --- | --- | --- |
| run() | RunResult | per provider default + explicit | bind-mount, isolated (no noSandbox()) |
| createSandbox() | Sandbox | implicit branch (single-branch by construction) | bind-mount, isolated (no noSandbox()) |
| createWorktree() | Worktree | branch, merge-to-head (no head) | n/a (sandbox passed to wt.run()) |
| interactive() | InteractiveResult | provider default only (no per-call override) | all three including noSandbox() |

Branch strategy compatibility matrix

| Strategy | Bind-mount | Isolated | No-sandbox |
| --- | --- | --- | --- |
| head | Default | Rejected | Default |
| merge-to-head | Allowed | Default | Allowed |
| branch | Allowed | Allowed | Allowed |

Rejection is at the type level where possible (isolated + head; noSandbox() + run()).
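
The compatibility matrix can be expressed as a small pure function. This is a simplified sketch: the real resolver is described as taking (branchStrategy, providerType, hostBranch) and returning a worktree plan, whereas the hypothetical resolveStrategy below only picks the effective strategy and rejects isolated + head at runtime. The type and function names are assumptions made for illustration.

```typescript
// Hypothetical sketch of the branch strategy compatibility matrix.
type ProviderKind = "bind-mount" | "isolated" | "no-sandbox";
type StrategyType = "head" | "merge-to-head" | "branch";

// Defaults taken from the "Default" cells of the matrix.
const DEFAULT_STRATEGY: Record<ProviderKind, StrategyType> = {
  "bind-mount": "head",
  isolated: "merge-to-head",
  "no-sandbox": "head",
};

function resolveStrategy(provider: ProviderKind, requested?: StrategyType): StrategyType {
  const strategy = requested ?? DEFAULT_STRATEGY[provider];
  // The only "Rejected" cell: an isolated sandbox cannot work on the host's HEAD.
  if (provider === "isolated" && strategy === "head") {
    throw new Error("head is not supported on isolated providers");
  }
  return strategy;
}
```

A table-driven test can then enumerate every (provider, strategy) pair and assert that exactly one combination throws.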

Result contracts

RunResult — iterations: IterationResult[], completionSignal?: string, stdout: string, commits: { sha }[], branch: string, logFilePath?: string (file mode only).

IterationResult — sessionId?: string, sessionFilePath?: string, usage?: IterationUsage (inputTokens, cacheCreationInputTokens, cacheReadInputTokens, outputTokens).

CloseResult — preservedWorktreePath?: string (set only when worktree was dirty).
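
The contracts above can be written out as TypeScript interfaces. This is a sketch rather than the library's actual declarations: the field names come from this document, but the numeric token types, the string type of sha, and the summarizeUsage helper are assumptions added for illustration.

```typescript
// Sketch of the result contracts; field types marked "assumed" are not stated in the PRD.
interface IterationUsage {
  inputTokens: number; // assumed numeric
  cacheCreationInputTokens: number;
  cacheReadInputTokens: number;
  outputTokens: number;
}

interface IterationResult {
  sessionId?: string;
  sessionFilePath?: string;
  usage?: IterationUsage;
}

interface RunResult {
  iterations: IterationResult[];
  completionSignal?: string;
  stdout: string;
  commits: { sha: string }[]; // sha assumed to be a string
  branch: string;
  logFilePath?: string; // file mode only
}

// Hypothetical helper: totals input and output tokens, counting iterations with no usage.
function summarizeUsage(result: RunResult): { tokens: number; iterationsWithoutUsage: number } {
  let tokens = 0;
  let iterationsWithoutUsage = 0;
  for (const iteration of result.iterations) {
    if (iteration.usage === undefined) {
      iterationsWithoutUsage++;
    } else {
      tokens += iteration.usage.inputTokens + iteration.usage.outputTokens;
    }
  }
  return { tokens, iterationsWithoutUsage };
}
```

Because usage is undefined when capture is off or the agent provider does not parse it, a budget check should treat iterationsWithoutUsage > 0 as "cost unknown" rather than "cost zero".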

Path resolution rule

  • cwd, promptFile → resolve relative to process.cwd() (caller's perspective).
  • copyToWorktree items → resolve relative to cwd (target repo's perspective).

This split is non-obvious and is documented in CONTEXT.md.
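
A short sketch may make the two resolution bases concrete. The resolvePaths function and its option bag are hypothetical; the sketch uses path.posix from Node so the example behaves the same on every platform, and it takes the caller's working directory as a parameter instead of reading process.cwd().

```typescript
import * as path from "node:path";

// Hypothetical sketch: cwd and promptFile resolve against the caller's directory,
// copyToWorktree items resolve against the (already resolved) target repo.
interface PathOptions {
  cwd?: string;
  promptFile?: string;
  copyToWorktree?: string[];
}

function resolvePaths(callerCwd: string, options: PathOptions) {
  const repoRoot = path.posix.resolve(callerCwd, options.cwd ?? ".");
  const promptFile =
    options.promptFile === undefined
      ? undefined
      : path.posix.resolve(callerCwd, options.promptFile);
  const copyToWorktree = (options.copyToWorktree ?? []).map((item) =>
    path.posix.resolve(repoRoot, item),
  );
  return { repoRoot, promptFile, copyToWorktree };
}

const resolved = resolvePaths("/home/me/scripts", {
  cwd: "../app",
  promptFile: "./prompt.md",
  copyToWorktree: ["node_modules"],
});
// resolved.repoRoot       is "/home/me/app"
// resolved.promptFile     is "/home/me/scripts/prompt.md"
// resolved.copyToWorktree is ["/home/me/app/node_modules"]
```

Note that promptFile stays next to the calling script while node_modules is looked up inside the target repo.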

Aborted runs and reusability

When signal fires mid-iteration: the agent subprocess is killed, the call rejects with signal.reason verbatim, the worktree is left in whatever state the killed agent produced (no rollback), and the Sandbox handle remains usable. Idle timeout uses the same mechanism with a sanddune-defined reason.
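
The "reject with signal.reason verbatim" and "remains usable" parts of this contract can be sketched with a standard AbortSignal. This is a deliberately simplified, synchronous stand-in: the hypothetical runIterations below only checks the signal between iterations, whereas the behaviour described above kills the agent subprocess mid-iteration.

```typescript
// Hypothetical sketch: propagate signal.reason unchanged and keep the function reusable.
function runIterations(
  maxIterations: number,
  signal: AbortSignal,
  iterate: (iteration: number) => void,
): number {
  let completed = 0;
  for (let i = 1; i <= maxIterations; i++) {
    if (signal.aborted) {
      // Rethrow the caller's reason verbatim; no wrapping, no rollback of prior work.
      throw signal.reason;
    }
    iterate(i);
    completed++;
  }
  return completed;
}
```

A caller can catch the rejection, compare it by identity with the reason it passed to controller.abort(reason), and then call runIterations again with a fresh signal.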

Resume semantics

resumeSession is a top-level run() concern only — it's about starting a fresh sandbox from a prior session. Long-lived sandboxes don't chain Claude session state across sandbox.run() calls; sandbox.run() rejects resumeSession. Resume + maxIterations > 1 throws before sandbox creation.
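
These rules amount to a pre-flight check that runs before any sandbox is created. The sketch below is an assumption about how such a check could look; assertResumeAllowed, ResumeCheck, and the injected sessionFileExists callback are hypothetical names, not part of sanddune.

```typescript
// Hypothetical pre-flight validation for resumeSession.
interface ResumeCheck {
  entryPoint: "run" | "sandbox.run";
  resumeSession?: string;
  maxIterations: number;
  sessionFileExists: (sessionId: string) => boolean; // injected so the check stays pure
}

function assertResumeAllowed(check: ResumeCheck): void {
  if (check.resumeSession === undefined) {
    return; // nothing to validate
  }
  if (check.entryPoint === "sandbox.run") {
    throw new Error("resumeSession is not supported on sandbox.run()");
  }
  if (check.maxIterations > 1) {
    throw new Error("resumeSession requires maxIterations of 1");
  }
  if (!check.sessionFileExists(check.resumeSession)) {
    throw new Error(`Session file not found for ${check.resumeSession}`);
  }
}
```

Keeping the file-existence lookup behind a callback lets unit tests cover every rejection path without touching the real sessions directory.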

Capture is best-effort

Capture failure logs a warning, leaves sessionFilePath undefined, and does not fail the run. Callers requiring a captured session must check sessionFilePath themselves.
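
One way to picture the best-effort behaviour is a wrapper that converts any capture error into a warning plus undefined. The sketch below is hypothetical: it assumes each session JSONL record carries a top-level cwd string, and it returns the rewritten JSONL text rather than a file path to stay free of file I/O.

```typescript
// Hypothetical sketch: rewrite cwd in each JSONL record, never letting a failure escape.
function rewriteCwd(jsonl: string, newCwd: string): string {
  return jsonl
    .split("\n")
    .map((line) => {
      if (line.trim() === "") return line; // keep blank lines (e.g. trailing newline)
      const record: unknown = JSON.parse(line); // throws on a corrupt record
      if (typeof record === "object" && record !== null && "cwd" in record) {
        return JSON.stringify({ ...record, cwd: newCwd });
      }
      return line;
    })
    .join("\n");
}

function captureBestEffort(
  readSessionJsonl: () => string,
  hostRepoRoot: string,
  warn: (message: string) => void,
): string | undefined {
  try {
    return rewriteCwd(readSessionJsonl(), hostRepoRoot);
  } catch (error) {
    // Capture problems are logged, and the run itself is unaffected.
    warn(`session capture failed: ${String(error)}`);
    return undefined;
  }
}
```

The undefined return plays the role of an unset sessionFilePath: callers that need the session must check for it explicitly.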

Provider env disjoint rule

Agent and sandbox provider env maps must be disjoint — overlap throws at launch because neither provider has authority over a shared key. RunOptions.env is the call-site escape hatch and is allowed to overlap.
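
The disjoint rule and the four-source precedence can be combined in one pure function. This is a minimal sketch under stated assumptions: resolveEnv and the EnvSources shape are hypothetical, and every source is assumed to have been loaded already into a plain string map.

```typescript
// Hypothetical sketch of env layering: lowest precedence first, last write wins.
type Env = Record<string, string>;

interface EnvSources {
  processEnv: Env;
  dotEnvFile: Env; // contents of .sanddune/.env
  agentProviderEnv: Env;
  sandboxProviderEnv: Env;
  runOptionsEnv: Env;
}

function resolveEnv(sources: EnvSources): Env {
  // Neither provider has authority over a shared key, so overlap is an error.
  const overlap = Object.keys(sources.agentProviderEnv).filter((key) =>
    Object.prototype.hasOwnProperty.call(sources.sandboxProviderEnv, key),
  );
  if (overlap.length > 0) {
    throw new Error(`Provider env overlap: ${overlap.join(", ")}`);
  }
  return {
    ...sources.processEnv,
    ...sources.dotEnvFile,
    ...sources.agentProviderEnv,
    ...sources.sandboxProviderEnv,
    ...sources.runOptionsEnv, // call-site escape hatch, free to overlap anything
  };
}
```

Because the two provider maps are checked for overlap first, the order in which they are spread does not affect the result.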

Custom-provider DX

Custom sandbox providers implement one of two factories. The bind-mount path is the simpler one (sanddune handles worktrees and commit extraction). Isolated providers implement copyIn (file-or-dir) and copyFileOut. name is required for telemetry. Reference implementations: src/sandboxes/docker.ts, src/sandboxes/podman.ts, src/sandboxes/vercel.ts, src/sandboxes/test-isolated.ts.

CLI surface

| Command | Purpose |
| --- | --- |
| sanddune init | Scaffold .sanddune/, build image, refuse if dir exists |
| sanddune docker build-image / podman build-image | Rebuild image; supports --dockerfile / --containerfile |
| sanddune docker remove-image / podman remove-image | Remove image |

ADRs already in scope

The library inherits 12 ratified architectural decisions in docs/adr/, numbered 0001 through 0011 with 0005 split in two: per-step timeouts (0001), cwd option (0002), reuse-worktree-by-default (0003), abort-signal on run()/interactive() (0004), remove chown UID alignment (0005a), usage as raw tokens (0005b), git worktree mounts on Windows (0006), worktree locking (0007), inline prompts skip processing (0008), branch-strategy per call (0009), layered sandbox creation (0010), sandbox-survives-abort (0011). Implementation must respect these.

Testing Decisions

What makes a good test

  • Test external behavior, not implementation details. A test that breaks when an internal helper renames is over-fitted; a test that breaks when the public contract changes is exactly right.
  • Prefer the agent invoker seam. Unit tests that exercise the iteration loop, completion-signal matching, idle timeout, and stream forwarding should swap the real agent provider for a scripted fake via the Context.Tag seam — never spawn a real agent subprocess in unit tests.
  • Prefer the isolated sandbox seam. Tests that exercise prompt pipelines, hooks, and capture should use the test-isolated.ts provider (a temp-directory-backed isolated provider) so the same code path is exercised without Docker or Podman.
  • Spawn real Docker/Podman only in integration tests. Mark these so they can be opted out in CI when the runtime is missing.
  • Separate pure-logic tests from I/O tests. The branch strategy resolver, prompt pipeline (substitution stage), and env var resolver are pure functions and should have unit tests with table-driven cases.
  • Test error paths as deliberately as success paths. Missing {{KEY}}, overlapping provider env, head on isolated, resumeSession with maxIterations > 1, hook timeout, capture failure during a successful run — each is a distinct contract and gets a test.

Modules under unit test

All seven deep modules get focused unit tests:

  1. Branch strategy resolver — Table-driven cases: each (strategy × provider type × host branch) combination, including the rejected ones (isolated + head).
  2. Worktree manager — Real git in a temp repo: create/lock/dirty-detect/preserve-or-remove paths; copyToWorktree with relative-to-cwd resolution; concurrent-creation lock contention (per ADR 0007).
  3. Prompt pipeline — Inline bypass, {{KEY}} substitution (host-side, before expansion), built-in injection, missing-key error, unused-key warning, promptArgs + inline error, !` inside promptArgs value treated as inert text. Stage 3 (shell expansion) tested against a fake sandbox handle.
  4. Iteration loop — Scripted-stream fakes via the agent invoker: completion signal substring match (single + array, first-match-wins), maxIterations bound, idle timeout fires after silence and resets on output, abort threading.
  5. Env var resolver — Layering precedence across the four sources; disjoint enforcement throws; RunOptions.env overlap allowed.
  6. Session capture — Fake isolated sandbox + temp host directory: JSONL transfer + cwd rewrite (host-side and sandbox-side), --resume validation (maxIterations > 1 throws, missing host file errors), best-effort failure (capture error → warning + run still succeeds).
  7. Hook runner — Sequential host hooks (ordering preserved), parallel host/sandbox onSandboxReady (both started, neither blocks the other), per-hook timeout, signal threading cancels in-flight commands, non-zero exit fails fast.

Integration tests

  • End-to-end against test-isolated.ts — run(), createSandbox() reuse, createWorktree() + wt.run() + wt.createSandbox(), interactive() (with a scripted "TUI" fake). Validates the public API surface and the layered ownership contract for close().
  • Real Docker — A small smoke suite that spawns the actual Docker provider on a fixture repo and confirms a commit lands; gated behind a CI flag so contributors without Docker can skip.
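
One way the Docker gate could look, as a sketch only: the PRD says the suite sits behind a CI flag but does not name it, so the environment variable SANDDUNE_DOCKER_TESTS and the helper shouldRunDockerSuite below are hypothetical.

```typescript
// Hypothetical gate: the flag name SANDDUNE_DOCKER_TESTS is an assumption,
// since the PRD only says the suite is "gated behind a CI flag".
function shouldRunDockerSuite(env: Record<string, string | undefined>): boolean {
  const flag = env["SANDDUNE_DOCKER_TESTS"];
  // The suite is opt-in, so contributors without Docker skip it by default.
  return flag === "1" || flag === "true";
}

// Example inputs: a CI environment that opts in, and a default local one.
const runsInCi = shouldRunDockerSuite({ SANDDUNE_DOCKER_TESTS: "1" });
const runsLocally = shouldRunDockerSuite({});
```

With Vitest, the result could be passed to describe.skipIf(!shouldRunDockerSuite(process.env)) so the suite is reported as skipped rather than failed when the flag is unset.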

Prior art

This is a green-field repo, so there is no prior art inside it yet. Reference patterns from the broader ecosystem:

  • Effect Context.Tag seams — the canonical Effect pattern for swapping a service in tests; the agent invoker uses it.
  • Vitest + temp dirs for git — standard Node test pattern; the worktree manager uses it.
  • Scripted-stream fakes — generator-based fakes that yield text/toolCall events deterministically; the iteration loop uses these.
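
A scripted-stream fake of the kind described in the last bullet might look like the sketch below. The event shape (AgentStreamEvent with text and toolCall variants) and the names scriptedStream and collectText are assumptions for illustration. A synchronous generator is used to keep the example short; a real agent stream would more likely be asynchronous, in which case an async generator would follow the same pattern.

```typescript
// Assumed event shape; the PRD only says the fakes yield text/toolCall events.
type AgentStreamEvent =
  | { type: "text"; text: string }
  | { type: "toolCall"; name: string; input: unknown };

// Replays a fixed script of events in order, so every test run sees the same stream.
function* scriptedStream(
  script: readonly AgentStreamEvent[],
): Generator<AgentStreamEvent> {
  for (const event of script) {
    yield event;
  }
}

// Stand-in consumer: concatenates text events and ignores tool calls.
function collectText(stream: Iterable<AgentStreamEvent>): string {
  let text = "";
  for (const event of stream) {
    if (event.type === "text") {
      text += event.text;
    }
  }
  return text;
}

const script: AgentStreamEvent[] = [
  { type: "text", text: "Editing file... " },
  { type: "toolCall", name: "edit", input: { path: "src/index.ts" } },
  { type: "text", text: "done <DONE>" },
];

const transcript = collectText(scriptedStream(script));
```

Because the script is plain data, each iteration-loop test can declare exactly the events it needs, such as a completion signal arriving after a tool call.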

Out of Scope

  • Implementing isolated sandbox providers beyond test-isolated.ts and vercel.ts. The type system supports them from day one, but other isolated providers (e.g. Fly.io, gVisor, Kata) are user-built or follow-on work.
  • Bundle/patch sync for isolated providers. Mentioned in CONTEXT.md as a future option but not part of v1.
  • Built-in observability backends. sanddune ships the onAgentStreamEvent hook; integrations with Datadog, OpenTelemetry, etc. are downstream.
  • Built-in retry / failure-recovery policies. sandbox.run() is reusable after abort, but sanddune does not roll back partial edits or commits — callers retry from a clean slate themselves.
  • Multi-agent coordination at a single iteration. One iteration = one agent invocation. Pipelines (implement → review → revise) compose at the run() / sandbox.run() level.
  • Web UI / dashboard. sanddune is a library + CLI. Dashboards are downstream products.
  • Non-git VCS. Mercurial, jj, etc. are out of scope. Worktrees, branches, commits, and git status --porcelain are assumed throughout.
  • Windows support beyond what ADR 0006 specifies. WSL is the assumed Windows path; native Windows is best-effort.
  • Built-in templates beyond the five listed. blank, simple-loop, sequential-reviewer, parallel-planner, parallel-planner-with-review ship; everything else is user-authored.
  • Backlog managers beyond GitHub Issues and Beads. Other tracker integrations are out of scope for v1; the backlog manager abstraction is open for extension but only two implementations ship.
  • Server-side / hosted sanddune. Library and CLI only; no daemon, no SaaS.

Further Notes

  • Domain vocabulary is authoritative. All implementation, code review, docs, and PRs use the terms defined in CONTEXT.md (sanddune, sandbox, host, agent, sandbox provider, branch strategy, worktree, source/target branch, agent invoker, iteration, task, completion signal, prompt template, prompt argument, prompt expansion, shell expression, etc.). Avoid retired terms ("workspace", "worktree mode", "the tool").
  • ADRs guard prior decisions. When implementing, re-read the ADRs in docs/adr/ rather than re-debating settled questions. New architectural choices that don't fit an existing ADR get a new ADR before implementation.
  • Effect is in the stack. The agent invoker is an Effect Context.Tag. The codebase will follow Effect conventions for service registration, error channels, and resource management; introduce new services as Context.Tags where the swap-in-tests benefit applies.
  • tsgo for builds, vitest for tests, npm run typecheck for type-checking. Per Claude.md. CI mirrors these.
  • Changesets for user-facing changes. Pre-1.0, all changesets are patch. Use package.json#name as the changeset name.
  • README and CONTEXT.md drift is a real risk. The brief, the README, and CONTEXT.md overlap heavily. When changing public-facing behavior, update the README and CONTEXT.md together; don't let one outpace the other.
  • This PRD covers v1 surface area, not the full implementation order. The decomposition into deep modules is the seam for sequencing follow-on issues; expect this PRD to be sliced into many issues (one per module + integration tests + CLI commands) by the to-issues skill.
