feat(e2e-harness): drive and snapshot the real wizard TUI #702

Merged
gewenyu99 merged 43 commits into main from e2e-control-plane
Jun 24, 2026

@gewenyu99 gewenyu99 commented Jun 21, 2026

Collaborator

How to test

Agent route — drive the wizard yourself. In a fresh session in this repo, run the exploring-the-wizard skill. wizard-ci is registered in .mcp.json, so the tools are already bound: open_app boots the real TUI on an app, then read_state / perform_action / render_screen (which returns the real rendered screen).

CI snapshots — real-TUI visual regression. From a wizard-workbench checkout next to this repo (PostHog creds in its .env):

cd ../wizard-workbench && pnpm wizard-ci-snapshots

Runs the full real agent flow against express-todo through the real TUI, captures each key moment, diffs the committed baseline, and writes report.html. Or comment /wizard-ci on a PR — same run, posted back as a comment. (Pairs with PostHog/wizard-workbench#2012.)

What this is

A headless e2e control plane that drives the real wizard TUI and captures what it renders. Both routes share one primitive:

  • Host (scripts/tui-host.no-jest.ts) runs the real startTUI and drives its store by state manipulation — no keystrokes. Auth uses the phx key (same bearer as an OAuth token), so the TUI advances with no browser.
  • Capture (e2e-harness/tui-capture.ts) runs the host in a PTY (node-pty) and reads the real rendered screen via @xterm/headless.

Routes:

  • CI snapshots (tui-snapshots): the fixed e2e profile self-drives the host through the real agent run → one real-TUI text snapshot per key moment (including the run screen's progression), diffed against a committed baseline.
  • Agent (wizard-ci-mcp): an MCP server proxies the host so an agent decides each screen; render_screen returns the real frame. The exploring-the-wizard skill is the how-to.
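
The "diffed against a committed baseline" step in the CI route can be pictured as a per-frame text comparison. The following is a minimal sketch, assuming each key moment is stored as one plain-text frame keyed by a name; `Frames`, `FrameDiff`, and `diffSnapshots` are hypothetical names for illustration, not the harness's actual API.

```typescript
// Hypothetical shape: one plain-text frame per key moment, keyed by name.
type Frames = Record<string, string>;

interface FrameDiff {
  moment: string;
  kind: "added" | "removed" | "changed";
}

function has(frames: Frames, moment: string): boolean {
  return Object.prototype.hasOwnProperty.call(frames, moment);
}

// Compare a fresh run against the committed baseline, frame by frame.
function diffSnapshots(baseline: Frames, current: Frames): FrameDiff[] {
  const diffs: FrameDiff[] = [];
  const moments = Object.keys(baseline)
    .concat(Object.keys(current).filter((m) => !has(baseline, m)))
    .sort();
  for (const moment of moments) {
    if (!has(baseline, moment)) diffs.push({ moment, kind: "added" });
    else if (!has(current, moment)) diffs.push({ moment, kind: "removed" });
    else if (baseline[moment] !== current[moment]) diffs.push({ moment, kind: "changed" });
  }
  return diffs;
}
```

An empty result would mean the run matches the baseline; any entry is a candidate for the report.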

None of it ships — it lives in e2e-harness/ + scripts/, out of src/.

@github-actions

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

  • /wizard-ci all

Test all apps in a directory:

  • /wizard-ci basic-integration
  • /wizard-ci error-tracking-upload-source-maps
  • /wizard-ci misc
  • /wizard-ci revenue

Test an individual app:

  • /wizard-ci basic-integration/android
  • /wizard-ci basic-integration/angular
  • /wizard-ci basic-integration/astro
  • /wizard-ci basic-integration/django
  • /wizard-ci basic-integration/fastapi
  • /wizard-ci basic-integration/flask
  • /wizard-ci basic-integration/javascript-node
  • /wizard-ci basic-integration/javascript-web
  • /wizard-ci basic-integration/laravel
  • /wizard-ci basic-integration/next-js
  • /wizard-ci basic-integration/nuxt
  • /wizard-ci basic-integration/python
  • /wizard-ci basic-integration/rails
  • /wizard-ci basic-integration/react-native
  • /wizard-ci basic-integration/react-router
  • /wizard-ci basic-integration/sveltekit
  • /wizard-ci basic-integration/swift
  • /wizard-ci basic-integration/tanstack-router
  • /wizard-ci basic-integration/tanstack-start
  • /wizard-ci basic-integration/vue
  • /wizard-ci error-tracking-upload-source-maps/android
  • /wizard-ci error-tracking-upload-source-maps/cicd-docker-node-raw
  • /wizard-ci error-tracking-upload-source-maps/cicd-github-actions-docker-node-raw
  • /wizard-ci error-tracking-upload-source-maps/cicd-github-actions-nested-docker-node-raw
  • /wizard-ci error-tracking-upload-source-maps/cicd-github-actions-node-raw
  • /wizard-ci error-tracking-upload-source-maps/cicd-gitlab-node-raw
  • /wizard-ci error-tracking-upload-source-maps/cicd-ssh-vps-node-raw
  • /wizard-ci error-tracking-upload-source-maps/flutter
  • /wizard-ci error-tracking-upload-source-maps/ios
  • /wizard-ci error-tracking-upload-source-maps/next
  • /wizard-ci error-tracking-upload-source-maps/next-no-posthog
  • /wizard-ci error-tracking-upload-source-maps/node-raw
  • /wizard-ci error-tracking-upload-source-maps/node-rollup
  • /wizard-ci error-tracking-upload-source-maps/node-rollup-typescript-plugin
  • /wizard-ci error-tracking-upload-source-maps/node-webpack
  • /wizard-ci error-tracking-upload-source-maps/nuxt-3-6
  • /wizard-ci error-tracking-upload-source-maps/nuxt-4-3
  • /wizard-ci error-tracking-upload-source-maps/react-native
  • /wizard-ci error-tracking-upload-source-maps/react-vite
  • /wizard-ci error-tracking-upload-source-maps/rust
  • /wizard-ci misc/quack-quack
  • /wizard-ci revenue/stripe

Results will be posted here when complete.

Comment thread e2e-harness/__tests__/__snapshots__/e2e-flow-snapshot.test.ts.snap
Comment thread src/lib/programs/posthog-integration/test/e2e.json
Comment thread e2e-harness/wizard-ci-tools.ts Outdated
Comment thread e2e-harness/wizard-ci-tools.ts Outdated
Comment thread e2e-harness/wizard-ci-driver.ts
Comment thread e2e-harness/action-registry.ts
Comment thread e2e-harness/recorder.ts Outdated
@gewenyu99 gewenyu99 requested a review from a team June 22, 2026 21:06
Comment thread scripts/ci-driver-demo.ts Outdated
gewenyu99 added a commit that referenced this pull request Jun 22, 2026
Same resolved version; just the package.json floor, so #701 and #702 don't
conflict on the zod line.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@gewenyu99 gewenyu99 changed the title from "feat(ci-driver): wizard-ci-tools control plane for headless e2e + record/replay" to "feat(e2e-harness): drive and snapshot the real wizard TUI" Jun 23, 2026
gewenyu99 and others added 15 commits June 23, 2026 21:31
…ord/replay

A control plane over the TUI store that drives the wizard end-to-end with no
terminal and no browser, for CI/e2e and agent-driven testing. The render is a
pure function of the nanostore, so driving committed state == driving the UI.
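
The claim that "driving committed state == driving the UI" follows from the render being a pure function of the store. A miniature sketch of that idea, with no Ink or nanostores involved; `MiniStore`, `WizardState`, and `render` are hypothetical stand-ins, and only `completeSetup` borrows a name mentioned elsewhere in this PR:

```typescript
// Hypothetical miniature of the idea; the real wizard uses a nanostore and Ink.
interface WizardState {
  screen: "intro" | "setup" | "run";
  setupConfirmed: boolean;
}

type Listener = (state: WizardState) => void;

class MiniStore {
  private state: WizardState = { screen: "intro", setupConfirmed: false };
  private listeners: Listener[] = [];

  get(): WizardState {
    return this.state;
  }

  subscribe(listener: Listener): void {
    this.listeners.push(listener);
  }

  // The same setter a key handler would call; a driver can call it directly.
  completeSetup(): void {
    this.state = { screen: "run", setupConfirmed: true };
    for (const listener of this.listeners) listener(this.state);
  }
}

// Pure: same state in, same frame out. No terminal or keystroke involved.
function render(state: WizardState): string {
  return state.setupConfirmed
    ? `[${state.screen}] working...`
    : `[${state.screen}] press enter to continue`;
}
```

Because `render` depends only on the state, committing through the setter produces the same frame a keystroke would.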

Core files (src/lib/ci-driver/):
- wizard-ci-driver.ts — read_state / list_actions / perform_action over a live
  WizardStore. read_state is a truthful, secret-free projection of committed
  state (+ derived currentScreen); perform_action commits via the exact store
  setter the Ink screen's key handler calls.
- action-registry.ts — declarative screen -> commit-action map (exhaustive over
  ScreenId/Overlay). The actuation surface: name an action, not a keystroke.
- wizard-ci-tools.ts — in-process MCP server exposing the three tools, so an
  external harness or LLM can drive a real run.
- e2e-profile.ts — WizardE2eProfile: a program's declarative e2e test definition
  (the UI choices). decideE2eAction(state, profile) maps screen -> commit, so
  the harness is generic and the choices live on the program.
- recorder.ts — captures a frame at each key moment (route/task/status/runPhase/
  overlay change) off the store's version counter; redacts the access token.
- replay.ts — reconstructs a throwaway store per frame and renders the REAL Ink
  screen back to ANSI, so a run replays in the terminal.
- DRIVING-E2E-FROM-AN-AGENT.md — how a future agent drives these.
- __tests__/ — control-plane walk, flow snapshot (TUI-snapshot analog), recorder.
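
To illustrate "name an action, not a keystroke", here is a small sketch of a declarative screen-to-action map. It assumes a toy `Store` and four made-up screen ids; the real registry is exhaustive over ScreenId/Overlay and its action names may differ.

```typescript
// Hypothetical screen ids and actions; the real registry covers ScreenId/Overlay.
type ScreenId = "intro" | "setup" | "auth" | "run";

interface Store {
  introConfirmed: boolean;
  setupChoice: string | null;
}

type Action = (store: Store, arg?: string) => void;

// Record<ScreenId, ...> makes the map exhaustive: omitting a screen is a type error.
const ACTION_REGISTRY: Record<ScreenId, Record<string, Action>> = {
  intro: { confirm_intro: (s) => { s.introConfirmed = true; } },
  setup: { choose_option: (s, arg) => { s.setupChoice = arg ?? "default"; } },
  auth: {}, // no-action screen: nothing to commit here
  run: {}, // no-action screen
};

function listActions(screen: ScreenId): string[] {
  return Object.keys(ACTION_REGISTRY[screen]);
}

// Unknown action names are rejected rather than guessed.
function performAction(store: Store, screen: ScreenId, name: string, arg?: string): boolean {
  const actions = ACTION_REGISTRY[screen];
  if (!Object.prototype.hasOwnProperty.call(actions, name)) return false;
  actions[name](store, arg);
  return true;
}
```

Typing the map as `Record<ScreenId, ...>` is one way to get compile-time exhaustiveness; the PR also relies on a runtime exhaustiveness test.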

Programs declare their flow's UI choices:
- programs/program-step.ts — ProgramConfig.e2e?: WizardE2eProfile.
- programs/posthog-integration/index.ts — the integration program's e2e profile.
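
A profile-driven decision function along the lines of decideE2eAction(state, profile) might look like the sketch below. The profile fields, screen names, and action names are assumptions chosen for illustration; only the function name and its (state, profile) shape come from the commit message.

```typescript
// Hypothetical profile fields; the real WizardE2eProfile may differ.
interface E2eProfile {
  setupOption: "first" | "last";
  connectMcp: boolean;
  connectSlack: boolean;
}

interface DriverState {
  currentScreen: string;
  setupOptions: string[];
}

interface Decision {
  action: string;
  arg?: string;
}

// Map the current screen to the commit the profile asks for; null means "wait".
function decideE2eAction(state: DriverState, profile: E2eProfile): Decision | null {
  switch (state.currentScreen) {
    case "intro":
      return { action: "confirm_intro" };
    case "setup": {
      const opts = state.setupOptions;
      if (opts.length === 0) return null;
      const pick = profile.setupOption === "first" ? opts[0] : opts[opts.length - 1];
      return { action: "choose_option", arg: pick };
    }
    case "mcp":
      return { action: profile.connectMcp ? "connect_mcp" : "skip_mcp" };
    case "slack":
      return { action: profile.connectSlack ? "connect_slack" : "skip_slack" };
    default:
      return null; // screens with nothing to commit
  }
}
```

Keeping the choices in a data object means the driver loop stays generic while each program supplies its own profile.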

Harness/entry scripts:
- scripts/e2e-full-run.no-jest.ts — headless full run: real WizardStore + InkUI
  (never rendered) + concurrent driver + real runAgent; emits a structured
  result + a recording.
- scripts/replay-e2e.no-jest.ts — replay a recording in the terminal.
- scripts/ci-driver-demo.ts — offline control-plane demo (no agent).

Additive; no core wizard behavior changed. The workbench `wizard-ci --e2e`
(PostHog/wizard-workbench) orchestrates these against real test apps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The e2e UI-choices object moves out of index.ts into a co-located e2e.ts
(POSTHOG_INTEGRATION_E2E_PROFILE), keeping the program config lean and the
flow's test definition in its own file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/record-demo.no-jest.ts — produces a recording offline (no agent, no
network) by driving the integration flow with the e2e profile + a WizardRecorder,
so `replay-e2e.no-jest.ts` can be tried without a full run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/README.md documents the manual control-plane + record/replay tools
(what each does, what it needs, how to run). Also commits ci-driver-live-agent.ts
(real gateway LLM drives the wizard-ci-tools MCP server) so the index is complete.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
main added two confirm-and-continue intro screens (WarehouseIntro,
SelfDrivingIntro, both call store.completeSetup()). The action-registry
exhaustiveness test flagged them as uncovered. Register both as confirm_setup
in ACTION_REGISTRY and in the e2e walk policy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l refs

Move DRIVING-E2E-FROM-AN-AGENT.md → ARCHITECTURE.md to match the co-located
subsystem-doc convention (cf. programs/self-driving/ARCHITECTURE.md). Remove
content that shouldn't ship in the public repo: the internal test project id +
team name, the workbench test-api-key.txt secret file, and pointers to
workbench-only scratch files. Keep the architecture, profiles, record/replay, and
MCP-loop guidance; generalize the run instructions. Update the scripts/README link.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/render-snapshots.no-jest.ts renders every key-moment frame of a recording
to a real-Ink ANSI snapshot (one <seq>-<screen>.ans per frame), via replay's
renderFrame under tsx. These feed the workbench visual-regression flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
None of the control-plane / recording / e2e machinery belongs in the wizard's
production source. Relocate src/lib/ci-driver/ → e2e-harness/ at the repo root
(next to e2e-tests/), and sever every prod coupling:

- Remove the ProgramConfig.e2e field (program-step.ts) and the on-program profile
  (delete posthog-integration/e2e.ts, unwire index.ts). Per-program profiles now
  live in the harness — e2e-harness/profiles.ts, profileFor(programId).
- Add an @e2e-harness/* path alias (tsconfig.build.json + jest moduleNameMapper);
  repoint scripts/tests off @lib/ci-driver.

Result: src/ has ZERO references to the harness, and the published tsdown bundle
contains none of it (previously the ~90-byte profile object shipped). Full suite
(1045 tests, 3 snapshots) passes; real-recording render verified under tsx.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ARCHITECTURE.md now documents the wizard-ci-snapshots visual-regression flow
(real run → render → diff → side-by-side report) and the env it needs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…gram

A test/ README documents this program's e2e test definition — the path the
headless run walks and the option it auto-takes at each screen (confirm intro,
dismiss outage, first setup option, skip mcp/slack, delete skills). It's the
human description; the runnable profile stays in e2e-harness/profiles.ts. No e2e
machinery returns to prod src — this is documentation only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…oads

Each program declares its e2e test path as src/lib/programs/<program>/test/e2e.json
— a `profile` (the options the headless run auto-takes) plus a documented `path`
of every screen. The harness imports the `profile` in e2e-harness/profiles.ts
(single source of truth, no prose duplication). Matches the repo's existing
JSON-data pattern (mcp-role-prompts.copy.json); resolveJsonModule already on.

It's data, imported only by the harness — zero prod imports, absent from the
tsdown bundle. Full harness suite + runtime load verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the end-to-end trace (agent → perform_action → driver → action-registry →
store.completeSetup → emitChange → router re-resolve → readState) as a comment at
the perform_action tool, with cross-referenced breadcrumbs at the driver hop
(one committed mutation per call) and the action-registry hop (the store setter +
flag-flip the screen sequence reacts to). Harness-only; prod store.ts untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…dule

Add a header note to wizard-ci-tools / wizard-ci-driver / action-registry /
recorder / replay: each lives in e2e-harness/, is imported only by scripts/tests,
and is absent from the tsdown bundle (bin.ts is the only entry). Addresses the
"this looks shippable" worry right where a reader meets the code (esp. the MCP
server + SDK import). Verified: no e2e symbols in dist/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Moving the trace / never-ships / credentials notes to PR review comments anchored
to the lines instead — keep the source uncluttered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gewenyu99 and others added 19 commits June 23, 2026 21:31
…-by-turn

scripts/wizard-ci-mcp.no-jest.ts is a stdio MCP server over one live WizardStore:
read_state / list_actions / perform_action / render_screen / run_agent. An agent
registers it and makes every decision live, instead of the static scripted run.
Rewrite the exploring-the-wizard skill to lead with this. Bump zod ^3.24→^3.25
(the MCP SDK needs the zod/v3 subpath; non-breaking) and add the SDK as a dep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Same resolved version; just the package.json floor, so #701 and #702 don't
conflict on the zod line.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
read_state already returns the legal actions, so the separate tool is noise.
Keeps the server's surface minimal: read_state, perform_action, render_screen,
run_agent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hange

Running prettier on these (not in lint-staged) reflowed the whole files — pure
diff noise. Restore them to main and re-apply just the intended edits: the
"Explore with an agent" section + the exploring-the-wizard skill row.
…d runbook

EXPLORING-AS-AN-AGENT.md was promoted to .claude/skills/exploring-the-wizard/;
this pointer fix was left uncommitted, so HEAD still linked the deleted file.
…ion start

The skill told agents to `claude mcp add` then immediately call the tools, which
is impossible (MCP servers load at session start), so agents fell back to a
script. Lead with the in-session way that actually works — a WizardCiDriver
script (read_state → perform_action → renderFrame), tested — and document the MCP
server as the interactive option that needs registering before a fresh session.
…with it

Connect the stdio transport first and build the store lazily on the first tool
call — detection + the networked health probe used to run before connect(), which
could stall the MCP handshake so Claude Code saw the server as broken. Verified
end-to-end: `claude mcp add` → `claude mcp list` shows ✔ Connected → a headless
session drove read_state → perform_action(confirm_setup) → auth → render_screen.

Skill now leads with the two-phase MCP flow (register, then drive in a fresh
session, since MCP tools bind at session start); the driver script is the fallback.
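
The fix described above (connect first, build the store lazily on the first tool call) can be sketched as a memoized async initializer. `lazy`, `getStore`, and `readStateTool` are hypothetical names, and the store is a stand-in object rather than a real WizardStore:

```typescript
// Hypothetical helper: build an expensive resource once, on first use.
function lazy<T>(build: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      // Drop the cached promise on failure so a later call can retry.
      pending = build().catch((err) => {
        pending = undefined;
        throw err;
      });
    }
    return pending;
  };
}

let builds = 0;

// Stand-in for detection plus the networked health probe.
const getStore = lazy(async () => {
  builds += 1;
  return { ready: true };
});

// A tool handler awaits the store; connecting the transport never does.
async function readStateTool(): Promise<{ ready: boolean }> {
  return getStore();
}
```

With this shape, the handshake completes before any slow work starts, and concurrent first calls share one build.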
…drives in one session

Register wizard-ci in .mcp.json so its tools are bound in every session in this
repo. An agent following the exploring-the-wizard skill now drives the wizard over
MCP (open_app -> read_state -> perform_action -> render_screen -> run_agent)
without registering anything or starting a fresh session. The server boots
app-agnostic; open_app picks the app + key at call time, so the committed config
holds no secrets. Skill + README rewritten to the one-session MCP flow.

Verified: a fresh headless agent given only the skill drove the wizard with four
MCP calls and wrote zero scripts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Just say to point appDir at the directory that has the package.json.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
appDir is just the throwaway copy of the app; let the agent find the path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
auth (and run) are NO_ACTION screens: session.credentials is set only inside
bootstrapProgram, which runs via run_agent. So nothing advances past auth without
run_agent — but the tool description said "call when currentScreen=run" and the
skill walk skipped auth, so an agent landed on auth and polled instead of calling
run_agent. Fix the run_agent description and the skill walk/key-facts to say
run_agent bootstraps creds and advances auth+run; don't poll those screens.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ves the run

A real run_agent call blocked the stdio MCP server for ~3 minutes; the client
treated the server as unhealthy, reconnected, and the restarted process lost its
in-memory store ("No app open", runPhase reset to idle). run_agent now starts the
integration in the background and returns immediately; read_state stays responsive
and reports runPhase running -> completed plus an integration status, so the agent
polls instead of blocking. Skill + tool descriptions updated to the poll model;
noted that run_agent creates real PostHog resources each run.

Proven: run_agent returns in 0.0s; read_state during the run answers in 1-2ms with
runPhase=running.
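
The poll model described in this commit can be sketched as follows: `runAgent` starts the work in the background and returns at once, while `readState` stays cheap and synchronous. `RunSession` and its constructor argument are hypothetical; `runPhase` and `integrationError` follow the field names used in this PR.

```typescript
type RunPhase = "idle" | "running" | "completed" | "failed";

// Hypothetical session object; `run` stands in for the real integration.
class RunSession {
  private phase: RunPhase = "idle";
  private error: string | null = null;

  constructor(private readonly run: () => Promise<void>) {}

  // Start the run in the background and return immediately.
  runAgent(): { started: boolean } {
    if (this.phase === "running") return { started: false };
    this.phase = "running";
    this.error = null;
    Promise.resolve()
      .then(() => this.run())
      .then(
        () => { this.phase = "completed"; },
        (err: unknown) => {
          this.phase = "failed";
          this.error = err instanceof Error ? err.message : String(err);
        },
      );
    return { started: true };
  }

  // Cheap and synchronous, so the server stays responsive during the run.
  readState(): { runPhase: RunPhase; integrationError: string | null } {
    return { runPhase: this.phase, integrationError: this.error };
  }
}
```

A caller would invoke `runAgent()` once and then poll `readState()` until `runPhase` leaves `running`.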

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…or both routes

Both e2e routes run the real wizard TUI (startTUI) driven by store state
manipulation — no keystrokes — and capture the real rendered screen from a PTY.
Auth is satisfied by setCredentials with the phx key (same bearer as an OAuth
token), so the TUI advances with no browser.

- e2e-harness/tui-capture.ts — run a command in a PTY (node-pty), read its screen
  via @xterm/headless.
- scripts/tui-host.no-jest.ts — the real-TUI host. MODE=fixed self-drives the
  fixed e2e profile, signals each screen, writes a structured result JSON;
  MODE=serve takes drive commands over a unix socket.
- scripts/tui-snapshots.no-jest.ts — CI route: real-TUI text snapshot per screen.
- scripts/wizard-ci-mcp.no-jest.ts — agent route: MCP server proxying the host.
- scripts/wizard-ci-explore.no-jest.ts — drive the MCP route, print the real TUI.
- scripts/tui-replay.no-jest.ts — replay captured snapshots in the terminal.

Deletes the record-then-reconstruct machinery (recorder, replay, e2e-full-run,
render-snapshots, replay-e2e) and the in-process wizard-ci-tools server. Adds
node-pty + @xterm/headless.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sition

Snapshot on key moments — a screen change, a task-list update, or a runPhase
change — via a store subscription, and snap each screen before the driver acts on
it. The run screen (the agent working) is captured as it progresses, and fast
transitions (intro/auth/outro/mcp/slack) are no longer skipped by throttling.
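
Snapshotting on key moments rather than on a timer amounts to comparing a small projection of the state between store notifications. A sketch under that assumption; `Moment`, `momentKey`, and `snapshotOnKeyMoments` are hypothetical names:

```typescript
// Hypothetical projection of the fields that define a "key moment".
interface Moment {
  screen: string;
  tasks: string[];
  runPhase: string;
}

function momentKey(m: Moment): string {
  return JSON.stringify([m.screen, m.tasks, m.runPhase]);
}

// Returns a store listener that snapshots only when the key moment changes.
function snapshotOnKeyMoments(capture: (m: Moment) => void): (m: Moment) => void {
  let lastKey: string | null = null;
  return (m) => {
    const key = momentKey(m);
    if (key === lastKey) return; // unrelated store churn: no new snapshot
    lastKey = key;
    capture(m);
  };
}
```

Because the trigger is a state change and not elapsed time, a screen that is visible only briefly still produces a frame.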

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ed loop

Snapshot on every key-moment change (no throttle spacing, just a settle). And
don't await the driver loop at exit — on the cheap (no-agent) path it's parked in
waitForChange, so awaiting it hung the process and exited non-zero, which would
fail CI. The process now exits 0 cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The fixed CI route always drives the full real agent run — a no-agent path was
pointless (and is what hung at exit). Removes the RUN_AGENT branch and the
auth-by-state shortcut it needed in fixed mode; auth is bootstrapped by the run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
node-pty ships no linux-x64 prebuilt, so CI must compile it; pnpm 10 blocks build
scripts unless allowlisted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ink renders non-interactively when it detects CI (CI / GITHUB_ACTIONS), leaving
the captured xterm buffer blank. Strip them from the spawned host's env. Verified
locally: with CI=true, render_screen now returns the real TUI instead of blank.
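
Stripping the CI markers from the spawned host's environment could look like the sketch below. `envForHost` and `CI_MARKERS` are hypothetical names; the two variable names are the ones this commit mentions.

```typescript
type Env = Record<string, string | undefined>;

// Variables this commit names as making Ink render non-interactively.
const CI_MARKERS = ["CI", "GITHUB_ACTIONS"];

// Return a copy of the environment without the CI markers; the input is untouched.
function envForHost(env: Env): Env {
  const out: Env = {};
  for (const key of Object.keys(env)) {
    if (CI_MARKERS.indexOf(key) === -1) out[key] = env[key];
  }
  return out;
}
```

The result would be passed as the environment of the spawned host process, so the parent CI job keeps its own variables.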

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@gewenyu99 gewenyu99 force-pushed the e2e-control-plane branch from 426da5a to c506fea June 24, 2026 01:32
gewenyu99 and others added 3 commits June 24, 2026 11:31
main added the source-maps detection screen; the action-registry
exhaustiveness test requires every screen be actionable or explicitly
no-action. The integration e2e profile never enters the source-maps
program, so it joins the other non-integration screens in
NO_ACTION_SCREENS, with a note to wire it in when a source-maps
profile drives that program.
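
The rule the exhaustiveness test enforces (every screen is actionable or explicitly no-action) can be sketched as a plain coverage check. The screen ids below, including `source-maps-detect`, are made up for illustration, and `coverageProblems` is a hypothetical helper, not the test in the repo.

```typescript
// Hypothetical data; the real test reads ScreenId and the harness registry.
const ALL_SCREENS = ["intro", "setup", "auth", "run", "source-maps-detect"];
const ACTIONABLE = ["intro", "setup"];
const NO_ACTION_SCREENS = ["auth", "run", "source-maps-detect"];

// A screen must be actionable or explicitly no-action, and never both.
function coverageProblems(all: string[], actionable: string[], noAction: string[]): string[] {
  const problems: string[] = [];
  for (const screen of all) {
    const isActionable = actionable.indexOf(screen) !== -1;
    const isNoAction = noAction.indexOf(screen) !== -1;
    if (!isActionable && !isNoAction) problems.push(`${screen}: uncovered`);
    if (isActionable && isNoAction) problems.push(`${screen}: listed twice`);
  }
  return problems;
}
```

A new screen landing on main would show up here as `uncovered` until someone makes the call explicitly.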

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
postbuild copies scripts/ into dist (which ships); drop the *.no-jest.*
e2e/CI scripts from dist so the published wizard carries only runtime
scripts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- drop a stray blank line from posthog-integration config (no prod diff)
- extract the shared intro/health-check/run sequence in tui-host
- pass projectId to getOrAskForProjectData as a number (its declared type)
- strip host AI_AGENT alongside CLAUDE/ANTHROPIC, matching the workbench

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@edwinyjlim edwinyjlim left a comment

Member

good since it's all additive

Comment thread scripts/wizard-ci-mcp.no-jest.ts Outdated
Comment thread scripts/tui-host.no-jest.ts
gewenyu99 and others added 2 commits June 24, 2026 17:54
- never write an inline api key to disk; pass it to the host via env
  (POSTHOG_PERSONAL_API_KEY), same as the CI path. A caller-supplied
  keyFile is still used as-is.
- surface a failed run's error in read_state (integrationError) so CI
  and the agent see why the integration failed instead of a bare 'failed'.
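
The credential handling in the first bullet can be sketched as a small decision function: an inline key travels only through the environment, and a caller-supplied key file is passed along untouched. `OpenAppArgs`, `HostLaunch`, and `credentialsForHost` are hypothetical, and giving `keyFile` precedence when both are supplied is an assumption; only the POSTHOG_PERSONAL_API_KEY variable name comes from the commit.

```typescript
interface OpenAppArgs {
  apiKey?: string; // inline key supplied by the caller
  keyFile?: string; // path to a key file the caller already owns
}

interface HostLaunch {
  env: Record<string, string>;
  keyFile?: string;
}

// Decide how the host receives credentials without ever writing a key to disk.
function credentialsForHost(args: OpenAppArgs): HostLaunch {
  if (args.keyFile !== undefined) {
    // Assumption: a caller-supplied file wins and is used as-is.
    return { env: {}, keyFile: args.keyFile };
  }
  if (args.apiKey !== undefined) {
    return { env: { POSTHOG_PERSONAL_API_KEY: args.apiKey } };
  }
  throw new Error("open_app needs either apiKey or keyFile");
}
```

Nothing in this path creates a file, so an inline key never outlives the host process's environment.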

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Spell out the explore walk (open_app, snapshot each key moment, act,
run_agent, finish) and have it save numbered render_screen frames to
/tmp/wz-explore-snaps, matching the CI route's .txt frames. Align the
skill's snapshot guidance with the README example.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@gewenyu99 gewenyu99 merged commit bd875c3 into main Jun 24, 2026
17 checks passed
@gewenyu99 gewenyu99 deleted the e2e-control-plane branch June 24, 2026 22:01
sarahxsanders added a commit that referenced this pull request Jun 25, 2026
Resolved 5 conflicts from main's #702/#725/#726:

- runner/index.ts: combined our idempotent flushScanReport finalizer
  (registerCleanup + finally, return await) with main's stampVariant() calls
  in both fork arms
- constants.ts: kept WIZARD_WARLOCK_DISABLED_FLAG_KEY; took main's removal of
  WIZARD_VARIANTS (variant is now runner-derived via stampVariant)
- package.json: kept both new deps (@vitest/coverage-v8 + @xterm/headless);
  dropped main's re-added root jest config block (root is vitest now; e2e-tests
  keeps its own jest config)
- tsconfig.json: added main's e2e-harness to include; kept our e2e-tests
  exclusion (standalone jest package, not in the vitest root typecheck)
- pnpm-lock.yaml: regenerated via pnpm install

Canonicalized main's new e2e-harness snapshots to vitest key format (content
unchanged; jest used "describe test", vitest uses "describe > test").

Full suite green: 987 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Labels

None yet

Projects

None yet

2 participants