Skip to content

feat(review): add Cursor CLI as an experimental review agent provider#912

Open
backnotprop wants to merge 6 commits into
mainfrom
feat/cursor-cli
Open

feat(review): add Cursor CLI as an experimental review agent provider#912
backnotprop wants to merge 6 commits into
mainfrom
feat/cursor-cli

Conversation

@backnotprop

Copy link
Copy Markdown
Owner

What

Adds Cursor CLI as an experimental third background review provider (cursor), alongside Claude, Codex, and Code Tour. It plugs into the existing /api/agents/jobs job system — not Ask AI — so it reuses the same job cards, live logs, findings, and external-annotation pipeline as the other review agents.

Design principle

Match the existing providers; don't out-engineer them. Cursor's only real difference is that its output is prose, not schema-validated JSON (it has no --json-schema flag). That earns strict parsing — and nothing else. Deliberately not included: per-job nonce, diff-grounding gate, isolated CURSOR_CONFIG_DIR, env scrubbing. Cursor inherits the environment exactly like Claude/Codex.

Backend

  • packages/server/cursor-review.ts (new) — command builder (read-only flags, prompt on stdin), a line-buffered NDJSON stream reducer (pipe chunks ≠ record boundaries), static-marker last-block JSON extraction, schema validation, and a finding→annotation transform mirroring transformClaudeFindings (author Cursor).
  • agent-jobs.ts — capability detection (the Cursor binary is literally agent, not cursor); a fail-closed server-built-provider guard so client-supplied argv is never spawned for built-in providers; cursor stdout-log dispatch.
  • review.ts onJobCompletefails closed by mutating the job (job.status = "failed"), never by throwing (the runner swallows throws and would otherwise leave an exit-0 job green); PR/diff-scope attribution; workspace path normalization.
  • Full Pi parityapps/pi-extension/server/{agent-jobs,serverReview}.ts, vendored via vendor.sh.

UI

Capability-gated and invisible unless the Cursor CLI is installed. In Code Review mode, Cursor appears as a review-only engine with an experimental label and an Auto model default. Claude / Codex / Code Tour are untouched. (This part is intentionally minimal — flagged for refinement during review.)

Safety

Read-only posture is flags-only: --mode ask + --sandbox enabled, no --force/--yolo.

Verification

  • bun run typecheck (vendor.sh + tsc across shared/ai/server/ui/pi-extension): green.
  • cursor-review.test.ts: 22/22 (high-value only — chunk-split NDJSON, partial-output dedup, last-block-wins, every failure path → fail, transform shape).
  • Affected existing review-agent test families: pass.

This was built and then put through an adversarial multi-agent review, which caught and fixed a real blocker (a Bun-only call in dead model-discovery code that broke the Pi Node typecheck — removed).

⚠️ Outstanding (must verify before trusting in production)

The read-only safety gate is human-only: confirming agent -p --mode ask is actually non-mutating requires an authenticated Cursor account (the dev shell was unauthenticated). The code ships the right flags and is labeled experimental, but that behavioral guarantee is not yet verified.

Notes for reviewers

  • Design rationale docs (cursor-cli-review-agent-approach.md + handoffs) were kept out of this commit to keep the diff focused on code. They can be added if useful.
  • UI is type-clean but not runtime-exercised (built/reviewed headlessly).

backnotprop added a commit that referenced this pull request Jun 13, 2026
- [P1] Pass the review prompt as the trailing positional argv arg instead of
  stdin. 'agent' reads task text from argv ([prompt...]); on stdin it received
  an empty prompt and never saw the diff/instructions, so no real review ran.
- [P2] Add --trust to the headless command. Without it 'agent -p' stops on an
  interactive workspace-trust prompt a background job can't answer, failing the
  first review in any untrusted repo. Safe alongside --mode ask + --sandbox.
- [P2] Derive the job verdict from finding severities (like Claude) instead of
  storing Cursor's free-form summary.correctness. A value like 'not correct'
  was accepted verbatim and the detail panel (string contains 'correct' except
  'incorrect' -> green) would invert the displayed result.
- [P3] Carry a partial-line buffer across stdout chunks in the live-log drainer.
  stream-json is NDJSON with arbitrary chunk boundaries; records split across
  chunks were dropped from live logs (fixes Claude's same latent bug too).

All changes mirrored into the Pi runtime (serverReview.ts, agent-jobs.ts).
Typecheck green; 58 tests pass across cursor-review + affected families.
backnotprop added a commit that referenced this pull request Jun 14, 2026
- [P2] Skip Cursor's final buffered assistant flush in live logs. We always run
  with --stream-partial-output, so only real deltas (timestamp_ms present,
  model_call_id absent) should show; the no-timestamp end-of-turn flush repeats
  the whole assistant output. Added a test.
- [important] Verify the 'agent' binary is actually Cursor before offering it.
  'agent' is a generic name; capability detection now probes 'agent about
  --format json' (fast, unauthenticated) and requires a cliVersion field, so an
  unrelated 'agent' on PATH is never spawned with review content.
- [nit] Cursor is fail-closed: an unexpected throw in onJobComplete now flips the
  job to failed instead of being swallowed (Claude/Codex stay fail-open).
- [P3] Cursor jobs now label as 'Cursor' (was 'Shell') in the detail panel.
- [nit] Cursor's single-option model row renders static text, not a dead dropdown.
- [nit] Pi buildCommand JSDoc now lists cursor.

All server changes mirrored Bun<->Pi. Typecheck green; 59 tests pass.

Note: the argv-size finding was triaged and intentionally NOT changed — Cursor
passes the prompt as argv exactly like Codex already does (parity, not a
regression). The diff is only embedded for full-stack/workspace/unknown-VCS
reviews; normal reviews hand the agent a git instruction or URL.
Adds 'cursor' as a third background review provider alongside Claude, Codex,
and Code Tour, plugged into the existing /api/agents/jobs system (not Ask AI).

Design principle: match the existing providers, don't out-engineer them. Cursor's
only real difference is prose output (no schema flag), which earns strict PARSING
only — no extra isolation, grounding, or finding-policing.

Backend:
- packages/server/cursor-review.ts: command builder (read-only flags, prompt on
  stdin), line-buffered NDJSON stream reducer, static-marker last-block JSON
  extraction, schema validation, finding->annotation transform.
- agent-jobs.ts: capability detection (binary is 'agent', not 'cursor'),
  fail-closed server-built-provider guard (client argv is never spawned for
  built-in providers), cursor stdout-log dispatch.
- review.ts onJobComplete: fail-closed BY MUTATION (never throws — the runner
  swallows throws), PR/diffScope attribution, workspace path normalization.
- Full Pi parity (agent-jobs.ts, serverReview.ts, vendored via vendor.sh).

UI (capability-gated, invisible unless the Cursor CLI is installed):
- AgentsTab/useAgentSettings: Cursor as a review-only engine with an experimental
  label and Auto model default.

Read-only posture is flags-only: --mode ask + --sandbox enabled, no --force.
Env is inherited like Claude/Codex.

Tests: packages/server/cursor-review.test.ts (22, high-value only). Full
typecheck + affected review-agent test families pass.

OUTSTANDING (not automatable here): empirically verify 'agent -p --mode ask' is
non-mutating with an authenticated Cursor account before trusting in production.
Ship as experimental until confirmed.
- [P1] Pass the review prompt as the trailing positional argv arg instead of
  stdin. 'agent' reads task text from argv ([prompt...]); on stdin it received
  an empty prompt and never saw the diff/instructions, so no real review ran.
- [P2] Add --trust to the headless command. Without it 'agent -p' stops on an
  interactive workspace-trust prompt a background job can't answer, failing the
  first review in any untrusted repo. Safe alongside --mode ask + --sandbox.
- [P2] Derive the job verdict from finding severities (like Claude) instead of
  storing Cursor's free-form summary.correctness. A value like 'not correct'
  was accepted verbatim and the detail panel (string contains 'correct' except
  'incorrect' -> green) would invert the displayed result.
- [P3] Carry a partial-line buffer across stdout chunks in the live-log drainer.
  stream-json is NDJSON with arbitrary chunk boundaries; records split across
  chunks were dropped from live logs (fixes Claude's same latent bug too).

All changes mirrored into the Pi runtime (serverReview.ts, agent-jobs.ts).
Typecheck green; 58 tests pass across cursor-review + affected families.
- [P2] Skip Cursor's final buffered assistant flush in live logs. We always run
  with --stream-partial-output, so only real deltas (timestamp_ms present,
  model_call_id absent) should show; the no-timestamp end-of-turn flush repeats
  the whole assistant output. Added a test.
- [important] Verify the 'agent' binary is actually Cursor before offering it.
  'agent' is a generic name; capability detection now probes 'agent about
  --format json' (fast, unauthenticated) and requires a cliVersion field, so an
  unrelated 'agent' on PATH is never spawned with review content.
- [nit] Cursor is fail-closed: an unexpected throw in onJobComplete now flips the
  job to failed instead of being swallowed (Claude/Codex stay fail-open).
- [P3] Cursor jobs now label as 'Cursor' (was 'Shell') in the detail panel.
- [nit] Cursor's single-option model row renders static text, not a dead dropdown.
- [nit] Pi buildCommand JSDoc now lists cursor.

All server changes mirrored Bun<->Pi. Typecheck green; 59 tests pass.

Note: the argv-size finding was triaged and intentionally NOT changed — Cursor
passes the prompt as argv exactly like Codex already does (parity, not a
regression). The diff is only embedded for full-stack/workspace/unknown-VCS
reviews; normal reviews hand the agent a git instruction or URL.
…h the live-log formatter drops

Self-review follow-up: the assistant-dedup rule differs on purpose between
reduceCursorStreamEvents (keeps the no-timestamp end-of-turn flush as a
marker-parse safety net) and formatCursorLogEvent (drops it to avoid
duplicating output in live logs). Documents the relationship so the two aren't
mistakenly unified. Comment-only.
…repo-guidance-aware)

Replaces the homemade Claude-derived prompt with an investigation-first prompt
tuned for Cursor's agentic model:
- Tells the agent to use tools — read the full module, trace call sites, check
  sibling code/tests, confirm the issue isn't already handled — before reporting.
- Discovers and honors repo guidance if present (CLAUDE.md, REVIEW.md, AGENTS.md,
  .cursor/rules/*, .cursor/BUGBOT.md root+nested). So project-specific rules flow
  in from the repo instead of being hardcoded — same prompt works for any repo,
  and a repo's BUGBOT.md serves both Cursor Bugbot and this CLI review.
- Adds an explicit 'Do NOT report' list (formatting/naming, speculative
  refactors, 'consider adding tests', exotic hypotheticals) to cut noise.
- Requires impact + trigger in each finding; one finding per distinct bug.
- Optimizes for findings a maintainer would actually fix.

Output contract unchanged: same <plannotator-review-json> marker block and the
same important/nit/pre_existing severities the parser, transform, and UI depend
on. No test asserts prompt text; typecheck + 23 tests pass.
Replaces the hardcoded single-entry Cursor model list with the real,
account-specific catalog. `agent models` returns ~139 models (current ones:
Opus 4.8, GPT-5.5, Fable 5, Gemini 3.1, Grok, etc.) in a stable `id - Label`
format — hardcoding any subset would go stale immediately.

- cursor-review.ts: parseCursorModelsOutput() — runtime-agnostic parser for the
  `agent models` text (skips header/tip, dedupes). Shared, vendored to Pi.
- agent-jobs.ts (Bun + Pi): discoverCursorModels() spawns `agent models` at
  review-server capability detection (best-effort, cached; empty when the CLI is
  unauthenticated) and attaches the catalog to the cursor AgentCapability.
- AgentCapability gains an optional `models` field (deliberate, per approach
  doc — not smuggled). Cursor capability detection now also gated to review mode.
- UI: AgentsTab drives the model picker from capabilities.cursor.models, falling
  back to an `auto`-only list when none are reported. Default model id is now
  `auto` (lowercase, matching the real catalog); buildCursorCommand omits
  --model for `auto` case-insensitively.

Verified the parser against the live CLI: parses all 139 models. Typecheck
green; 61 tests pass (incl. 3 new parser tests). Binary rebuilt.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant