feat(review): add Cursor CLI as an experimental review agent provider by backnotprop · Pull Request #912 · backnotprop/plannotator

backnotprop · 2026-06-13T20:52:22Z

What

Adds Cursor CLI as an experimental third background review provider (cursor), alongside Claude, Codex, and Code Tour. It plugs into the existing /api/agents/jobs job system — not Ask AI — so it reuses the same job cards, live logs, findings, and external-annotation pipeline as the other review agents.

Design principle

Match the existing providers; don't out-engineer them. Cursor's only real difference is that its output is prose, not schema-validated JSON (it has no --json-schema flag). That earns strict parsing — and nothing else. Deliberately not included: per-job nonce, diff-grounding gate, isolated CURSOR_CONFIG_DIR, env scrubbing. Cursor inherits the environment exactly like Claude/Codex.

Backend

packages/server/cursor-review.ts (new) — command builder (read-only flags, prompt on stdin), a line-buffered NDJSON stream reducer (pipe chunks ≠ record boundaries), static-marker last-block JSON extraction, schema validation, and a finding→annotation transform mirroring transformClaudeFindings (author Cursor).
agent-jobs.ts — capability detection (the Cursor binary is literally agent, not cursor); a fail-closed server-built-provider guard so client-supplied argv is never spawned for built-in providers; cursor stdout-log dispatch.
review.ts onJobComplete — fails closed by mutating the job (job.status = "failed"), never by throwing (the runner swallows throws and would otherwise leave an exit-0 job green); PR/diff-scope attribution; workspace path normalization.
Full Pi parity — apps/pi-extension/server/{agent-jobs,serverReview}.ts, vendored via vendor.sh.

UI

Capability-gated and invisible unless the Cursor CLI is installed. In Code Review mode, Cursor appears as a review-only engine with an experimental label and an Auto model default. Claude / Codex / Code Tour are untouched. (This part is intentionally minimal — flagged for refinement during review.)

Safety

Read-only posture is flags-only: --mode ask + --sandbox enabled, no --force/--yolo.

Verification

bun run typecheck (vendor.sh + tsc across shared/ai/server/ui/pi-extension): green.
cursor-review.test.ts: 22/22 (high-value only — chunk-split NDJSON, partial-output dedup, last-block-wins, every failure path → fail, transform shape).
Affected existing review-agent test families: pass.

This was built and then put through an adversarial multi-agent review, which caught and fixed a real blocker (a Bun-only call in dead model-discovery code that broke the Pi Node typecheck — removed).

⚠️ Outstanding (must verify before trusting in production)

The read-only safety gate is human-only: confirming agent -p --mode ask is actually non-mutating requires an authenticated Cursor account (the dev shell was unauthenticated). The code ships the right flags and is labeled experimental, but that behavioral guarantee is not yet verified.

Notes for reviewers

Design rationale docs (cursor-cli-review-agent-approach.md + handoffs) were kept out of this commit to keep the diff focused on code. They can be added if useful.
UI is type-clean but not runtime-exercised (built/reviewed headlessly).

- [P1] Pass the review prompt as the trailing positional argv arg instead of stdin. 'agent' reads task text from argv ([prompt...]); on stdin it received an empty prompt and never saw the diff/instructions, so no real review ran. - [P2] Add --trust to the headless command. Without it 'agent -p' stops on an interactive workspace-trust prompt a background job can't answer, failing the first review in any untrusted repo. Safe alongside --mode ask + --sandbox. - [P2] Derive the job verdict from finding severities (like Claude) instead of storing Cursor's free-form summary.correctness. A value like 'not correct' was accepted verbatim and the detail panel (string contains 'correct' except 'incorrect' -> green) would invert the displayed result. - [P3] Carry a partial-line buffer across stdout chunks in the live-log drainer. stream-json is NDJSON with arbitrary chunk boundaries; records split across chunks were dropped from live logs (fixes Claude's same latent bug too). All changes mirrored into the Pi runtime (serverReview.ts, agent-jobs.ts). Typecheck green; 58 tests pass across cursor-review + affected families.

- [P2] Skip Cursor's final buffered assistant flush in live logs. We always run with --stream-partial-output, so only real deltas (timestamp_ms present, model_call_id absent) should show; the no-timestamp end-of-turn flush repeats the whole assistant output. Added a test. - [important] Verify the 'agent' binary is actually Cursor before offering it. 'agent' is a generic name; capability detection now probes 'agent about --format json' (fast, unauthenticated) and requires a cliVersion field, so an unrelated 'agent' on PATH is never spawned with review content. - [nit] Cursor is fail-closed: an unexpected throw in onJobComplete now flips the job to failed instead of being swallowed (Claude/Codex stay fail-open). - [P3] Cursor jobs now label as 'Cursor' (was 'Shell') in the detail panel. - [nit] Cursor's single-option model row renders static text, not a dead dropdown. - [nit] Pi buildCommand JSDoc now lists cursor. All server changes mirrored Bun<->Pi. Typecheck green; 59 tests pass. Note: the argv-size finding was triaged and intentionally NOT changed — Cursor passes the prompt as argv exactly like Codex already does (parity, not a regression). The diff is only embedded for full-stack/workspace/unknown-VCS reviews; normal reviews hand the agent a git instruction or URL.

Adds 'cursor' as a third background review provider alongside Claude, Codex, and Code Tour, plugged into the existing /api/agents/jobs system (not Ask AI). Design principle: match the existing providers, don't out-engineer them. Cursor's only real difference is prose output (no schema flag), which earns strict PARSING only — no extra isolation, grounding, or finding-policing. Backend: - packages/server/cursor-review.ts: command builder (read-only flags, prompt on stdin), line-buffered NDJSON stream reducer, static-marker last-block JSON extraction, schema validation, finding->annotation transform. - agent-jobs.ts: capability detection (binary is 'agent', not 'cursor'), fail-closed server-built-provider guard (client argv is never spawned for built-in providers), cursor stdout-log dispatch. - review.ts onJobComplete: fail-closed BY MUTATION (never throws — the runner swallows throws), PR/diffScope attribution, workspace path normalization. - Full Pi parity (agent-jobs.ts, serverReview.ts, vendored via vendor.sh). UI (capability-gated, invisible unless the Cursor CLI is installed): - AgentsTab/useAgentSettings: Cursor as a review-only engine with an experimental label and Auto model default. Read-only posture is flags-only: --mode ask + --sandbox enabled, no --force. Env is inherited like Claude/Codex. Tests: packages/server/cursor-review.test.ts (22, high-value only). Full typecheck + affected review-agent test families pass. OUTSTANDING (not automatable here): empirically verify 'agent -p --mode ask' is non-mutating with an authenticated Cursor account before trusting in production. Ship as experimental until confirmed.

- [P1] Pass the review prompt as the trailing positional argv arg instead of stdin. 'agent' reads task text from argv ([prompt...]); on stdin it received an empty prompt and never saw the diff/instructions, so no real review ran. - [P2] Add --trust to the headless command. Without it 'agent -p' stops on an interactive workspace-trust prompt a background job can't answer, failing the first review in any untrusted repo. Safe alongside --mode ask + --sandbox. - [P2] Derive the job verdict from finding severities (like Claude) instead of storing Cursor's free-form summary.correctness. A value like 'not correct' was accepted verbatim and the detail panel (string contains 'correct' except 'incorrect' -> green) would invert the displayed result. - [P3] Carry a partial-line buffer across stdout chunks in the live-log drainer. stream-json is NDJSON with arbitrary chunk boundaries; records split across chunks were dropped from live logs (fixes Claude's same latent bug too). All changes mirrored into the Pi runtime (serverReview.ts, agent-jobs.ts). Typecheck green; 58 tests pass across cursor-review + affected families.

- [P2] Skip Cursor's final buffered assistant flush in live logs. We always run with --stream-partial-output, so only real deltas (timestamp_ms present, model_call_id absent) should show; the no-timestamp end-of-turn flush repeats the whole assistant output. Added a test. - [important] Verify the 'agent' binary is actually Cursor before offering it. 'agent' is a generic name; capability detection now probes 'agent about --format json' (fast, unauthenticated) and requires a cliVersion field, so an unrelated 'agent' on PATH is never spawned with review content. - [nit] Cursor is fail-closed: an unexpected throw in onJobComplete now flips the job to failed instead of being swallowed (Claude/Codex stay fail-open). - [P3] Cursor jobs now label as 'Cursor' (was 'Shell') in the detail panel. - [nit] Cursor's single-option model row renders static text, not a dead dropdown. - [nit] Pi buildCommand JSDoc now lists cursor. All server changes mirrored Bun<->Pi. Typecheck green; 59 tests pass. Note: the argv-size finding was triaged and intentionally NOT changed — Cursor passes the prompt as argv exactly like Codex already does (parity, not a regression). The diff is only embedded for full-stack/workspace/unknown-VCS reviews; normal reviews hand the agent a git instruction or URL.

…h the live-log formatter drops Self-review follow-up: the assistant-dedup rule differs on purpose between reduceCursorStreamEvents (keeps the no-timestamp end-of-turn flush as a marker-parse safety net) and formatCursorLogEvent (drops it to avoid duplicating output in live logs). Documents the relationship so the two aren't mistakenly unified. Comment-only.

…repo-guidance-aware) Replaces the homemade Claude-derived prompt with an investigation-first prompt tuned for Cursor's agentic model: - Tells the agent to use tools — read the full module, trace call sites, check sibling code/tests, confirm the issue isn't already handled — before reporting. - Discovers and honors repo guidance if present (CLAUDE.md, REVIEW.md, AGENTS.md, .cursor/rules/*, .cursor/BUGBOT.md root+nested). So project-specific rules flow in from the repo instead of being hardcoded — same prompt works for any repo, and a repo's BUGBOT.md serves both Cursor Bugbot and this CLI review. - Adds an explicit 'Do NOT report' list (formatting/naming, speculative refactors, 'consider adding tests', exotic hypotheticals) to cut noise. - Requires impact + trigger in each finding; one finding per distinct bug. - Optimizes for findings a maintainer would actually fix. Output contract unchanged: same <plannotator-review-json> marker block and the same important/nit/pre_existing severities the parser, transform, and UI depend on. No test asserts prompt text; typecheck + 23 tests pass.

Replaces the hardcoded single-entry Cursor model list with the real, account-specific catalog. `agent models` returns ~139 models (current ones: Opus 4.8, GPT-5.5, Fable 5, Gemini 3.1, Grok, etc.) in a stable `id - Label` format — hardcoding any subset would go stale immediately. - cursor-review.ts: parseCursorModelsOutput() — runtime-agnostic parser for the `agent models` text (skips header/tip, dedupes). Shared, vendored to Pi. - agent-jobs.ts (Bun + Pi): discoverCursorModels() spawns `agent models` at review-server capability detection (best-effort, cached; empty when the CLI is unauthenticated) and attaches the catalog to the cursor AgentCapability. - AgentCapability gains an optional `models` field (deliberate, per approach doc — not smuggled). Cursor capability detection now also gated to review mode. - UI: AgentsTab drives the model picker from capabilities.cursor.models, falling back to an `auto`-only list when none are reported. Default model id is now `auto` (lowercase, matching the real catalog); buildCursorCommand omits --model for `auto` case-insensitively. Verified the parser against the live CLI: parses all 139 models. Typecheck green; 61 tests pass (incl. 3 new parser tests). Binary rebuilt.

backnotprop mentioned this pull request Jun 13, 2026

fix(review): signal-safe cleanup + triage suffix for PR feedback #914

Merged

backnotprop added 5 commits June 13, 2026 20:06

backnotprop force-pushed the feat/cursor-cli branch from 402cb04 to d314f93 Compare June 14, 2026 03:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(review): add Cursor CLI as an experimental review agent provider#912

feat(review): add Cursor CLI as an experimental review agent provider#912
backnotprop wants to merge 6 commits into
mainfrom
feat/cursor-cli

backnotprop commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

backnotprop commented Jun 13, 2026

What

Design principle

Backend

UI

Safety

Verification

⚠️ Outstanding (must verify before trusting in production)

Notes for reviewers

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant