feat(review): add Cursor CLI as an experimental review agent provider#912
Open
backnotprop wants to merge 6 commits into
Open
feat(review): add Cursor CLI as an experimental review agent provider#912backnotprop wants to merge 6 commits into
backnotprop wants to merge 6 commits into
Conversation
backnotprop
added a commit
that referenced
this pull request
Jun 13, 2026
- [P1] Pass the review prompt as the trailing positional argv arg instead of stdin. 'agent' reads task text from argv ([prompt...]); on stdin it received an empty prompt and never saw the diff/instructions, so no real review ran. - [P2] Add --trust to the headless command. Without it 'agent -p' stops on an interactive workspace-trust prompt a background job can't answer, failing the first review in any untrusted repo. Safe alongside --mode ask + --sandbox. - [P2] Derive the job verdict from finding severities (like Claude) instead of storing Cursor's free-form summary.correctness. A value like 'not correct' was accepted verbatim and the detail panel (string contains 'correct' except 'incorrect' -> green) would invert the displayed result. - [P3] Carry a partial-line buffer across stdout chunks in the live-log drainer. stream-json is NDJSON with arbitrary chunk boundaries; records split across chunks were dropped from live logs (fixes Claude's same latent bug too). All changes mirrored into the Pi runtime (serverReview.ts, agent-jobs.ts). Typecheck green; 58 tests pass across cursor-review + affected families.
backnotprop
added a commit
that referenced
this pull request
Jun 14, 2026
- [P2] Skip Cursor's final buffered assistant flush in live logs. We always run with --stream-partial-output, so only real deltas (timestamp_ms present, model_call_id absent) should show; the no-timestamp end-of-turn flush repeats the whole assistant output. Added a test. - [important] Verify the 'agent' binary is actually Cursor before offering it. 'agent' is a generic name; capability detection now probes 'agent about --format json' (fast, unauthenticated) and requires a cliVersion field, so an unrelated 'agent' on PATH is never spawned with review content. - [nit] Cursor is fail-closed: an unexpected throw in onJobComplete now flips the job to failed instead of being swallowed (Claude/Codex stay fail-open). - [P3] Cursor jobs now label as 'Cursor' (was 'Shell') in the detail panel. - [nit] Cursor's single-option model row renders static text, not a dead dropdown. - [nit] Pi buildCommand JSDoc now lists cursor. All server changes mirrored Bun<->Pi. Typecheck green; 59 tests pass. Note: the argv-size finding was triaged and intentionally NOT changed — Cursor passes the prompt as argv exactly like Codex already does (parity, not a regression). The diff is only embedded for full-stack/workspace/unknown-VCS reviews; normal reviews hand the agent a git instruction or URL.
Adds 'cursor' as a third background review provider alongside Claude, Codex, and Code Tour, plugged into the existing /api/agents/jobs system (not Ask AI). Design principle: match the existing providers, don't out-engineer them. Cursor's only real difference is prose output (no schema flag), which earns strict PARSING only — no extra isolation, grounding, or finding-policing. Backend: - packages/server/cursor-review.ts: command builder (read-only flags, prompt on stdin), line-buffered NDJSON stream reducer, static-marker last-block JSON extraction, schema validation, finding->annotation transform. - agent-jobs.ts: capability detection (binary is 'agent', not 'cursor'), fail-closed server-built-provider guard (client argv is never spawned for built-in providers), cursor stdout-log dispatch. - review.ts onJobComplete: fail-closed BY MUTATION (never throws — the runner swallows throws), PR/diffScope attribution, workspace path normalization. - Full Pi parity (agent-jobs.ts, serverReview.ts, vendored via vendor.sh). UI (capability-gated, invisible unless the Cursor CLI is installed): - AgentsTab/useAgentSettings: Cursor as a review-only engine with an experimental label and Auto model default. Read-only posture is flags-only: --mode ask + --sandbox enabled, no --force. Env is inherited like Claude/Codex. Tests: packages/server/cursor-review.test.ts (22, high-value only). Full typecheck + affected review-agent test families pass. OUTSTANDING (not automatable here): empirically verify 'agent -p --mode ask' is non-mutating with an authenticated Cursor account before trusting in production. Ship as experimental until confirmed.
- [P1] Pass the review prompt as the trailing positional argv arg instead of stdin. 'agent' reads task text from argv ([prompt...]); on stdin it received an empty prompt and never saw the diff/instructions, so no real review ran. - [P2] Add --trust to the headless command. Without it 'agent -p' stops on an interactive workspace-trust prompt a background job can't answer, failing the first review in any untrusted repo. Safe alongside --mode ask + --sandbox. - [P2] Derive the job verdict from finding severities (like Claude) instead of storing Cursor's free-form summary.correctness. A value like 'not correct' was accepted verbatim and the detail panel (string contains 'correct' except 'incorrect' -> green) would invert the displayed result. - [P3] Carry a partial-line buffer across stdout chunks in the live-log drainer. stream-json is NDJSON with arbitrary chunk boundaries; records split across chunks were dropped from live logs (fixes Claude's same latent bug too). All changes mirrored into the Pi runtime (serverReview.ts, agent-jobs.ts). Typecheck green; 58 tests pass across cursor-review + affected families.
- [P2] Skip Cursor's final buffered assistant flush in live logs. We always run with --stream-partial-output, so only real deltas (timestamp_ms present, model_call_id absent) should show; the no-timestamp end-of-turn flush repeats the whole assistant output. Added a test. - [important] Verify the 'agent' binary is actually Cursor before offering it. 'agent' is a generic name; capability detection now probes 'agent about --format json' (fast, unauthenticated) and requires a cliVersion field, so an unrelated 'agent' on PATH is never spawned with review content. - [nit] Cursor is fail-closed: an unexpected throw in onJobComplete now flips the job to failed instead of being swallowed (Claude/Codex stay fail-open). - [P3] Cursor jobs now label as 'Cursor' (was 'Shell') in the detail panel. - [nit] Cursor's single-option model row renders static text, not a dead dropdown. - [nit] Pi buildCommand JSDoc now lists cursor. All server changes mirrored Bun<->Pi. Typecheck green; 59 tests pass. Note: the argv-size finding was triaged and intentionally NOT changed — Cursor passes the prompt as argv exactly like Codex already does (parity, not a regression). The diff is only embedded for full-stack/workspace/unknown-VCS reviews; normal reviews hand the agent a git instruction or URL.
…h the live-log formatter drops Self-review follow-up: the assistant-dedup rule differs on purpose between reduceCursorStreamEvents (keeps the no-timestamp end-of-turn flush as a marker-parse safety net) and formatCursorLogEvent (drops it to avoid duplicating output in live logs). Documents the relationship so the two aren't mistakenly unified. Comment-only.
…repo-guidance-aware) Replaces the homemade Claude-derived prompt with an investigation-first prompt tuned for Cursor's agentic model: - Tells the agent to use tools — read the full module, trace call sites, check sibling code/tests, confirm the issue isn't already handled — before reporting. - Discovers and honors repo guidance if present (CLAUDE.md, REVIEW.md, AGENTS.md, .cursor/rules/*, .cursor/BUGBOT.md root+nested). So project-specific rules flow in from the repo instead of being hardcoded — same prompt works for any repo, and a repo's BUGBOT.md serves both Cursor Bugbot and this CLI review. - Adds an explicit 'Do NOT report' list (formatting/naming, speculative refactors, 'consider adding tests', exotic hypotheticals) to cut noise. - Requires impact + trigger in each finding; one finding per distinct bug. - Optimizes for findings a maintainer would actually fix. Output contract unchanged: same <plannotator-review-json> marker block and the same important/nit/pre_existing severities the parser, transform, and UI depend on. No test asserts prompt text; typecheck + 23 tests pass.
402cb04 to
d314f93
Compare
Replaces the hardcoded single-entry Cursor model list with the real, account-specific catalog. `agent models` returns ~139 models (current ones: Opus 4.8, GPT-5.5, Fable 5, Gemini 3.1, Grok, etc.) in a stable `id - Label` format — hardcoding any subset would go stale immediately. - cursor-review.ts: parseCursorModelsOutput() — runtime-agnostic parser for the `agent models` text (skips header/tip, dedupes). Shared, vendored to Pi. - agent-jobs.ts (Bun + Pi): discoverCursorModels() spawns `agent models` at review-server capability detection (best-effort, cached; empty when the CLI is unauthenticated) and attaches the catalog to the cursor AgentCapability. - AgentCapability gains an optional `models` field (deliberate, per approach doc — not smuggled). Cursor capability detection now also gated to review mode. - UI: AgentsTab drives the model picker from capabilities.cursor.models, falling back to an `auto`-only list when none are reported. Default model id is now `auto` (lowercase, matching the real catalog); buildCursorCommand omits --model for `auto` case-insensitively. Verified the parser against the live CLI: parses all 139 models. Typecheck green; 61 tests pass (incl. 3 new parser tests). Binary rebuilt.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds Cursor CLI as an experimental third background review provider (
cursor), alongside Claude, Codex, and Code Tour. It plugs into the existing/api/agents/jobsjob system — not Ask AI — so it reuses the same job cards, live logs, findings, and external-annotation pipeline as the other review agents.Design principle
Match the existing providers; don't out-engineer them. Cursor's only real difference is that its output is prose, not schema-validated JSON (it has no
--json-schemaflag). That earns strict parsing — and nothing else. Deliberately not included: per-job nonce, diff-grounding gate, isolatedCURSOR_CONFIG_DIR, env scrubbing. Cursor inherits the environment exactly like Claude/Codex.Backend
packages/server/cursor-review.ts(new) — command builder (read-only flags, prompt on stdin), a line-buffered NDJSON stream reducer (pipe chunks ≠ record boundaries), static-marker last-block JSON extraction, schema validation, and a finding→annotation transform mirroringtransformClaudeFindings(authorCursor).agent-jobs.ts— capability detection (the Cursor binary is literallyagent, notcursor); a fail-closed server-built-provider guard so client-supplied argv is never spawned for built-in providers; cursor stdout-log dispatch.review.tsonJobComplete— fails closed by mutating the job (job.status = "failed"), never by throwing (the runner swallows throws and would otherwise leave an exit-0 job green); PR/diff-scope attribution; workspace path normalization.apps/pi-extension/server/{agent-jobs,serverReview}.ts, vendored viavendor.sh.UI
Capability-gated and invisible unless the Cursor CLI is installed. In Code Review mode, Cursor appears as a review-only engine with an experimental label and an
Automodel default. Claude / Codex / Code Tour are untouched. (This part is intentionally minimal — flagged for refinement during review.)Safety
Read-only posture is flags-only:
--mode ask+--sandbox enabled, no--force/--yolo.Verification
bun run typecheck(vendor.sh + tsc across shared/ai/server/ui/pi-extension): green.cursor-review.test.ts: 22/22 (high-value only — chunk-split NDJSON, partial-output dedup, last-block-wins, every failure path → fail, transform shape).This was built and then put through an adversarial multi-agent review, which caught and fixed a real blocker (a Bun-only call in dead model-discovery code that broke the Pi Node typecheck — removed).
The read-only safety gate is human-only: confirming
agent -p --mode askis actually non-mutating requires an authenticated Cursor account (the dev shell was unauthenticated). The code ships the right flags and is labeled experimental, but that behavioral guarantee is not yet verified.Notes for reviewers
cursor-cli-review-agent-approach.md+ handoffs) were kept out of this commit to keep the diff focused on code. They can be added if useful.