Conversation
Introduces the Council system for making quick, focused LLM judgments. Council members are lightweight evaluators that answer specific classification questions using Claude Haiku. Current members: - fixJudge: Evaluates whether a patch addressed a reported issue - duplicateJudge: Identifies semantic duplicates between findings Also adds fix evaluation pipeline that detects when follow-up commits attempt to fix Warden's reported issues, automatically resolving successful fixes and posting feedback on failed attempts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Addresses critical test gaps identified in review: - PR workflow integration tests for fix evaluation - Verifies evaluateFixAttempts called on synchronize events - Tests conditional skipping (no API key, no comments, non-sync events) - Tests reply posting for failed fixes - Tests graceful error handling - Direct tests for evaluateFix() with mocked Anthropic - Tests all three verdict statuses (resolved, attempted_failed, not_attempted) - Tests fallback behavior on API errors and invalid responses - New synchronize event payload fixture Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Stale detection was incorrectly resolving comments when findings weren't re-detected, even if the underlying bug was never fixed. Now comments are only resolved when the file is orphaned (no longer in PR scope). Resolution based on findings is handled by fix evaluation, which uses an LLM to verify that a patch actually addressed the issue. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- json.ts: Use extractBalancedJson instead of greedy regex to correctly handle nested JSON and avoid over-matching when text has trailing braces - patch-analysis.ts: Handle pure deletion hunks (newCount=0) without producing inverted ranges where end < start - pr-workflow.ts: Only skip stale detection for comments that were successfully resolved, allowing retry for partial failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extends the council member system to support tool use, allowing members like fix-judge to fetch additional context when evaluating fixes. Changes: - Add CouncilTool and ToolContext types to council/types.ts - Add tool execution loop to convene() with max iterations - Create tools.ts helper for schema conversion and tool execution - Update fix-judge to use before/after code snippets instead of patches - Add getFileAtCommit tool for fetching file content at specific commits - Add fetchFileContent/fetchFileLines helpers to github-actions.ts - Update orchestrator to fetch code snippets and pass tool context Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix maxToolIterations loop condition from <= to < so it allows exactly maxToolIterations tool calls instead of maxToolIterations + 1 - Update renderSuggestion to show both removed and added lines in a diff block for better context, in addition to the GitHub suggestion block Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add JSDoc comment explaining the GitHub suggestion block format with a link to the official documentation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The fix-judge was treating line numbers as the source of truth, causing it to return "not_attempted" when the reported line was off by one even though the actual fix was visible in the context. Now the prompt: - Clarifies the job is to evaluate if the ISSUE DESCRIBED was fixed - Labels location as "approximate" and "near line X" - Reframes "not_attempted" as "no changes related to this issue" Refs #105 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GitHub App installation tokens return 403 on GET /user. Use GET /app instead to get the app slug, then construct the bot login as "slug[bot]". Falls back to GET /user for PATs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The check was dead code since commentsToEvaluate is already filtered to only include comments present in relevantPatches. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update the fix-judge prompt to explicitly recognize that removing problematic code is a valid way to resolve an issue. This handles cases like deprecated code cleanup, dead code removal, and intentional feature deletion. The key change reframes evaluation from "was the suggested fix applied?" to "does the issue still exist?" which is the more accurate question. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
4d6eefb to
e97f1a4
Compare
Previously, fix evaluation compared previousHeadSha (the commit before this push) to currentSha, which only showed changes in the latest commit. This meant fixes made in earlier commits weren't detected. Now we compare baseSha (the PR's base branch commit) to currentSha, showing all changes in the PR. This ensures we can detect fixes made in any commit, not just the most recent one. Added baseSha to PullRequestContext and updated all contexts/tests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Auto-infer prefill from schema type ('{' for objects, '[' for arrays)
- Increase default max tokens from 512 to 4096
- Restructure fix-judge prompt with verdict definitions first
- Add debug mode with tool call metadata tracking
- Track failedEvaluations count for fallback monitoring
- Increase context lines from 10 to 20, add optional codeAfterFix
- Add integration tests for fix-judge with real API fixtures
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The description in renderSuggestion's deletion-only fallback was not being escaped, inconsistent with other LLM-generated content rendering. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Stop exporting FixEvaluationResult and FixEvaluationContext (unused externally) - Rename to EvaluateFixAttemptsResult/Context to match function name - Rename single evaluation type to EvaluateFixResult to match evaluateFix() Each function's result type now matches its name: - evaluateFix() → EvaluateFixResult - evaluateFixAttempts() → EvaluateFixAttemptsResult Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two fixes for review state bugs: 1. Premature APPROVE: Previously, APPROVE was sent when no blocking findings existed in the current run, without checking if unresolved Warden comments remained from previous runs. Now coordination checks for unresolved comments and suppresses APPROVE accordingly. 2. Spurious "fix attempt" replies: Made fix-judge more conservative about returning `attempted_failed`. Now requires changes to directly modify the reported file near the issue location with clear intent. Also added commit messages to the judge prompt to help understand developer intent. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace APPROVE reviews with GitHub's dismissReview API for clearing CHANGES_REQUESTED state. This is semantically more correct - APPROVE means "I vouch for this PR" while DISMISS means "my concerns are addressed." Changes: - Add dismissPreviousReview() to dismiss CHANGES_REQUESTED when all issues are resolved - Capture review ID from listReviews API (already being fetched) - Remove APPROVE from renderer - only REQUEST_CHANGES or COMMENT - Remove previousReviewState from rendering pipeline (no longer needed) - Simplify coordination logic since APPROVE suppression is obsolete - Update specs to document DISMISS behavior This reduces code complexity significantly (-503 lines net) by removing the approval coordination logic that was needed to prevent conflicting APPROVE reviews across triggers. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove tests that verify structure/config rather than behavior: - members.test.ts: Remove metadata checks, registry structure tests - github-actions.test.ts: Delete entirely (string formatting covered by orchestrator tests) Keep behavioral tests that prove functionality works. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, when resolving stale comments, all comments were added to staleCommentsResolved if resolvedCount > 0. This was incorrect when some resolutions failed (e.g., 2 of 3 due to API errors), as it would mark failed comments as resolved and potentially dismiss reviews incorrectly. Now only track comments as resolved when resolvedCount equals the total stale comment count, ensuring we don't accidentally skip unresolved issues during review dismissal. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| const unresolvedWardenComments = fetchedComments.filter( | ||
| (c) => c.isWarden && !c.isResolved | ||
| ); | ||
| const reviewCoordination = buildReviewCoordination(results, unresolvedWardenComments); |
There was a problem hiding this comment.
Dead code: unresolvedWardenComments computed but never used
Low Severity
The unresolvedWardenComments variable is computed by filtering fetchedComments, passed to buildReviewCoordination, which forwards it to coordinateReviewEvents - but that function explicitly ignores the parameter (prefixed with _). The accompanying comment references "approval flow" but approval was removed in this PR and replaced with dismissal. The actual "all issues resolved" check is correctly implemented separately in the dismissal logic at lines 392-420. This creates misleading documentation and wasted computation.
Additional Locations (2)
Reimplement fix evaluation and duplicate detection judges as plain functions without the generic council abstraction from #103. The fix judge uses Haiku with tool_use to explore code changes and determine whether reported issues were resolved, partially fixed, or unaddressed. - Add shared Haiku utility (callHaiku, callHaikuWithTools, extractJson) - Add fix-evaluation module with judge, GitHub helpers, and orchestrator - Refactor dedup's findSemanticDuplicates to use callHaiku - Add baseSha to PullRequestContext for fix evaluation range - Integrate fix evaluation into PR workflow after review posting - Add tests: extractJson pure tests, orchestrator behavioral tests, live judge integration tests (API-key gated) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Introduces the Council system for making quick, focused LLM judgments, and refactors how Warden manages findings.
Council of Warden
Council members are lightweight evaluators that answer specific classification questions using Claude Haiku.
Current members:
fixJudge- Evaluates whether code changes addressed a reported issue (not_attempted / attempted_failed / resolved)duplicateJudge- Identifies semantic duplicates between new findings and existing PR commentsFix evaluation with tool-driven exploration:
When a follow-up commit is pushed after Warden posts comments, the system:
get_file_diff,get_file_at_commitThe tool-based approach lets the judge find fixes regardless of where they occur - at the original location, in a helper function, or in a related file.
Usage tracking:
All council calls track token usage and cost, aggregated across tool iterations. This enables monitoring LLM costs at the session level.
Review State Refactor: DISMISS instead of APPROVE
Previously, when all issues were resolved, Warden posted an APPROVE review to clear CHANGES_REQUESTED. This was semantically incorrect - APPROVE means "I vouch for this PR" rather than "my concerns are addressed."
Now Warden uses GitHub's
dismissReviewAPI instead:pull_requests: write)This change significantly simplifies the codebase by removing the complex approval coordination logic that prevented conflicting APPROVE reviews across triggers (-503 lines net).
Design philosophy
See
specs/council-of-warden.mdfor design details andspecs/fix-evaluation.mdfor the fix evaluation flow.