Skip to content

feat(council): Refactor GitHub Workflow#103

Open
dcramer wants to merge 26 commits intomainfrom
feat/council-of-warden
Open

feat(council): Refactor GitHub Workflow#103
dcramer wants to merge 26 commits intomainfrom
feat/council-of-warden

Conversation

@dcramer
Copy link
Member

@dcramer dcramer commented Feb 4, 2026

Introduces the Council system for making quick, focused LLM judgments, and refactors how Warden manages findings.

Council of Warden

Council members are lightweight evaluators that answer specific classification questions using Claude Haiku.

Current members:

  • fixJudge - Evaluates whether code changes addressed a reported issue (not_attempted / attempted_failed / resolved)
  • duplicateJudge - Identifies semantic duplicates between new findings and existing PR comments

Fix evaluation with tool-driven exploration:
When a follow-up commit is pushed after Warden posts comments, the system:

  1. Evaluates all unresolved Warden comments (up to 20 per push)
  2. Gives the fix-judge tools to explore: get_file_diff, get_file_at_commit
  3. The judge examines changes to determine if the issue was fixed
  4. Cross-checks against re-detection (did we find the same issue again?)
  5. Auto-resolves successful fixes or posts feedback explaining why the fix didn't work

The tool-based approach lets the judge find fixes regardless of where they occur - at the original location, in a helper function, or in a related file.

Usage tracking:
All council calls track token usage and cost, aggregated across tool iterations. This enables monitoring LLM costs at the session level.

Review State Refactor: DISMISS instead of APPROVE

Previously, when all issues were resolved, Warden posted an APPROVE review to clear CHANGES_REQUESTED. This was semantically incorrect - APPROVE means "I vouch for this PR" rather than "my concerns are addressed."

Now Warden uses GitHub's dismissReview API instead:

  • Removes the CHANGES_REQUESTED block without approving
  • More accurately reflects "my concerns are addressed"
  • Requires no new permissions (can dismiss your own reviews with pull_requests: write)

This change significantly simplifies the codebase by removing the complex approval coordination logic that prevented conflicting APPROVE reviews across triggers (-503 lines net).

Design philosophy

  • Council members handle focused judgments (single question → structured verdict)
  • Skill agents (SDK) handle complex multi-step detection
  • Caller code handles side effects (posting comments, resolving threads)

See specs/council-of-warden.md for design details and specs/fix-evaluation.md for the fix evaluation flow.

Introduces the Council system for making quick, focused LLM judgments.
Council members are lightweight evaluators that answer specific
classification questions using Claude Haiku.

Current members:
- fixJudge: Evaluates whether a patch addressed a reported issue
- duplicateJudge: Identifies semantic duplicates between findings

Also adds fix evaluation pipeline that detects when follow-up commits
attempt to fix Warden's reported issues, automatically resolving
successful fixes and posting feedback on failed attempts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@vercel
Copy link

vercel bot commented Feb 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
warden Ready Ready Preview, Comment Feb 5, 2026 7:37am

Request Review

Addresses critical test gaps identified in review:

- PR workflow integration tests for fix evaluation
  - Verifies evaluateFixAttempts called on synchronize events
  - Tests conditional skipping (no API key, no comments, non-sync events)
  - Tests reply posting for failed fixes
  - Tests graceful error handling

- Direct tests for evaluateFix() with mocked Anthropic
  - Tests all three verdict statuses (resolved, attempted_failed, not_attempted)
  - Tests fallback behavior on API errors and invalid responses

- New synchronize event payload fixture

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Stale detection was incorrectly resolving comments when findings weren't
re-detected, even if the underlying bug was never fixed. Now comments are
only resolved when the file is orphaned (no longer in PR scope). Resolution
based on findings is handled by fix evaluation, which uses an LLM to verify
that a patch actually addressed the issue.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- json.ts: Use extractBalancedJson instead of greedy regex to correctly
  handle nested JSON and avoid over-matching when text has trailing braces
- patch-analysis.ts: Handle pure deletion hunks (newCount=0) without
  producing inverted ranges where end < start
- pr-workflow.ts: Only skip stale detection for comments that were
  successfully resolved, allowing retry for partial failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extends the council member system to support tool use, allowing members
like fix-judge to fetch additional context when evaluating fixes.

Changes:
- Add CouncilTool and ToolContext types to council/types.ts
- Add tool execution loop to convene() with max iterations
- Create tools.ts helper for schema conversion and tool execution
- Update fix-judge to use before/after code snippets instead of patches
- Add getFileAtCommit tool for fetching file content at specific commits
- Add fetchFileContent/fetchFileLines helpers to github-actions.ts
- Update orchestrator to fetch code snippets and pass tool context

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix maxToolIterations loop condition from <= to < so it allows exactly
  maxToolIterations tool calls instead of maxToolIterations + 1
- Update renderSuggestion to show both removed and added lines in a diff
  block for better context, in addition to the GitHub suggestion block

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add JSDoc comment explaining the GitHub suggestion block format with a
link to the official documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The fix-judge was treating line numbers as the source of truth, causing
it to return "not_attempted" when the reported line was off by one even
though the actual fix was visible in the context. Now the prompt:

- Clarifies the job is to evaluate if the ISSUE DESCRIBED was fixed
- Labels location as "approximate" and "near line X"
- Reframes "not_attempted" as "no changes related to this issue"

Refs #105

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GitHub App installation tokens return 403 on GET /user. Use GET /app
instead to get the app slug, then construct the bot login as
"slug[bot]". Falls back to GET /user for PATs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The check was dead code since commentsToEvaluate is already filtered
to only include comments present in relevantPatches.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update the fix-judge prompt to explicitly recognize that removing
problematic code is a valid way to resolve an issue. This handles
cases like deprecated code cleanup, dead code removal, and intentional
feature deletion.

The key change reframes evaluation from "was the suggested fix applied?"
to "does the issue still exist?" which is the more accurate question.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, fix evaluation compared previousHeadSha (the commit before
this push) to currentSha, which only showed changes in the latest
commit. This meant fixes made in earlier commits weren't detected.

Now we compare baseSha (the PR's base branch commit) to currentSha,
showing all changes in the PR. This ensures we can detect fixes made
in any commit, not just the most recent one.

Added baseSha to PullRequestContext and updated all contexts/tests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Contributor

@sentry-warden sentry-warden bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All previously reported issues have been resolved.

- Auto-infer prefill from schema type ('{' for objects, '[' for arrays)
- Increase default max tokens from 512 to 4096
- Restructure fix-judge prompt with verdict definitions first
- Add debug mode with tool call metadata tracking
- Track failedEvaluations count for fallback monitoring
- Increase context lines from 10 to 20, add optional codeAfterFix
- Add integration tests for fix-judge with real API fixtures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The description in renderSuggestion's deletion-only fallback was not
being escaped, inconsistent with other LLM-generated content rendering.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Stop exporting FixEvaluationResult and FixEvaluationContext (unused externally)
- Rename to EvaluateFixAttemptsResult/Context to match function name
- Rename single evaluation type to EvaluateFixResult to match evaluateFix()

Each function's result type now matches its name:
- evaluateFix() → EvaluateFixResult
- evaluateFixAttempts() → EvaluateFixAttemptsResult

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two fixes for review state bugs:

1. Premature APPROVE: Previously, APPROVE was sent when no blocking
   findings existed in the current run, without checking if unresolved
   Warden comments remained from previous runs. Now coordination checks
   for unresolved comments and suppresses APPROVE accordingly.

2. Spurious "fix attempt" replies: Made fix-judge more conservative
   about returning `attempted_failed`. Now requires changes to directly
   modify the reported file near the issue location with clear intent.
   Also added commit messages to the judge prompt to help understand
   developer intent.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace APPROVE reviews with GitHub's dismissReview API for clearing
CHANGES_REQUESTED state. This is semantically more correct - APPROVE
means "I vouch for this PR" while DISMISS means "my concerns are
addressed."

Changes:
- Add dismissPreviousReview() to dismiss CHANGES_REQUESTED when all
  issues are resolved
- Capture review ID from listReviews API (already being fetched)
- Remove APPROVE from renderer - only REQUEST_CHANGES or COMMENT
- Remove previousReviewState from rendering pipeline (no longer needed)
- Simplify coordination logic since APPROVE suppression is obsolete
- Update specs to document DISMISS behavior

This reduces code complexity significantly (-503 lines net) by removing
the approval coordination logic that was needed to prevent conflicting
APPROVE reviews across triggers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dcramer dcramer changed the title feat(council): Add Council of Warden for LLM-powered judgments feat(council): Add Council of Warden + refactor review state management Feb 5, 2026
@dcramer dcramer changed the title feat(council): Add Council of Warden + refactor review state management feat(council): Refactor GitHub Workflow Feb 5, 2026
dcramer and others added 2 commits February 4, 2026 23:36
Remove tests that verify structure/config rather than behavior:
- members.test.ts: Remove metadata checks, registry structure tests
- github-actions.test.ts: Delete entirely (string formatting covered by orchestrator tests)

Keep behavioral tests that prove functionality works.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, when resolving stale comments, all comments were added to
staleCommentsResolved if resolvedCount > 0. This was incorrect when
some resolutions failed (e.g., 2 of 3 due to API errors), as it would
mark failed comments as resolved and potentially dismiss reviews
incorrectly.

Now only track comments as resolved when resolvedCount equals the
total stale comment count, ensuring we don't accidentally skip
unresolved issues during review dismissal.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

const unresolvedWardenComments = fetchedComments.filter(
(c) => c.isWarden && !c.isResolved
);
const reviewCoordination = buildReviewCoordination(results, unresolvedWardenComments);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code: unresolvedWardenComments computed but never used

Low Severity

The unresolvedWardenComments variable is computed by filtering fetchedComments, passed to buildReviewCoordination, which forwards it to coordinateReviewEvents - but that function explicitly ignores the parameter (prefixed with _). The accompanying comment references "approval flow" but approval was removed in this PR and replaced with dismissal. The actual "all issues resolved" check is correctly implemented separately in the dismissal logic at lines 392-420. This creates misleading documentation and wasted computation.

Additional Locations (2)

Fix in Cursor Fix in Web

dcramer added a commit that referenced this pull request Feb 9, 2026
Reimplement fix evaluation and duplicate detection judges as plain
functions without the generic council abstraction from #103. The fix
judge uses Haiku with tool_use to explore code changes and determine
whether reported issues were resolved, partially fixed, or unaddressed.

- Add shared Haiku utility (callHaiku, callHaikuWithTools, extractJson)
- Add fix-evaluation module with judge, GitHub helpers, and orchestrator
- Refactor dedup's findSemanticDuplicates to use callHaiku
- Add baseSha to PullRequestContext for fix evaluation range
- Integrate fix evaluation into PR workflow after review posting
- Add tests: extractJson pure tests, orchestrator behavioral tests,
  live judge integration tests (API-key gated)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant