feat(council): Refactor GitHub Workflow by dcramer · Pull Request #103 · getsentry/warden

dcramer · 2026-02-04T23:51:35Z

Introduces the Council system for making quick, focused LLM judgments, and refactors how Warden manages findings.

Council of Warden

Council members are lightweight evaluators that answer specific classification questions using Claude Haiku.

Current members:

fixJudge - Evaluates whether code changes addressed a reported issue (not_attempted / attempted_failed / resolved)
duplicateJudge - Identifies semantic duplicates between new findings and existing PR comments

Fix evaluation with tool-driven exploration:
When a follow-up commit is pushed after Warden posts comments, the system:

Evaluates all unresolved Warden comments (up to 20 per push)
Gives the fix-judge tools to explore: get_file_diff, get_file_at_commit
The judge examines changes to determine if the issue was fixed
Cross-checks against re-detection (did we find the same issue again?)
Auto-resolves successful fixes or posts feedback explaining why the fix didn't work

The tool-based approach lets the judge find fixes regardless of where they occur - at the original location, in a helper function, or in a related file.

Usage tracking:
All council calls track token usage and cost, aggregated across tool iterations. This enables monitoring LLM costs at the session level.

Review State Refactor: DISMISS instead of APPROVE

Previously, when all issues were resolved, Warden posted an APPROVE review to clear CHANGES_REQUESTED. This was semantically incorrect - APPROVE means "I vouch for this PR" rather than "my concerns are addressed."

Now Warden uses GitHub's dismissReview API instead:

Removes the CHANGES_REQUESTED block without approving
More accurately reflects "my concerns are addressed"
Requires no new permissions (can dismiss your own reviews with pull_requests: write)

This change significantly simplifies the codebase by removing the complex approval coordination logic that prevented conflicting APPROVE reviews across triggers (-503 lines net).

Design philosophy

Council members handle focused judgments (single question → structured verdict)
Skill agents (SDK) handle complex multi-step detection
Caller code handles side effects (posting comments, resolving threads)

See specs/council-of-warden.md for design details and specs/fix-evaluation.md for the fix evaluation flow.

Introduces the Council system for making quick, focused LLM judgments. Council members are lightweight evaluators that answer specific classification questions using Claude Haiku. Current members: - fixJudge: Evaluates whether a patch addressed a reported issue - duplicateJudge: Identifies semantic duplicates between findings Also adds fix evaluation pipeline that detects when follow-up commits attempt to fix Warden's reported issues, automatically resolving successful fixes and posting feedback on failed attempts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vercel · 2026-02-04T23:51:39Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
warden	Ready	Preview, Comment	Feb 5, 2026 7:37am

src/council/json.ts

src/output/fix-evaluation/patch-analysis.ts

Addresses critical test gaps identified in review: - PR workflow integration tests for fix evaluation - Verifies evaluateFixAttempts called on synchronize events - Tests conditional skipping (no API key, no comments, non-sync events) - Tests reply posting for failed fixes - Tests graceful error handling - Direct tests for evaluateFix() with mocked Anthropic - Tests all three verdict statuses (resolved, attempted_failed, not_attempted) - Tests fallback behavior on API errors and invalid responses - New synchronize event payload fixture Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/action/workflow/pr-workflow.ts

src/output/fix-evaluation/orchestrator.ts

src/council/convene.ts

Stale detection was incorrectly resolving comments when findings weren't re-detected, even if the underlying bug was never fixed. Now comments are only resolved when the file is orphaned (no longer in PR scope). Resolution based on findings is handled by fix evaluation, which uses an LLM to verify that a patch actually addressed the issue. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/council/json.ts

- json.ts: Use extractBalancedJson instead of greedy regex to correctly handle nested JSON and avoid over-matching when text has trailing braces - patch-analysis.ts: Handle pure deletion hunks (newCount=0) without producing inverted ranges where end < start - pr-workflow.ts: Only skip stale detection for comments that were successfully resolved, allowing retry for partial failures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/output/fix-evaluation/patch-analysis.ts

src/output/stale.ts

Extends the council member system to support tool use, allowing members like fix-judge to fetch additional context when evaluating fixes. Changes: - Add CouncilTool and ToolContext types to council/types.ts - Add tool execution loop to convene() with max iterations - Create tools.ts helper for schema conversion and tool execution - Update fix-judge to use before/after code snippets instead of patches - Add getFileAtCommit tool for fetching file content at specific commits - Add fetchFileContent/fetchFileLines helpers to github-actions.ts - Update orchestrator to fetch code snippets and pass tool context Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/council/convene.ts

- Fix maxToolIterations loop condition from <= to < so it allows exactly maxToolIterations tool calls instead of maxToolIterations + 1 - Update renderSuggestion to show both removed and added lines in a diff block for better context, in addition to the GitHub suggestion block Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add JSDoc comment explaining the GitHub suggestion block format with a link to the official documentation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/output/fix-evaluation/orchestrator.ts

The fix-judge was treating line numbers as the source of truth, causing it to return "not_attempted" when the reported line was off by one even though the actual fix was visible in the context. Now the prompt: - Clarifies the job is to evaluate if the ISSUE DESCRIBED was fixed - Labels location as "approximate" and "near line X" - Reframes "not_attempted" as "no changes related to this issue" Refs #105 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

GitHub App installation tokens return 403 on GET /user. Use GET /app instead to get the app slug, then construct the bot login as "slug[bot]". Falls back to GET /user for PATs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The check was dead code since commentsToEvaluate is already filtered to only include comments present in relevantPatches. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/council/members/fix-judge.ts

Update the fix-judge prompt to explicitly recognize that removing problematic code is a valid way to resolve an issue. This handles cases like deprecated code cleanup, dead code removal, and intentional feature deletion. The key change reframes evaluation from "was the suggested fix applied?" to "does the issue still exist?" which is the more accurate question. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/output/fix-evaluation/orchestrator.ts

src/output/fix-evaluation/index.ts

Previously, fix evaluation compared previousHeadSha (the commit before this push) to currentSha, which only showed changes in the latest commit. This meant fixes made in earlier commits weren't detected. Now we compare baseSha (the PR's base branch commit) to currentSha, showing all changes in the PR. This ensures we can detect fixes made in any commit, not just the most recent one. Added baseSha to PullRequestContext and updated all contexts/tests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

sentry-warden

All previously reported issues have been resolved.

- Auto-infer prefill from schema type ('{' for objects, '[' for arrays) - Increase default max tokens from 512 to 4096 - Restructure fix-judge prompt with verdict definitions first - Add debug mode with tool call metadata tracking - Track failedEvaluations count for fallback monitoring - Increase context lines from 10 to 20, add optional codeAfterFix - Add integration tests for fix-judge with real API fixtures Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The description in renderSuggestion's deletion-only fallback was not being escaped, inconsistent with other LLM-generated content rendering. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/output/fix-evaluation/orchestrator.ts

- Stop exporting FixEvaluationResult and FixEvaluationContext (unused externally) - Rename to EvaluateFixAttemptsResult/Context to match function name - Rename single evaluation type to EvaluateFixResult to match evaluateFix() Each function's result type now matches its name: - evaluateFix() → EvaluateFixResult - evaluateFixAttempts() → EvaluateFixAttemptsResult Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Two fixes for review state bugs: 1. Premature APPROVE: Previously, APPROVE was sent when no blocking findings existed in the current run, without checking if unresolved Warden comments remained from previous runs. Now coordination checks for unresolved comments and suppresses APPROVE accordingly. 2. Spurious "fix attempt" replies: Made fix-judge more conservative about returning `attempted_failed`. Now requires changes to directly modify the reported file near the issue location with clear intent. Also added commit messages to the judge prompt to help understand developer intent. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace APPROVE reviews with GitHub's dismissReview API for clearing CHANGES_REQUESTED state. This is semantically more correct - APPROVE means "I vouch for this PR" while DISMISS means "my concerns are addressed." Changes: - Add dismissPreviousReview() to dismiss CHANGES_REQUESTED when all issues are resolved - Capture review ID from listReviews API (already being fetched) - Remove APPROVE from renderer - only REQUEST_CHANGES or COMMENT - Remove previousReviewState from rendering pipeline (no longer needed) - Simplify coordination logic since APPROVE suppression is obsolete - Update specs to document DISMISS behavior This reduces code complexity significantly (-503 lines net) by removing the approval coordination logic that was needed to prevent conflicting APPROVE reviews across triggers. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/action/workflow/pr-workflow.ts

Remove tests that verify structure/config rather than behavior: - members.test.ts: Remove metadata checks, registry structure tests - github-actions.test.ts: Delete entirely (string formatting covered by orchestrator tests) Keep behavioral tests that prove functionality works. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously, when resolving stale comments, all comments were added to staleCommentsResolved if resolvedCount > 0. This was incorrect when some resolutions failed (e.g., 2 of 3 due to API errors), as it would mark failed comments as resolved and potentially dismiss reviews incorrectly. Now only track comments as resolved when resolvedCount equals the total stale comment count, ensuring we don't accidentally skip unresolved issues during review dismissal. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

cursor · 2026-02-05T07:45:40Z

src/action/workflow/pr-workflow.ts

+  const unresolvedWardenComments = fetchedComments.filter(
+    (c) => c.isWarden && !c.isResolved
+  );
+  const reviewCoordination = buildReviewCoordination(results, unresolvedWardenComments);


Dead code: unresolvedWardenComments computed but never used

Low Severity

The unresolvedWardenComments variable is computed by filtering fetchedComments, passed to buildReviewCoordination, which forwards it to coordinateReviewEvents - but that function explicitly ignores the parameter (prefixed with _). The accompanying comment references "approval flow" but approval was removed in this PR and replaced with dismissal. The actual "all issues resolved" check is correctly implemented separately in the dismissal logic at lines 392-420. This creates misleading documentation and wasted computation.

Additional Locations (2)

src/action/review/coordination.ts#L46-L62

src/action/review-state.ts#L56-L67

Reimplement fix evaluation and duplicate detection judges as plain functions without the generic council abstraction from #103. The fix judge uses Haiku with tool_use to explore code changes and determine whether reported issues were resolved, partially fixed, or unaddressed. - Add shared Haiku utility (callHaiku, callHaikuWithTools, extractJson) - Add fix-evaluation module with judge, GitHub helpers, and orchestrator - Refactor dedup's findSemanticDuplicates to use callHaiku - Add baseSha to PullRequestContext for fix evaluation range - Integrate fix evaluation into PR workflow after review posting - Add tests: extractJson pure tests, orchestrator behavioral tests, live judge integration tests (API-key gated) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

vercel bot deployed to Preview February 4, 2026 23:51 View deployment

sentry-warden bot reviewed Feb 4, 2026

View reviewed changes

src/council/json.ts Outdated Show resolved Hide resolved

src/output/fix-evaluation/patch-analysis.ts Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 00:00 View deployment

sentry-warden bot reviewed Feb 5, 2026

View reviewed changes

src/action/workflow/pr-workflow.ts Outdated Show resolved Hide resolved

cursor bot reviewed Feb 5, 2026

View reviewed changes

src/output/fix-evaluation/orchestrator.ts Show resolved Hide resolved

src/council/convene.ts Outdated Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 00:37 View deployment

cursor bot reviewed Feb 5, 2026

View reviewed changes

src/council/json.ts Outdated Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 00:49 View deployment

cursor bot reviewed Feb 5, 2026

View reviewed changes

src/output/fix-evaluation/patch-analysis.ts Show resolved Hide resolved

src/output/stale.ts Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 01:45 View deployment

sentry-warden bot reviewed Feb 5, 2026

View reviewed changes

src/council/convene.ts Show resolved Hide resolved

cursor bot reviewed Feb 5, 2026

View reviewed changes

src/council/convene.ts Outdated Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 02:17 View deployment

docs: Add GitHub suggestion syntax documentation

a43a76b

Add JSDoc comment explaining the GitHub suggestion block format with a link to the official documentation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vercel bot deployed to Preview February 5, 2026 02:18 View deployment

dcramer mentioned this pull request Feb 5, 2026

Improve CI logging and investigate line number accuracy #105

Closed

cursor bot reviewed Feb 5, 2026

View reviewed changes

src/output/fix-evaluation/orchestrator.ts Outdated Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 02:37 View deployment

vercel bot deployed to Preview February 5, 2026 02:47 View deployment

fix: Remove redundant relevantPatches check in fix evaluation loop

e5d1d54

The check was dead code since commentsToEvaluate is already filtered to only include comments present in relevantPatches. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vercel bot deployed to Preview February 5, 2026 02:52 View deployment

cursor bot reviewed Feb 5, 2026

View reviewed changes

src/council/members/fix-judge.ts Outdated Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 05:36 View deployment

dcramer force-pushed the feat/council-of-warden branch from 4d6eefb to e97f1a4 Compare February 5, 2026 05:39

vercel bot deployed to Preview February 5, 2026 05:39 View deployment

cursor bot reviewed Feb 5, 2026

View reviewed changes

src/output/fix-evaluation/orchestrator.ts Show resolved Hide resolved

sentry-warden bot requested changes Feb 5, 2026

View reviewed changes

src/output/fix-evaluation/index.ts Outdated Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 05:55 View deployment

sentry-warden bot approved these changes Feb 5, 2026

View reviewed changes

vercel bot deployed to Preview February 5, 2026 06:32 View deployment

fix(renderer): Escape HTML in suggestion fallback case

36b1384

The description in renderSuggestion's deletion-only fallback was not being escaped, inconsistent with other LLM-generated content rendering. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vercel bot deployed to Preview February 5, 2026 06:37 View deployment

sentry bot reviewed Feb 5, 2026

View reviewed changes

src/output/fix-evaluation/orchestrator.ts Show resolved Hide resolved

vercel bot deployed to Preview February 5, 2026 06:42 View deployment

vercel bot deployed to Preview February 5, 2026 07:03 View deployment

vercel bot deployed to Preview February 5, 2026 07:21 View deployment

dcramer changed the title ~~feat(council): Add Council of Warden for LLM-powered judgments~~ feat(council): Add Council of Warden + refactor review state management Feb 5, 2026

dcramer changed the title ~~feat(council): Add Council of Warden + refactor review state management~~ feat(council): Refactor GitHub Workflow Feb 5, 2026

cursor bot reviewed Feb 5, 2026

View reviewed changes

src/action/workflow/pr-workflow.ts Show resolved Hide resolved

dcramer and others added 2 commits February 4, 2026 23:36

vercel bot deployed to Preview February 5, 2026 07:37 View deployment

cursor bot reviewed Feb 5, 2026

View reviewed changes

dcramer mentioned this pull request Feb 9, 2026

feat(action): Add fix evaluation judges for unresolved comments #122

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(council): Refactor GitHub Workflow#103

feat(council): Refactor GitHub Workflow#103
dcramer wants to merge 26 commits intomainfrom
feat/council-of-warden

dcramer commented Feb 4, 2026 •

edited

Loading

Uh oh!

vercel bot commented Feb 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sentry-warden bot left a comment

Uh oh!

Uh oh!

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

dcramer commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Council of Warden

Review State Refactor: DISMISS instead of APPROVE

Design philosophy

Uh oh!

vercel bot commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sentry-warden bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Feb 5, 2026

Choose a reason for hiding this comment

Dead code: unresolvedWardenComments computed but never used

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dcramer commented Feb 4, 2026 •

edited

Loading

vercel bot commented Feb 4, 2026 •

edited

Loading