
feat: Add judge evaluation support to agent graphs #142

Merged
jsonbailey merged 10 commits into main from jb/aic-2267/agent-graph-judge-support on Apr 28, 2026

Conversation

@jsonbailey (Contributor) commented Apr 24, 2026

Summary

  • Adds a new Evaluator class that coordinates per-node judge evaluation; evaluate() returns an asyncio.Task so evaluation fires immediately and is awaited before the graph run returns (see the sketch after this list)
  • AIAgentConfig (and AICompletionConfig) now carry a pre-built Evaluator as a kw_only dataclass field, constructed eagerly in client._build_evaluator()
  • LangGraph: LangGraphAgentGraphRunner stores per-node eval tasks in _pending_eval_tasks during node execution; LangChainCallbackHandler.flush() (now async) awaits them and calls track_judge_result via the same AIConfigTracker used for that node's LLM metrics
  • OpenAI: OpenAIAgentGraphRunner fires judge evaluation at handoff and final-segment points, tracked via the node's config tracker
  • Evaluator.noop() provides a null-object default so nodes without a judgeConfiguration require no special handling
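
A minimal sketch of the shape these bullets describe, with simplified signatures; the real class lives in packages/sdk/server-ai/src/ldai/evaluator.py, and the judge callables and reduced JudgeResult here are stand-ins rather than the SDK's actual types:

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass
class JudgeResult:
    # Reduced stand-in; the SDK's JudgeResult carries more fields than this.
    judge_key: str
    passed: bool


class Evaluator:
    """Coordinates zero or more judges for a single config or graph node."""

    def __init__(self, judges: List[Callable[[str], Awaitable[JudgeResult]]]):
        self._judges = judges

    def evaluate(self, output_text: str) -> asyncio.Task:
        # Returning a Task rather than a coroutine means evaluation starts as
        # soon as evaluate() is called; the caller keeps the reference and
        # awaits it later (e.g. in flush(), before the graph run returns).
        return asyncio.create_task(self._run_all(output_text))

    async def _run_all(self, output_text: str) -> List[JudgeResult]:
        return list(await asyncio.gather(*(judge(output_text) for judge in self._judges)))

    @classmethod
    def noop(cls) -> Evaluator:
        # Null-object default: a node without a judgeConfiguration gets an
        # evaluator whose task simply resolves to an empty list.
        return cls(judges=[])
```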

Test plan

  • All existing unit tests pass (make test — 248 tests across 3 packages)
  • Lint passes (make lint)
  • Manual e2e: run langgraph-multi-agent-example or chat-judge-example via hello-python-ai pointing at this worktree and verify judge events appear in the LD events stream

Closes AIC-2267

🤖 Generated with Claude Code


Note

Medium Risk
Introduces asynchronous judge-evaluation execution and wires results into both ManagedModel and agent-graph runners, changing result types and tracker flushing behavior. Risk is moderate due to new concurrency/task handling and API surface changes around create_tracker and evaluations fields.

Overview
Adds a new Evaluator abstraction and threads it through AICompletionConfig/AIAgentConfig so judge evaluations can be kicked off per invocation and tracked automatically.

Updates ManagedModel.invoke() to start evaluation via the config’s evaluator and attach a completion callback that emits track_judge_result, and changes ModelResponse.evaluations to carry an asyncio.Task (while AgentGraphResult now includes collected judge results).
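
In caller terms, that means judge results are awaited from the response rather than read directly. A hedged sketch in which the model argument, the invoke() call shape, and the surrounding names are illustrative, not the SDK's exact API:

```python
from typing import Any, List, Tuple


async def handle_request(model: Any, prompt: str) -> Tuple[Any, List[Any]]:
    # ModelResponse.evaluations now carries an asyncio.Task[List[JudgeResult]],
    # so the judge calls run concurrently with whatever the caller does next.
    response = await model.invoke(prompt)
    # ... caller-specific work can happen here while the judges run ...
    judge_results = await response.evaluations  # awaited before results are used
    return response, judge_results
```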

Extends LangGraph execution to schedule per-node evaluation tasks during node invocation, store them per-run using a ContextVar, and make LDMetricsCallbackHandler.flush() async so it can await tasks, track successful judge results per node, and return all results.

Refactors judge initialization in LDAIClient to build evaluators eagerly (including new default_ai_provider plumbing), removes async judge setup from create_model(), and tightens AgentGraphDefinition.create_tracker to be required; OpenAI agent-graph tracking is aligned to always use a graph tracker and now returns token usage in LDAIMetrics.

Reviewed by Cursor Bugbot for commit e2f5b93.

Adds per-node judge evaluation to agent graph execution. Each AIAgentConfig
now carries a pre-built Evaluator (mirroring AICompletionConfig) that the
provider-specific AgentGraphRunner invokes after each node's model response.
Results are tracked via the same AIConfigTracker used for that node's LLM
metrics, ensuring evaluation data is correlated correctly.

Key changes:
- New Evaluator class coordinating multiple judges; evaluate() returns an
  asyncio Task so evaluation fires immediately and is awaited in flush()
- AIAgentConfig and AICompletionConfig carry an eager evaluator (kw_only field)
- LangGraph runner stores per-node eval tasks in _pending_eval_tasks and
  flushes them via the callback handler's async flush() method
- OpenAI runner fires judge evaluation at handoff and final-segment points
- client._build_evaluator() handles empty/None judge config via Evaluator.noop()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
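
Roughly how a runner could tie those pieces together, shown as a sketch under assumed names (run_graph, node.invoke, and the evaluators/trackers dictionaries are my placeholders, not the actual runner code):

```python
import asyncio
from typing import Any, Dict, List


async def run_graph(nodes: List[Any], evaluators: Dict[str, Any], trackers: Dict[str, Any]) -> List[Any]:
    """Sketch of the per-node fire-then-flush flow; not the real runner."""
    pending: Dict[str, List[asyncio.Task]] = {}

    for node in nodes:
        output_text = await node.invoke()                     # hypothetical node execution
        task = evaluators[node.key].evaluate(output_text)     # judge evaluation starts immediately
        pending.setdefault(node.key, []).append(task)

    # Await every evaluation before the run returns, and report each result
    # through the same tracker that recorded that node's LLM metrics.
    all_results: List[Any] = []
    for node_key, tasks in pending.items():
        for task in tasks:
            for judge_result in await task:
                trackers[node_key].track_judge_result(judge_result)
                all_results.append(judge_result)
    return all_results
```
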
@jsonbailey marked this pull request as ready for review April 24, 2026 20:03
@jsonbailey requested a review from a team as a code owner April 24, 2026 20:03
@jsonbailey changed the title from "feat: add judge evaluation support to agent graphs" to "feat: Add judge evaluation support to agent graphs" on Apr 24, 2026

- Remove quotes from asyncio.Task return type in Evaluator.evaluate()
- Update ModelResponse.evaluations type to asyncio.Task[List[JudgeResult]]
- Forward default_ai_provider to __evaluate_agent in create_agent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_pending_eval_tasks was keyed by node key, so repeated visits (e.g. cycles
or tool loops) would silently overwrite earlier eval tasks. Changed to
Dict[str, List[Task]] with setdefault/append so all invocations are tracked.
flush() now iterates the full list per node.

Also wraps the long __evaluate_agent call in create_agent to satisfy E501.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
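
The keying change in miniature (illustrative only; remember() and the two dictionaries are placeholders, and task stands in for whatever evaluate() returned on that visit):

```python
import asyncio
from typing import Dict, List

pending_overwrite: Dict[str, asyncio.Task] = {}          # old shape: one task per node key
pending_all: Dict[str, List[asyncio.Task]] = {}          # new shape: every visit is kept


def remember(node_key: str, task: asyncio.Task) -> None:
    # Before: a second visit to the same node replaced the earlier task,
    # so its judge results were never awaited or tracked.
    pending_overwrite[node_key] = task
    # After: each visit appends, and flush() iterates the full list per node.
    pending_all.setdefault(node_key, []).append(task)
```
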
Replace asyncio.create_task fire-and-forget with proper task collection
and awaiting in both OpenAI and LangGraph runners, ensuring judge results
are tracked reliably. Use ContextVar in LangGraph runner to isolate
pending eval task state across concurrent run() calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
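
A sketch of the ContextVar isolation idea, under my own simplifications rather than the runner's actual code (the stand-in task is just a sleep, and run() resets the variable up front):

```python
import asyncio
from contextvars import ContextVar
from typing import Dict, List

# One mapping of node key -> pending eval tasks per run() invocation; because
# each concurrently scheduled run executes in its own context, runs never see
# each other's pending tasks.
_pending_eval_tasks: ContextVar[Dict[str, List[asyncio.Task]]] = ContextVar("pending_eval_tasks")


async def run(node_keys: List[str]) -> Dict[str, List[asyncio.Task]]:
    _pending_eval_tasks.set({})                          # fresh state for this run only
    for key in node_keys:
        task = asyncio.create_task(asyncio.sleep(0))     # stand-in for evaluator.evaluate()
        _pending_eval_tasks.get().setdefault(key, []).append(task)
    return _pending_eval_tasks.get()
```
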
…_ai_provider

- Remove if-tracker guards in both runners since create_tracker is
  always set on enabled graphs (disabled graphs are filtered before
  runner creation), also fixing token_usage NameError when tracker=None
- Forward variables through _build_evaluator to _initialize_judges so
  judge templates can interpolate user-provided variables
- Add default_ai_provider param to agent_graph() and forward it to
  __evaluate_agent so graph node evaluators use the correct provider;
  propagate from create_agent_graph() as well

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Redesign ManagedModel._track_judge_results to call evaluator.evaluate()
  internally and attach tracking via add_done_callback, returning the task
  so the reference is held by ModelResponse.evaluations — no GC risk (see
  the sketch below)
- Warn instead of silently dropping eval tasks when the LangGraph ContextVar
  is unexpectedly unset in a node's execution context
- Make AgentGraphDefinition.create_tracker a required parameter; all
  production and test call sites already supply it, and this matches the
  invariant that runners only execute on enabled (always-tracked) graphs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
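
A reduced sketch of the add_done_callback pattern the first bullet describes; the evaluator and tracker interfaces here are assumed, not the SDK's exact ones:

```python
import asyncio
from typing import Any


def track_judge_results(evaluator: Any, tracker: Any, output_text: str) -> asyncio.Task:
    # Start evaluation and attach tracking as a completion callback. The task
    # is returned so the caller (e.g. ModelResponse.evaluations) holds a live
    # reference and the task cannot be garbage-collected mid-flight.
    task = evaluator.evaluate(output_text)

    def _on_done(done: asyncio.Task) -> None:
        if done.cancelled() or done.exception() is not None:
            return                                   # nothing to track on failure
        for judge_result in done.result():
            tracker.track_judge_result(judge_result)

    task.add_done_callback(_on_done)
    return task
```
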
Both branches independently added evaluator/judge logic (this branch)
and root-level tools map support (main). Conflicts in _completion_config
and __evaluate_agent resolved by keeping both changes. Parameter order
swap for track_metrics_of_async auto-resolved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iew items

- Fix AgentGraphResult.evaluations type from Optional[List[Any]] to Optional[List[JudgeResult]]
- Populate evaluations in both LangGraph and OpenAI runners with all judge results
- Remove stray `if tracker:` guard in OpenAI _handle_handoff (tracker is always set)
- Add comment documenting why output_text is empty at handoff time in OpenAI runner
- flush() now returns List[JudgeResult] instead of None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit 2a15009.

jsonbailey and others added 2 commits April 27, 2026 17:46
- Add `from __future__ import annotations` to evaluator.py so the
  self-referential `-> Evaluator` return type does not need quoting
- Log a warning when a judge fails to initialize in _initialize_judges
  instead of silently swallowing the exception

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
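
Both changes in one small illustrative fragment; _build_judge and from_configs are placeholders rather than the SDK's real factory methods:

```python
from __future__ import annotations

import logging
from typing import Any, List

logger = logging.getLogger(__name__)


def _build_judge(config: Any) -> Any:
    # Placeholder for the SDK's real judge construction; invalid configs raise.
    if not isinstance(config, dict) or "key" not in config:
        raise ValueError(f"invalid judge config: {config!r}")
    return config["key"]


class Evaluator:
    def __init__(self, judges: List[Any]):
        self._judges = judges

    @classmethod
    def from_configs(cls, configs: List[Any]) -> Evaluator:  # no quotes needed with the future import
        judges: List[Any] = []
        for config in configs:
            try:
                judges.append(_build_judge(config))
            except Exception:
                # Warn instead of silently swallowing the initialization failure.
                logger.warning("Failed to initialize judge from config %r", config, exc_info=True)
        return cls(judges)
```
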
The OpenAI Agents SDK does not expose a node's text output at handoff
time, making it impossible to evaluate intermediate nodes against real
output. Rather than evaluating against an empty string, remove
evaluation support from the OpenAI runner entirely until the SDK
provides a suitable API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jsonbailey merged commit 3d5a6a9 into main on Apr 28, 2026
46 checks passed
@jsonbailey deleted the jb/aic-2267/agent-graph-judge-support branch April 28, 2026 13:40
@github-actions (Bot) mentioned this pull request Apr 28, 2026