fix: Replace done_callback with coroutine chain for judge tracking #147
jsonbailey wants to merge 1 commit into `main`
Conversation
`_track_judge_results` previously used `add_done_callback` to fire `track_judge_result()` after evaluation completed, but callbacks run outside the asyncio task scheduler and can execute at unpredictable times. Replace with a single `_run_and_track` coroutine wrapped in a new `asyncio.create_task`, so that awaiting `response.evaluations` guarantees both evaluation and tracker calls complete in sequence.

Add `test_managed_model.py` covering:
- `invoke()` returns before evaluations resolve
- awaiting evaluations collects results
- tracking fires inside the awaited chain (not before)
- failed judge results do not trigger tracking
- noop evaluator returns an empty list

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
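The pattern this commit describes can be sketched roughly as follows. The `JudgeResult` shape, `_evaluate`, and the module-level `tracked` list are simplified stand-ins for illustration, not the SDK's real API:

```python
import asyncio
from dataclasses import dataclass

# Simplified stand-ins for the real judge/tracker types (assumed names).
@dataclass
class JudgeResult:
    name: str
    success: bool

tracked: list[JudgeResult] = []

async def _evaluate() -> list[JudgeResult]:
    await asyncio.sleep(0)  # simulate judge latency
    return [JudgeResult("relevance", True), JudgeResult("toxicity", False)]

async def _run_and_track() -> list[JudgeResult]:
    # Tracking happens inside the same awaited chain as evaluation,
    # not in a detached done-callback.
    results = await _evaluate()
    for r in results:
        if r.success:
            tracked.append(r)  # stands in for tracker.track_judge_result(r)
    return results

async def main() -> list[JudgeResult]:
    # invoke() would return immediately after scheduling this task;
    # awaiting it later guarantees evaluation *and* tracking completed.
    evaluations = asyncio.create_task(_run_and_track())
    return await evaluations

results = asyncio.run(main())
```

Because tracking is part of the awaited coroutine rather than a done-callback, anything that awaits the task observes a consistent "evaluated and tracked" state.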
Force-pushed from a2402ea to 381cf75
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 381cf75.
```python
for r in results:
    if r.success:
        tracker.track_judge_result(r)
return results
```
Missing exception handling loses evaluation results on tracking failure
Medium Severity
The `_run_and_track` coroutine has no try/except around the `tracker.track_judge_result(r)` call. If tracking raises (e.g., the LD client is shut down or `track()` fails internally), the exception propagates through the wrapper task, and callers awaiting `response.evaluations` receive an exception instead of the `List[JudgeResult]`. The old `add_done_callback` approach was inherently resilient to this: asyncio catches callback exceptions and logs them without affecting the task's result. The new inline approach loses that isolation, meaning a tracking failure now destroys evaluation results that were successfully computed.
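A minimal sketch of the isolation Bugbot is asking for follows; the parameter names, logger setup, and demo fakes are illustrative assumptions, not the actual fix:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def _run_and_track(evaluate, track_judge_result):
    """Run evaluation, then track, isolating tracking failures."""
    results = await evaluate()
    for r in results:
        if getattr(r, "success", False):
            try:
                track_judge_result(r)
            except Exception:
                # Mirror the old done-callback behavior: log the failure
                # instead of letting it destroy the evaluation results.
                logger.exception("track_judge_result failed; keeping results")
    return results

# Demo: a tracker that always raises must not lose the results.
class _Result:
    success = True

async def _demo():
    async def evaluate():
        return [_Result()]

    def failing_tracker(r):
        raise RuntimeError("LD client shut down")

    return await _run_and_track(evaluate, failing_tracker)

demo_results = asyncio.run(_demo())
```

With the try/except in place, a broken tracker degrades to a logged warning while callers still receive the computed results.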


Summary
- Replace `add_done_callback` in `ManagedModel._track_judge_results` with a proper `_run_and_track` coroutine wrapped in `asyncio.create_task`, so that awaiting `response.evaluations` guarantees both evaluation and `tracker.track_judge_result()` calls complete in sequence
- Add `test_managed_model.py` with 5 tests covering: return-before-resolve, collecting results, tracking fires inside the awaited chain, failed results skip tracking, noop evaluator returns an empty list

Test plan
- `test_managed_model.py` tests added and passing in `hello-python-aiagent`

🤖 Generated with Claude Code
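One of the listed tests could look roughly like this; the fakes and test name are hypothetical, not the actual suite:

```python
import asyncio

class FakeResult:
    def __init__(self, success: bool):
        self.success = success

def test_failed_results_skip_tracking() -> int:
    """Failed judge results must not be tracked; returns tracked count."""
    tracked = []

    async def evaluate():
        return [FakeResult(False), FakeResult(True)]

    async def run_and_track():
        results = await evaluate()
        for r in results:
            if r.success:
                tracked.append(r)  # stands in for tracker.track_judge_result
        return results

    results = asyncio.run(run_and_track())
    assert len(results) == 2  # evaluation results are never dropped
    return len(tracked)

tracked_count = test_failed_results_skip_tracking()
```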
Note
Medium Risk
Changes async scheduling for judge evaluations/tracking, which can affect ordering and side effects (metrics emission) during `ManagedModel.invoke` and when consumers await `response.evaluations`. Covered by new tests, but still touches concurrency behavior.

Overview
Fixes judge-result tracking determinism.
`ManagedModel._track_judge_results` now wraps evaluator execution in an awaited coroutine chain and returns a new `asyncio` task, so awaiting `response.evaluations` reliably includes both evaluation completion and `tracker.track_judge_result()` side effects (instead of relying on an `add_done_callback`).

Adds a focused `test_managed_model.py` suite validating that `invoke()` returns before evaluations finish, that awaiting `evaluations` yields results, and that tracking fires only for successful judge results (including a noop-evaluator case).