Problem
Two eval tasks use tool_calls_min expectations that penalize models for solving tasks efficiently, in fewer bash calls than the eval anticipates:
error_graceful_parse — expects tool_calls_min:2, Codex solves in 1
The task asks to fix broken JSON and print a field. GPT-5.3-Codex writes a single script that fixes the JSON and parses it in one invocation. The eval expects ≥2 tool calls (presumably: one to inspect, one to fix+parse), but there's no reason a model shouldn't solve this in one call.
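For illustration, a single bash invocation can plausibly do both steps. This is a hypothetical sketch, not the actual eval fixture or Codex transcript: the file name (broken.json), the field name (status), and the specific defect (a trailing comma) are all assumptions.

```shell
# Hypothetical stand-in for the eval's broken JSON: a trailing comma.
printf '{"status": "ok", "count": 3,}' > broken.json

# One script, one tool call: repair the JSON, then parse and print a field.
python3 - <<'EOF'
import json, re
raw = open("broken.json").read()
fixed = re.sub(r",\s*([}\]])", r"\1", raw)  # strip trailing commas before } or ]
open("broken.json", "w").write(fixed)       # persist the repaired file
print(json.loads(fixed)["status"])          # prints: ok
EOF
```

A "fix then parse" workflow does not inherently require two invocations; the inspect/fix/parse steps collapse naturally into one script.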
complex_todo_app — expects tool_calls_min:3, Codex solves in 1
The task asks to build a CLI TODO app and demonstrate it. Codex creates the script and runs the demo in a single bash invocation. The eval expects ≥3 calls, but a single well-structured script accomplishes everything.
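Again for illustration, here is a minimal sketch of a build-and-demo in one bash call. The CLI name (todo.sh), its subcommands (add/list), and the storage file are assumptions, not the eval's actual requirements or Codex's actual output:

```shell
# Hypothetical one-call solution: write the CLI, then demonstrate it.
cat > todo.sh <<'EOF'
#!/bin/sh
# Minimal file-backed TODO CLI: "add <text>" appends an item, "list" prints them numbered.
DB="${TODO_DB:-todos.txt}"
case "$1" in
  add)  shift; echo "$*" >> "$DB" ;;
  list) [ -f "$DB" ] && nl -ba "$DB" ;;
  *)    echo "usage: todo.sh {add <text>|list}" >&2; exit 1 ;;
esac
EOF
chmod +x todo.sh

# Demo in the same invocation: add two items, then list them.
./todo.sh add "write eval report"
./todo.sh add "file this issue"
./todo.sh list
```

Creating the script and exercising it are sequential shell commands, so nothing about the task forces three separate tool calls.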
Impact
- GPT-5.3-Codex fails both tasks solely due to tool_calls_min, scoring correctly on all other checks
- Anthropic models naturally use more turns and pass, but that doesn't mean using fewer turns is wrong
Proposal
Remove tool_calls_min from both tasks, or reduce them to tool_calls_min:1 (just verify the model actually calls bash). The eval should test outcomes, not prescribe how many calls a model makes.
Eval Reference
GPT-5.3-Codex eval run (2026-02-27): crates/bashkit-eval/results/eval-openai-responses-gpt-5.3-codex-2026-02-27-055543.md