eval: tool_calls_min expectations penalize efficient models #367

@chaliy

Description

Problem

Two eval tasks use tool_calls_min expectations that penalize models for solving the task efficiently, in fewer bash calls than the expectation assumes:

error_graceful_parse — expects tool_calls_min:2, Codex solves in 1

The task asks to fix broken JSON and print a field. GPT-5.3-Codex writes a single script that fixes the JSON and parses it in one invocation. The eval expects ≥2 tool calls (presumably: one to inspect, one to fix+parse), but there's no reason a model shouldn't solve this in one call.
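Both steps fit comfortably in one bash invocation. A minimal sketch of what such a single call could look like; the file name data.json, the trailing-comma breakage, and the status field are all invented here for illustration, not taken from the actual task:

```shell
# Invented fixture: a JSON file broken by a trailing comma.
printf '{"status": "ok", "count": 3,}' > data.json

# One tool call: repair the JSON and print the field in the same invocation.
python3 - <<'PY'
import json, re

raw = open("data.json").read()
fixed = re.sub(r",\s*([}\]])", r"\1", raw)  # drop trailing commas before } or ]
print(json.loads(fixed)["status"])
PY
```

Inspecting, fixing, and parsing here are sequential lines of one script, so counting tool calls says nothing about whether the model understood the breakage.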

complex_todo_app — expects tool_calls_min:3, Codex solves in 1

The task asks to build a CLI TODO app and demonstrate it. Codex creates the script and runs the demo in a single bash invocation. The eval expects ≥3 calls, but a single well-structured script accomplishes everything.

Impact

  • GPT-5.3-Codex fails both tasks only due to tool_calls_min, scoring correctly on all other checks
  • Anthropic models naturally use more turns and pass, but that doesn't make fewer turns wrong

Proposal

Remove tool_calls_min from both tasks, or reduce them to tool_calls_min:1 (just verify the model actually calls bash). The eval should test outcomes, not prescribe how many calls a model makes.

Eval Reference

GPT-5.3-Codex eval run (2026-02-27): crates/bashkit-eval/results/eval-openai-responses-gpt-5.3-codex-2026-02-27-055543.md

Metadata


Labels

enhancement (New feature or request)
