Problem
Two eval tasks use tool_calls_min expectations that penalize models for solving tasks efficiently, in fewer bash calls than the eval anticipates:
error_graceful_parse — expects tool_calls_min:2, Codex solves in 1
The task asks to fix broken JSON and print a field. GPT-5.3-Codex writes a single script that fixes the JSON and parses it in one invocation. The eval expects ≥2 tool calls (presumably: one to inspect, one to fix+parse), but there's no reason a model shouldn't solve this in one call.
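For illustration, a single bash invocation can plausibly do both steps. This is a hypothetical sketch, not the actual eval fixture or Codex transcript: the file name (broken.json), the field name (status), and the specific defect (a trailing comma) are all assumptions.

```shell
# Hypothetical stand-in for the eval's broken JSON: a trailing comma.
printf '{"status": "ok", "count": 3,}' > broken.json

# One script, one tool call: repair the JSON, then parse and print a field.
python3 - <<'EOF'
import json, re
raw = open("broken.json").read()
fixed = re.sub(r",\s*([}\]])", r"\1", raw)  # strip trailing commas before } or ]
open("broken.json", "w").write(fixed)       # persist the repaired file
print(json.loads(fixed)["status"])          # prints: ok
EOF
```

A "fix then parse" workflow does not inherently require two invocations; the inspect/fix/parse steps collapse naturally into one script.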
complex_todo_app — expects tool_calls_min:3, Codex solves in 1
The task asks to build a CLI TODO app and demonstrate it. Codex creates the script and runs the demo in a single bash invocation. The eval expects ≥3 calls, but a single well-structured script accomplishes everything.
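Again for illustration, here is a minimal sketch of a build-and-demo in one bash call. The CLI name (todo.sh), its subcommands (add/list), and the storage file are assumptions, not the eval's actual requirements or Codex's actual output:

```shell
# Hypothetical one-call solution: write the CLI, then demonstrate it.
cat > todo.sh <<'EOF'
#!/bin/sh
# Minimal file-backed TODO CLI: "add <text>" appends an item, "list" prints them numbered.
DB="${TODO_DB:-todos.txt}"
case "$1" in
  add)  shift; echo "$*" >> "$DB" ;;
  list) [ -f "$DB" ] && nl -ba "$DB" ;;
  *)    echo "usage: todo.sh {add <text>|list}" >&2; exit 1 ;;
esac
EOF
chmod +x todo.sh

# Demo in the same invocation: add two items, then list them.
./todo.sh add "write eval report"
./todo.sh add "file this issue"
./todo.sh list
```

Creating the script and exercising it are sequential shell commands, so nothing about the task forces three separate tool calls.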
Impact
- GPT-5.3-Codex fails both tasks solely due to tool_calls_min, scoring correctly on all other checks
- Anthropic models naturally use more turns and pass, but that doesn't mean using fewer turns is wrong
Proposal
Remove tool_calls_min from both tasks, or reduce them to tool_calls_min:1 (just verify the model actually calls bash). The eval should test outcomes, not prescribe how many calls a model makes.
Eval Reference
GPT-5.3-Codex eval run (2026-02-27): crates/bashkit-eval/results/eval-openai-responses-gpt-5.3-codex-2026-02-27-055543.md