Commit ab60b30
Add IMO-Bench evaluation scripts for answers and proofs
Introduces two new scripts: eval_imobench_answer.py for evaluating short-answer mathematical problems from the AnswerBench dataset, and eval_imobench_proof.py for evaluating rigorous proof problems from the ProofBench dataset. Both scripts support model evaluation, result saving, and detailed performance analysis.
1 parent dba3950
2 files changed, +1007 -0 lines changed