Hi, thank you for open-sourcing this great work!
I am trying to reproduce the MATH500 and AIME experiments in KVComm. I read the paper and the repository, and I found that the appendix provides prompt designs for GSM8K, MMLU, and HumanEval. However, I could not find a separate prompt design for the MATH500 and AIME experiments reported in the appendix/table.
I would like to ask a few questions:
- For the MATH500 and AIME experiments, did you directly reuse the GSM8K-style `MathSolver` prompts?
- When using 2, 3, or 4 collaborating agents on MATH500/AIME, what were the exact agent roles?
  - For example, was it:
    - 2 agents: Math Solver + Mathematical Analyst
    - 3 agents: Math Solver + Mathematical Analyst + Programming Expert
    - 4 agents: Math Solver + Mathematical Analyst + Programming Expert + Inspector
  - plus a final `FinalRefer` agent for answer aggregation?
- Was the `Programming Expert` agent enabled for MATH500/AIME, and was its Python code actually executed during inference?
- Did you modify the final-answer format for AIME, e.g., requiring the final answer to be an integer from 0 to 999?
- Could you share the exact prompt/config/script used to run the MATH500 and AIME experiments, if available?
The reason I am asking is that I am trying to evaluate KV reuse on mathematical reasoning benchmarks such as MATH500 and AIME, and I want to make sure my agent configuration and answer extraction are consistent with the original experiments.
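
To make the answer-extraction question concrete, here is a minimal sketch of the kind of AIME extraction I have in mind. It is not taken from the KVComm repository: the function name `extract_aime_answer` and the assumption that the model writes its final answer as `\boxed{...}` are hypothetical and only for illustration.

```python
import re
from typing import Optional


def extract_aime_answer(text: str) -> Optional[int]:
    """Return the last boxed integer in `text` if it is in 0..999, else None.

    Assumption (not confirmed by the paper or repo): the final answer
    is written as \\boxed{<integer>} somewhere in the model output.
    """
    # One to three digits covers the 0..999 range, including
    # zero-padded forms such as 042.
    matches = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", text)
    if not matches:
        return None
    # Use the last boxed value, since earlier ones may be intermediate results.
    return int(matches[-1])
```

If the original experiments used a different convention (for example, a plain "The answer is ..." line), I would adapt the pattern accordingly.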
Thanks a lot!