Add comprehensive cost reporting system for benchmark evaluations #126
Conversation
…iles

This script processes JSONL evaluation output files and calculates:
- Individual costs for each file (using accumulated_cost from metrics)
- Aggregated costs for critic files (excluding main output.jsonl)
- Summary reporting with detailed breakdown

Usage: python benchmarks/utils/report_costs.py <directory_path>

Co-authored-by: openhands <openhands@all-hands.dev>
Previously the script only took the accumulated cost from the last line. Now it correctly sums the accumulated costs from each line in the JSONL file. This provides the total cost across all evaluation instances rather than just the final accumulated cost.

Co-authored-by: openhands <openhands@all-hands.dev>
- Change 'Main Output File:' to 'Selected instance in Main output.jsonl only:'
- Remove 'Total Individual Files Cost' line from summary
- Rename 'Total Critic Files Only' to 'Sum Critic Files'

Co-authored-by: openhands <openhands@all-hands.dev>
- Added cost_report.jsonl output saved in same directory as input files
- Includes structured data: directory, timestamp, main_output, critic_files, summary
- Updated docstring and help text to document JSON output feature
- Added datetime import for timestamp generation
- Enhanced error handling for JSON file writing
…and set total_cost to equal sum_critic_files

- Renamed main_output_cost to only_main_output_cost to clarify it's just the subset cost
- Changed total_cost to equal sum_critic_files since main output.jsonl is a subset of critic outputs
- Updated logic for output-only case to set total_cost to 0 (no critic files means no total cost)
- Maintains backward compatibility for critic-only scenarios
- Added generate_cost_report function to benchmarks/utils/report_costs.py for reuse across different benchmarks
- Updated eval_infer.py to import generate_cost_report from shared module
- Removed duplicate local function definition from eval_infer.py
- Function takes input_file path, extracts directory, and calls calculate_costs with error handling
- Enables cost reporting functionality to be easily integrated into other benchmark evaluation scripts
- Fixed type error: convert Path to str when calling calculate_costs

Co-authored-by: openhands <openhands@all-hands.dev>
…mj/cost-reporting-integration
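The commit above that adds cost_report.jsonl lists its fields (directory, timestamp, main_output, critic_files, summary). Below is a minimal sketch of how such a record could be assembled and written; the function name and exact layout are illustrative assumptions, not the script's actual code:

```python
import json
from datetime import datetime, timezone


def write_cost_report(directory: str, main_output: dict, critic_files: list, summary: dict) -> None:
    # Illustrative only: builds a record with the fields named in the commit message
    # (directory, timestamp, main_output, critic_files, summary) and appends it to
    # cost_report.jsonl in the same directory as the input files.
    record = {
        "directory": directory,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "main_output": main_output,
        "critic_files": critic_files,
        "summary": summary,
    }
    try:
        with open(f"{directory}/cost_report.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")
    except OSError as exc:
        # The commit notes "enhanced error handling for JSON file writing"; here we just log.
        print(f"Failed to write cost report: {exc}")
```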
```python
parser = argparse.ArgumentParser(
    description="Calculate costs from JSONL evaluation output files and save detailed report",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog="""
```
Why use `epilog` instead of adding this text to the `description`?
It's so the examples go after the params, not before; see https://www.blog.pythonlibrary.org/2015/10/08/a-intro-to-argparse/#:~:text=Getting%20Started,the%20command%2Dline%20after%20all. In this case both would work, though.
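For reference, a minimal illustration of the behavior being discussed: text passed as `epilog` is printed after the argument listing in `--help` output, while `description` appears before it. The `directory_path` argument name is taken from the script's documented usage; the rest is illustrative.

```python
import argparse

# Minimal illustration: the epilog text appears after the options in --help output,
# while the description text appears before them. RawDescriptionHelpFormatter keeps
# both strings unwrapped.
parser = argparse.ArgumentParser(
    description="Calculate costs from JSONL evaluation output files and save detailed report",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog="Example:\n  python benchmarks/utils/report_costs.py <directory_path>\n",
)
parser.add_argument("directory_path", help="directory containing JSONL evaluation outputs")
parser.print_help()  # prints description, then the arguments, then the epilog
```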
Summary
This PR introduces cost reporting for benchmark evaluations, providing both standalone script functionality and integrated cost reporting within the evaluation pipeline.
Key Features
1. Standalone Cost Reporting Script (`benchmarks/utils/report_costs.py`)
   - Sums `accumulated_cost` from all lines in JSONL files, not just the last line (see the sketch after this feature list)
   - Handles the main `output.jsonl` and critic attempt files (`output.critic_attempt_*.jsonl`)
   - Accepts a directory or an `output.jsonl` file path as input
   - Saves a structured report to `cost_report.jsonl` in the same directory

2. Integrated Cost Reporting in Evaluation Pipeline
   - `eval_infer.py` now imports the `generate_cost_report` function from the utils module
   - The `generate_cost_report` function is available for other benchmark scripts

3. Accurate Cost Calculations
   - `only_main_output_cost`: Cost from the main `output.jsonl` file
   - `sum_critic_files`: Total cost from all critic attempt files
   - `total_cost`: Set to `sum_critic_files` (since main output is a subset of critic outputs)
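For concreteness, here is a minimal sketch of these calculations, assuming each JSONL record stores its cost under `metrics.accumulated_cost` as described in the commits above. The helper names are illustrative; the script's actual entry points are `calculate_costs` and `generate_cost_report`.

```python
import glob
import json
import os


def sum_accumulated_cost(path: str) -> float:
    """Sum accumulated_cost across every line of a JSONL file, not just the last line."""
    total = 0.0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumption: each record stores its cost under metrics.accumulated_cost.
            metrics = record.get("metrics") or {}
            total += float(metrics.get("accumulated_cost") or 0.0)
    return total


def summarize(directory: str) -> dict:
    only_main_output_cost = sum_accumulated_cost(os.path.join(directory, "output.jsonl"))
    critic_files = sorted(glob.glob(os.path.join(directory, "output.critic_attempt_*.jsonl")))
    sum_critic_files = sum(sum_accumulated_cost(p) for p in critic_files)
    return {
        "only_main_output_cost": only_main_output_cost,
        "sum_critic_files": sum_critic_files,
        # Main output is a subset of the critic outputs, so total_cost mirrors sum_critic_files
        # (and is 0 when there are no critic files, per the output-only case).
        "total_cost": sum_critic_files,
    }
```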
Testing Results

Extensive testing was performed on SWE-bench evaluation outputs.
Usage Examples
Standalone Script
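The invocation documented in the script's usage string; it prints the per-file and summary breakdown and writes `cost_report.jsonl` next to the input files:

```bash
python benchmarks/utils/report_costs.py <directory_path>
```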
Integrated in Evaluation
Cost reporting now runs automatically as the final step when running evaluations.
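A rough sketch of how that final step could look in `eval_infer.py`, based on the commit notes (import from the shared utils module, convert the output path to `str`); the exact call site is not shown on this page, so treat the surrounding names and the import path as placeholders.

```python
# Hypothetical placement inside benchmarks/swe_bench/eval_infer.py.
from pathlib import Path

from benchmarks.utils.report_costs import generate_cost_report  # import path assumed from repo layout


def run_eval(output_file: Path) -> None:
    # ... run the SWE-bench evaluation, writing output.jsonl and critic attempt files ...
    # Final step: generate the cost report (Path converted to str per the commit notes).
    generate_cost_report(str(output_file))
```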
Files Changed
- `benchmarks/utils/report_costs.py`: New comprehensive cost reporting script
- `benchmarks/swe_bench/eval_infer.py`: Integrated cost reporting as the final step

Testing
- Verified the integration within the `eval_infer.py` pipeline