Conversation

@juanmichelini (Collaborator) commented Dec 2, 2025

Summary

This PR introduces cost reporting for benchmark evaluations, available both as a standalone script and as an integrated final step in the evaluation pipeline.

Key Features

1. Standalone Cost Reporting Script (benchmarks/utils/report_costs.py)

  • Comprehensive cost calculation: Extracts and sums accumulated_cost from every line of each JSONL file, not just the last line (a minimal sketch follows this list)
  • Multi-file support: Handles both main output.jsonl and critic attempt files (output.critic_attempt_*.jsonl)
  • Flexible input: Accepts either directory path or specific output.jsonl file path
  • Dual output formats:
    • Console output with formatted cost breakdown
    • JSON output saved as cost_report.jsonl in the same directory
  • Error handling: Robust handling of missing or malformed files, with detailed logging
  • Executable script: Can be run directly from command line
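
As a point of reference, here is a minimal sketch of the per-file summation described in the first bullet above, assuming (as the commit notes suggest) that each JSONL line stores its cost under metrics.accumulated_cost. It is an illustration, not the script's actual implementation:

import json
from pathlib import Path

def sum_file_cost(jsonl_path: Path) -> float:
    """Sum accumulated_cost over every line of one JSONL file."""
    total = 0.0
    with jsonl_path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # assumed field location, per the commit description
            total += record.get("metrics", {}).get("accumulated_cost", 0.0)
    return total

def sum_directory(directory: Path) -> tuple[float, float]:
    """Return (main output cost, combined critic-attempt cost) for a directory."""
    main_cost = sum_file_cost(directory / "output.jsonl")
    critic_cost = sum(
        sum_file_cost(p)
        for p in sorted(directory.glob("output.critic_attempt_*.jsonl"))
    )
    return main_cost, critic_cost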

2. Integrated Cost Reporting in Evaluation Pipeline

  • Automatic integration: Cost reporting now runs as the final step in eval_infer.py
  • Shared functionality: Uses the same generate_cost_report function from the utils module
  • Reusable across benchmarks: The generate_cost_report function is available for other benchmark scripts

3. Accurate Cost Calculations

  • Fixed critical bug: Previously only captured the last line's cost; now correctly sums all accumulated costs
  • Proper field mapping (a small sketch follows this list):
    • only_main_output_cost: Cost from main output.jsonl file
    • sum_critic_files: Total cost from all critic attempt files
    • total_cost: Set to sum_critic_files (since main output is a subset of critic outputs)
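
A small sketch of how those three fields relate, matching the mapping above (illustrative only, not the script's code):

def build_summary(only_main_output_cost: float, sum_critic_files: float) -> dict:
    """Assemble the summary fields. Because main output.jsonl is a subset of the
    critic outputs, total_cost mirrors sum_critic_files rather than adding the two."""
    return {
        "only_main_output_cost": only_main_output_cost,
        "sum_critic_files": sum_critic_files,
        # with no critic files this is 0.0, matching the output-only case
        "total_cost": sum_critic_files,
    }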

Testing Results

Extensive testing on SWE-bench evaluation outputs:

  • Main output.jsonl: 500 lines, $1,139.35
  • Critic files: 3 files totaling 515 lines, $1,157.72
  • Total cost: $1,157.72 (correctly excludes double-counting)

Usage Examples

Standalone Script

# Run on directory containing output.jsonl files
python benchmarks/utils/report_costs.py /path/to/eval/directory

# Run on specific output.jsonl file
python benchmarks/utils/report_costs.py /path/to/output.jsonl
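
For reference, a cost_report.jsonl record carries the top-level fields named in the commit notes (directory, timestamp, main_output, critic_files, summary). The record below is illustrative: the summary values reuse the figures from the testing results, while the timestamp and the inner structure of main_output and critic_files are placeholders, since the PR description does not show them. In the file each record is written as a single JSON line; it is pretty-printed here for readability:

{
  "directory": "/path/to/eval/directory",
  "timestamp": "2025-12-02T15:09:00",
  "main_output": {...},
  "critic_files": [...],
  "summary": {
    "only_main_output_cost": 1139.35,
    "sum_critic_files": 1157.72,
    "total_cost": 1157.72
  }
}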

Integrated in Evaluation

Cost reporting now runs automatically as the final step when running evaluations.
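
Concretely, the integration amounts to calling the shared helper once the evaluation outputs are written. The import path follows the file layout listed under Files Changed; the surrounding function is a hypothetical sketch, not the actual eval_infer.py code:

from benchmarks.utils.report_costs import generate_cost_report

def run_evaluation(output_file: str) -> None:
    # ... run the SWE-bench evaluation and write output.jsonl ...
    # final step: print the cost breakdown and write cost_report.jsonl alongside it
    generate_cost_report(output_file)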

Files Changed

  • benchmarks/utils/report_costs.py: New comprehensive cost reporting script
  • benchmarks/swe_bench/eval_infer.py: Integrated cost reporting as final step

Testing

  • ✅ Standalone script functionality tested with real evaluation data
  • ✅ Integration testing with eval_infer.py pipeline
  • ✅ JSON output format validation
  • ✅ Error handling for edge cases (missing files, malformed data)
  • ✅ Cross-platform compatibility (Linux/macOS)
  • ✅ Pre-commit hooks passing (ruff, pyright, pycodestyle)

juanmichelini and others added 6 commits December 2, 2025 15:09

…iles

This script processes JSONL evaluation output files and calculates:
- Individual costs for each file (using accumulated_cost from metrics)
- Aggregated costs for critic files (excluding main output.jsonl)
- Summary reporting with detailed breakdown

Usage: python benchmarks/utils/report_costs.py <directory_path>

Co-authored-by: openhands <openhands@all-hands.dev>

Previously the script only took the accumulated cost from the last line.
Now it correctly sums the accumulated costs from each line in the JSONL file.

This provides the total cost across all evaluation instances rather than
just the final accumulated cost.

Co-authored-by: openhands <openhands@all-hands.dev>

- Change 'Main Output File:' to 'Selected instance in Main output.jsonl only:'
- Remove 'Total Individual Files Cost' line from summary
- Rename 'Total Critic Files Only' to 'Sum Critic Files'

Co-authored-by: openhands <openhands@all-hands.dev>

- Added cost_report.jsonl output saved in same directory as input files
- Includes structured data: directory, timestamp, main_output, critic_files, summary
- Updated docstring and help text to document JSON output feature
- Added datetime import for timestamp generation
- Enhanced error handling for JSON file writing

…and set total_cost to equal sum_critic_files

- Renamed main_output_cost to only_main_output_cost to clarify it's just the subset cost
- Changed total_cost to equal sum_critic_files since main output.jsonl is a subset of critic outputs
- Updated logic for output-only case to set total_cost to 0 (no critic files means no total cost)
- Maintains backward compatibility for critic-only scenarios

- Added generate_cost_report function to benchmarks/utils/report_costs.py for reuse across different benchmarks
- Updated eval_infer.py to import generate_cost_report from shared module
- Removed duplicate local function definition from eval_infer.py
- Function takes input_file path, extracts directory, and calls calculate_costs with error handling
- Enables cost reporting functionality to be easily integrated into other benchmark evaluation scripts
- Fixed type error: convert Path to str when calling calculate_costs

Co-authored-by: openhands <openhands@all-hands.dev>

parser = argparse.ArgumentParser(
    description="Calculate costs from JSONL evaluation output files and save detailed report",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog="""

Collaborator

why use epilog instead of adding this epilog to the description?

Collaborator Author

It's so the examples go after the params, not before; like here: https://www.blog.pythonlibrary.org/2015/10/08/a-intro-to-argparse/#:~:text=Getting%20Started,the%20command%2Dline%20after%20all. In this case both would work, though.
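
For context, a generic argparse sketch of that ordering (not the PR's actual parser): --help prints the description before the argument list and the epilog after it, so usage examples placed in the epilog land after the parameter descriptions.

import argparse

parser = argparse.ArgumentParser(
    description="Calculate costs from JSONL evaluation output files.",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog="Examples:\n  python report_costs.py /path/to/eval/directory",
)
parser.add_argument("path", help="directory or output.jsonl file")
parser.print_help()  # shows: description, then the path argument, then the Examples epilog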

@juanmichelini merged commit f7d30e5 into main Dec 4, 2025
2 checks passed