Conversation

@juanmichelini (Collaborator) commented Dec 2, 2025

Summary

This PR introduces cost reporting for benchmark evaluations, available both as a standalone script and as an integrated final step in the evaluation pipeline.

Key Features

1. Standalone Cost Reporting Script (benchmarks/utils/report_costs.py)

  • Comprehensive cost calculation: Extracts and sums accumulated_cost from every line of each JSONL file, not just the last line (a minimal sketch follows this list)
  • Multi-file support: Handles both main output.jsonl and critic attempt files (output.critic_attempt_*.jsonl)
  • Flexible input: Accepts either directory path or specific output.jsonl file path
  • Dual output formats:
    • Console output with formatted cost breakdown
    • JSON output saved as cost_report.jsonl in the same directory
  • Error handling: Robust handling of missing or malformed files, with detailed logging
  • Executable script: Can be run directly from command line
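
As a point of reference, here is a minimal sketch of the per-file summation described in the first bullet above, assuming (as the commit notes suggest) that each JSONL line stores its cost under metrics.accumulated_cost. It is an illustration, not the script's actual implementation:

import json
from pathlib import Path

def sum_file_cost(jsonl_path: Path) -> float:
    """Sum accumulated_cost over every line of one JSONL file."""
    total = 0.0
    with jsonl_path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # assumed field location, per the commit description
            total += record.get("metrics", {}).get("accumulated_cost", 0.0)
    return total

def sum_directory(directory: Path) -> tuple[float, float]:
    """Return (main output cost, combined critic-attempt cost) for a directory."""
    main_cost = sum_file_cost(directory / "output.jsonl")
    critic_cost = sum(
        sum_file_cost(p)
        for p in sorted(directory.glob("output.critic_attempt_*.jsonl"))
    )
    return main_cost, critic_cost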

2. Integrated Cost Reporting in Evaluation Pipeline

  • Automatic integration: Cost reporting now runs as the final step in eval_infer.py
  • Shared functionality: Uses the same generate_cost_report function from the utils module
  • Reusable across benchmarks: The generate_cost_report function is available for other benchmark scripts

3. Accurate Cost Calculations

  • Fixed critical bug: Previously only captured the last line's cost; now correctly sums all accumulated costs
  • Proper field mapping (a small sketch follows this list):
    • only_main_output_cost: Cost from main output.jsonl file
    • sum_critic_files: Total cost from all critic attempt files
    • total_cost: Set to sum_critic_files (since main output is a subset of critic outputs)
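
A small sketch of how those three fields relate, matching the mapping above (illustrative only, not the script's code):

def build_summary(only_main_output_cost: float, sum_critic_files: float) -> dict:
    """Assemble the summary fields. Because main output.jsonl is a subset of the
    critic outputs, total_cost mirrors sum_critic_files rather than adding the two."""
    return {
        "only_main_output_cost": only_main_output_cost,
        "sum_critic_files": sum_critic_files,
        # with no critic files this is 0.0, matching the output-only case
        "total_cost": sum_critic_files,
    }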

Testing Results

Extensive testing on SWE-bench evaluation outputs:

  • Main output.jsonl: 500 lines, $1,139.35
  • Critic files: 3 files totaling 515 lines, $1,157.72
  • Total cost: $1,157.72 (correctly excludes double-counting)

Usage Examples

Standalone Script

# Run on directory containing output.jsonl files
python benchmarks/utils/report_costs.py /path/to/eval/directory

# Run on specific output.jsonl file
python benchmarks/utils/report_costs.py /path/to/output.jsonl
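
For reference, a cost_report.jsonl record carries the top-level fields named in the commit notes (directory, timestamp, main_output, critic_files, summary). The record below is illustrative: the summary values reuse the figures from the testing results, while the timestamp and the inner structure of main_output and critic_files are placeholders, since the PR description does not show them. In the file each record is written as a single JSON line; it is pretty-printed here for readability:

{
  "directory": "/path/to/eval/directory",
  "timestamp": "2025-12-02T15:09:00",
  "main_output": {...},
  "critic_files": [...],
  "summary": {
    "only_main_output_cost": 1139.35,
    "sum_critic_files": 1157.72,
    "total_cost": 1157.72
  }
}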

Integrated in Evaluation

Cost reporting now runs automatically as the final step when running evaluations.
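
Concretely, the integration amounts to calling the shared helper once the evaluation outputs are written. The import path follows the file layout listed under Files Changed; the surrounding function is a hypothetical sketch, not the actual eval_infer.py code:

from benchmarks.utils.report_costs import generate_cost_report

def run_evaluation(output_file: str) -> None:
    # ... run the SWE-bench evaluation and write output.jsonl ...
    # final step: print the cost breakdown and write cost_report.jsonl alongside it
    generate_cost_report(output_file)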

Files Changed

  • benchmarks/utils/report_costs.py: New comprehensive cost reporting script
  • benchmarks/swe_bench/eval_infer.py: Integrated cost reporting as final step

Testing

  • ✅ Standalone script functionality tested with real evaluation data
  • ✅ Integration testing with eval_infer.py pipeline
  • ✅ JSON output format validation
  • ✅ Error handling for edge cases (missing files, malformed data)
  • ✅ Cross-platform compatibility (Linux/macOS)
  • ✅ Pre-commit hooks passing (ruff, pyright, pycodestyle)

juanmichelini and others added 6 commits December 2, 2025 15:09

…iles

This script processes JSONL evaluation output files and calculates:
- Individual costs for each file (using accumulated_cost from metrics)
- Aggregated costs for critic files (excluding main output.jsonl)
- Summary reporting with detailed breakdown

Usage: python benchmarks/utils/report_costs.py <directory_path>

Co-authored-by: openhands <openhands@all-hands.dev>

Previously the script only took the accumulated cost from the last line.
Now it correctly sums the accumulated costs from each line in the JSONL file.

This provides the total cost across all evaluation instances rather than
just the final accumulated cost.

Co-authored-by: openhands <openhands@all-hands.dev>

- Change 'Main Output File:' to 'Selected instance in Main output.jsonl only:'
- Remove 'Total Individual Files Cost' line from summary
- Rename 'Total Critic Files Only' to 'Sum Critic Files'

Co-authored-by: openhands <openhands@all-hands.dev>

- Added cost_report.jsonl output saved in same directory as input files
- Includes structured data: directory, timestamp, main_output, critic_files, summary
- Updated docstring and help text to document JSON output feature
- Added datetime import for timestamp generation
- Enhanced error handling for JSON file writing

…and set total_cost to equal sum_critic_files

- Renamed main_output_cost to only_main_output_cost to clarify it's just the subset cost
- Changed total_cost to equal sum_critic_files since main output.jsonl is a subset of critic outputs
- Updated logic for output-only case to set total_cost to 0 (no critic files means no total cost)
- Maintains backward compatibility for critic-only scenarios

- Added generate_cost_report function to benchmarks/utils/report_costs.py for reuse across different benchmarks
- Updated eval_infer.py to import generate_cost_report from shared module
- Removed duplicate local function definition from eval_infer.py
- Function takes input_file path, extracts directory, and calls calculate_costs with error handling
- Enables cost reporting functionality to be easily integrated into other benchmark evaluation scripts
- Fixed type error: convert Path to str when calling calculate_costs

Co-authored-by: openhands <openhands@all-hands.dev>

parser = argparse.ArgumentParser(
    description="Calculate costs from JSONL evaluation output files and save detailed report",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog="""

Collaborator

why use epilog instead of adding this epilog to the description?

Collaborator Author

It's so the examples go after the params, not before; like here: https://www.blog.pythonlibrary.org/2015/10/08/a-intro-to-argparse/#:~:text=Getting%20Started,the%20command%2Dline%20after%20all. In this case both would work, though.
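
For context, a generic argparse sketch of that ordering (not the PR's actual parser): --help prints the description before the argument list and the epilog after it, so usage examples placed in the epilog land after the parameter descriptions.

import argparse

parser = argparse.ArgumentParser(
    description="Calculate costs from JSONL evaluation output files.",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog="Examples:\n  python report_costs.py /path/to/eval/directory",
)
parser.add_argument("path", help="directory or output.jsonl file")
parser.print_help()  # shows: description, then the path argument, then the Examples epilog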

@juanmichelini merged commit f7d30e5 into main Dec 4, 2025
2 checks passed