feat(swebench-pro): initial support for ScaleAI/SWE-bench_Pro #16
Merged
8620f1d to 6783343
Adds an agent execution pipeline for SWE-bench Pro mirroring the
existing Verified pipeline. Key differences from Verified: Docker images
come from jefzda/sweap-images (tag from the dataset's dockerhub_tag
field), evaluation runs per-instance run_script.sh/parser.py scripts
from the SWE-bench_Pro-os submodule rather than the swebench library,
and grading checks fail_to_pass ∪ pass_to_pass against output.json.
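
The image lookup and grading rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names `image_ref` and `is_resolved` are hypothetical, and the shape of output.json (a `tests` list of `{"name", "status"}` entries with `"PASSED"` marking success) is an assumption about what the parser.py scripts emit.

```python
from typing import Iterable


def image_ref(dockerhub_tag: str) -> str:
    """Build the Docker image reference from the dataset's dockerhub_tag field."""
    return f"jefzda/sweap-images:{dockerhub_tag}"


def is_resolved(
    output: dict,
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True when every required test passed.

    ASSUMPTION: `output` is the parsed output.json and looks like
    {"tests": [{"name": "...", "status": "PASSED"}, ...]}.
    """
    passed = {
        t["name"]
        for t in output.get("tests", [])
        if t.get("status") == "PASSED"
    }
    # Required set is fail_to_pass ∪ pass_to_pass; all of it must have passed.
    required = set(fail_to_pass) | set(pass_to_pass)
    return required <= passed
```

A test that is missing from output.json counts as not passed under this sketch, so a crashed test run grades as unresolved rather than raising an error.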
New files:
- src/tasks/swe_bench_pro.py — dataset adapter (ScaleAI/SWE-bench_Pro)
- src/agents/swe_bench_pro_environment.py — Modal environment with Pro eval
- run_swebench_pro_agent.py — agent runner
- slurm/swebench_pro_run.sh — SLURM launcher
- configs/tasks/swe_bench_pro.yaml
- configs/runs/qwen36_27b_swebench_pro_{test,full}.yaml
- tests/test_swebench_pro_evaluate_modal.py
- SWE-bench_Pro-os submodule (scaleapi/SWE-bench_Pro-os)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6783343 to 81501d4
Add Pro-specific labeler (run_labeler_pro.py + src/labeling/swebench_pro_labeler.py) that replays diffs using the Pro eval harness (entryscript.sh → output.json).
Add SwebenchProLabelerConfig and to_pro_agent_config() in configs.py.
Fix agent prompt to use /app instead of /testbed, and drop redundant preamble from the task string since the problem statement already contains that info.
Add run configs and labeling configs for laguna_xs2 on swebench-pro test set.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pass@10
Add laguna_xs2_swebench_pro_full.yaml (10 runs, 20 workers, 5 shards × 20 = 100 concurrent Modal containers) and swebench_pro_run_thin.sh (4 GPUs, no fat constraint, 32h wall time) for the full Pro generation run.
Also fix export_swebench_dashboard to always use run_id for cache lookup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
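
For reference, the concurrency figure and a pass@10 style aggregate over the 10 runs might be computed as below. This is a rough sketch under stated assumptions: the constant names and `pass_at_all_runs` are hypothetical, and treating pass@10 as "resolved in at least one of the 10 runs" is an assumption, since the metric code is not part of this description.

```python
from typing import Dict, List

# Values taken from the commit message; the names are illustrative only.
NUM_RUNS = 10
WORKERS_PER_SHARD = 20
NUM_SHARDS = 5

# 5 shards × 20 workers = 100 concurrent Modal containers.
MAX_CONCURRENT_CONTAINERS = NUM_SHARDS * WORKERS_PER_SHARD


def pass_at_all_runs(results: Dict[str, List[bool]]) -> float:
    """Fraction of instances resolved by at least one run.

    ASSUMPTION: `results` maps instance_id -> one resolved flag per run.
    With 10 runs per instance this corresponds to pass@10.
    """
    if not results:
        return 0.0
    solved = sum(1 for runs in results.values() if any(runs))
    return solved / len(results)
```

With fewer samples than runs (for example pass@1 estimated from 10 runs), a different estimator would be needed; this sketch only covers the case where k equals the number of runs.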