feat(swebench-pro): initial support for ScaleAI/SWE-bench_Pro by andre15silva · Pull Request #16 · ASSERT-KTH/program-probes

andre15silva · 2026-06-01T08:23:33Z

Adds an agent execution pipeline for SWE-bench Pro mirroring the existing Verified pipeline. Key differences from Verified: Docker images come from jefzda/sweap-images (tag from the dataset's dockerhub_tag field), evaluation runs per-instance run_script.sh/parser.py scripts from the SWE-bench_Pro-os submodule rather than the swebench library, and grading checks fail_to_pass ∪ pass_to_pass against output.json.

New files:

src/tasks/swe_bench_pro.py — dataset adapter (ScaleAI/SWE-bench_Pro)
src/agents/swe_bench_pro_environment.py — Modal environment with Pro eval
run_swebench_pro_agent.py — agent runner
slurm/swebench_pro_run.sh — SLURM launcher
configs/tasks/swe_bench_pro.yaml
configs/runs/qwen36_27b_swebench_pro_{test,full}.yaml
tests/test_swebench_pro_evaluate_modal.py
SWE-bench_Pro-os submodule (scaleapi/SWE-bench_Pro-os)
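
For reviewers, the image selection and grading rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `image_ref` and `is_resolved` are hypothetical, and the `output.json` layout (a `tests` list of `name`/`status` entries, with `"PASSED"` marking success) is an assumption about what `parser.py` emits.

```python
from typing import Iterable, Mapping

# Repository named in the PR description; the tag comes from the dataset row.
IMAGE_REPO = "jefzda/sweap-images"


def image_ref(instance: Mapping[str, str], repo: str = IMAGE_REPO) -> str:
    """Build the Docker image reference from the row's dockerhub_tag field."""
    return f"{repo}:{instance['dockerhub_tag']}"


def is_resolved(
    output: Mapping,
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True when every test in fail_to_pass ∪ pass_to_pass passed.

    ASSUMPTION: `output` is the parsed output.json and looks like
    {"tests": [{"name": "...", "status": "PASSED"}, ...]}.
    """
    passed = {
        t["name"]
        for t in output.get("tests", [])
        if t.get("status") == "PASSED"
    }
    required = set(fail_to_pass) | set(pass_to_pass)
    # Subset check: a single missing or failing required test fails the instance.
    return required <= passed
```

Note that under this sketch an instance with an empty required set would count as resolved, so a real adapter may want to reject such rows explicitly.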


Add Pro-specific labeler (run_labeler_pro.py + src/labeling/swebench_pro_labeler.py) that replays diffs using the Pro eval harness (entryscript.sh → output.json).

Add SwebenchProLabelerConfig and to_pro_agent_config() in configs.py.

Fix agent prompt to use /app instead of /testbed, and drop redundant preamble from the task string since the problem statement already contains that info.

Add run configs and labeling configs for laguna_xs2 on swebench-pro test set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…pass@10

Add laguna_xs2_swebench_pro_full.yaml (10 runs, 20 workers, 5 shards × 20 = 100 concurrent Modal containers) and swebench_pro_run_thin.sh (4 GPUs, no fat constraint, 32h wall time) for the full Pro generation run.

Also fix export_swebench_dashboard to always use run_id for cache lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
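
As a rough aid for reading the numbers in this commit, the sketch below reproduces the concurrency arithmetic and shows the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k). Whether this repository computes pass@10 this way is an assumption; `pass_at_k` and `concurrent_containers` are hypothetical helpers, not functions from the PR.

```python
from math import comb


def concurrent_containers(shards: int, workers_per_shard: int) -> int:
    """Total containers running at once when every shard is fully busy."""
    return shards * workers_per_shard


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    ASSUMPTION: this is the standard estimator, not necessarily the
    metric implemented in this repository.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Values from the commit message: 5 shards with 20 workers each.
total = concurrent_containers(shards=5, workers_per_shard=20)

# With n = k = 10, the estimate is 1.0 as soon as any of the 10 runs succeeds.
any_success = pass_at_k(n=10, c=1, k=10)
no_success = pass_at_k(n=10, c=0, k=10)
```

With 10 runs per instance and k = 10, the estimator reduces to "did at least one run resolve the instance", which is why a full pass@10 evaluation needs all 10 runs to complete.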

andre15silva force-pushed the main branch from 6ca656f to 3d7b934 on June 1, 2026 09:35

andre15silva force-pushed the andre/feat/swebench-pro branch 2 times, most recently from 8620f1d to 6783343, on June 2, 2026 16:27

andre15silva and others added 2 commits June 2, 2026 18:38

fix test

81501d4

andre15silva force-pushed the andre/feat/swebench-pro branch from 6783343 to 81501d4 on June 2, 2026 16:39

andre15silva and others added 2 commits June 2, 2026 22:21

andre15silva marked this pull request as ready for review June 3, 2026 12:07

add run_eval mode to probe

764b17b

andre15silva merged commit 7508674 into main Jun 3, 2026
1 check passed

andre15silva merged 5 commits into main from andre/feat/swebench-pro