feat(swebench-pro): initial support for ScaleAI/SWE-bench_Pro#16

Merged
andre15silva merged 5 commits into main from andre/feat/swebench-pro on Jun 3, 2026
@andre15silva

Member

Adds an agent execution pipeline for SWE-bench Pro mirroring the existing Verified pipeline. Key differences from Verified: Docker images come from jefzda/sweap-images (tag from the dataset's dockerhub_tag field), evaluation runs per-instance run_script.sh/parser.py scripts from the SWE-bench_Pro-os submodule rather than the swebench library, and grading checks fail_to_pass ∪ pass_to_pass against output.json.

New files:

  • src/tasks/swe_bench_pro.py — dataset adapter (ScaleAI/SWE-bench_Pro)
  • src/agents/swe_bench_pro_environment.py — Modal environment with Pro eval
  • run_swebench_pro_agent.py — agent runner
  • slurm/swebench_pro_run.sh — SLURM launcher
  • configs/tasks/swe_bench_pro.yaml
  • configs/runs/qwen36_27b_swebench_pro_{test,full}.yaml
  • tests/test_swebench_pro_evaluate_modal.py
  • SWE-bench_Pro-os submodule (scaleapi/SWE-bench_Pro-os)
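For reviewers, the grading rule described above can be sketched roughly as follows. This is not the code in src/agents/swe_bench_pro_environment.py; the function name `is_resolved` and the assumed shape of output.json (a top-level `tests` list of objects with `name` and `status` fields, where a passing test has status `PASSED`) are illustrative assumptions.

```python
import json


def is_resolved(output_json_text, fail_to_pass, pass_to_pass):
    """Return True if every required test is reported as passed.

    Assumption (not confirmed by this PR): output.json looks like
    {"tests": [{"name": "...", "status": "PASSED"}, ...]}.
    """
    output = json.loads(output_json_text)
    # Collect the names of tests the parser reported as passing.
    passed = {
        test["name"]
        for test in output.get("tests", [])
        if test.get("status") == "PASSED"
    }
    # Grading requires the union fail_to_pass ∪ pass_to_pass to pass.
    required = set(fail_to_pass) | set(pass_to_pass)
    return required <= passed
```

Under this sketch an instance with an empty required set counts as resolved, so a real implementation may want to treat that case explicitly.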

@andre15silva andre15silva force-pushed the andre/feat/swebench-pro branch 2 times, most recently from 8620f1d to 6783343 Compare June 2, 2026 16:27
andre15silva and others added 2 commits June 2, 2026 18:38
Adds an agent execution pipeline for SWE-bench Pro mirroring the
existing Verified pipeline. Key differences from Verified: Docker images
come from jefzda/sweap-images (tag from the dataset's dockerhub_tag
field), evaluation runs per-instance run_script.sh/parser.py scripts
from the SWE-bench_Pro-os submodule rather than the swebench library,
and grading checks fail_to_pass ∪ pass_to_pass against output.json.

New files:
- src/tasks/swe_bench_pro.py — dataset adapter (ScaleAI/SWE-bench_Pro)
- src/agents/swe_bench_pro_environment.py — Modal environment with Pro eval
- run_swebench_pro_agent.py — agent runner
- slurm/swebench_pro_run.sh — SLURM launcher
- configs/tasks/swe_bench_pro.yaml
- configs/runs/qwen36_27b_swebench_pro_{test,full}.yaml
- tests/test_swebench_pro_evaluate_modal.py
- SWE-bench_Pro-os submodule (scaleapi/SWE-bench_Pro-os)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
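As a rough illustration of the image lookup described in this commit, the Docker reference for an instance could be built from its dockerhub_tag field as below. The helper name `pro_image_ref` and the dict-style instance are hypothetical; only the repository name and the field name come from the commit message.

```python
def pro_image_ref(instance, repository="jefzda/sweap-images"):
    """Build a Docker image reference for one SWE-bench Pro instance.

    `instance` is assumed to be a dict-like dataset row that carries
    the `dockerhub_tag` field mentioned in the commit message.
    """
    tag = instance.get("dockerhub_tag")
    if not tag:
        # Fail early rather than silently pulling a default tag.
        raise ValueError("instance has no dockerhub_tag")
    return f"{repository}:{tag}"
```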
@andre15silva andre15silva force-pushed the andre/feat/swebench-pro branch from 6783343 to 81501d4 Compare June 2, 2026 16:39
andre15silva and others added 2 commits June 2, 2026 22:21
Add Pro-specific labeler (run_labeler_pro.py + src/labeling/swebench_pro_labeler.py)
that replays diffs using the Pro eval harness (entryscript.sh → output.json).
Add SwebenchProLabelerConfig and to_pro_agent_config() in configs.py.
Fix agent prompt to use /app instead of /testbed, and drop redundant preamble
from the task string since the problem statement already contains that info.
Add run configs and labeling configs for laguna_xs2 on swebench-pro test set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pass@10

Add laguna_xs2_swebench_pro_full.yaml (10 runs, 20 workers, 5 shards × 20 = 100
concurrent Modal containers) and swebench_pro_run_thin.sh (4 GPUs, no fat
constraint, 32h wall time) for the full Pro generation run.
Also fix export_swebench_dashboard to always use run_id for cache lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
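The numbers in this commit can be checked with a small sketch. It assumes the truncated commit title refers to the standard pass@k metric and uses the usual unbiased estimator 1 - C(n-c, k) / C(n, k); the function names here are illustrative and not part of the repository.

```python
from math import comb


def concurrent_containers(shards, workers_per_shard):
    """Peak concurrent Modal containers if every worker is busy."""
    return shards * workers_per_shard


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 runs per instance and k = 10, the estimate reduces to whether at least one of the 10 runs resolved the instance.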
@andre15silva andre15silva marked this pull request as ready for review June 3, 2026 12:07
@andre15silva andre15silva merged commit 7508674 into main Jun 3, 2026
1 check passed