Add MLE-bench agent adapter #191

Open
RitwijParmar wants to merge 2 commits into plexe-ai:main from RitwijParmar:codex/mlebench-agent-adapter

@RitwijParmar

Summary

  • adds a self-contained MLE-bench agent adapter under tests/benchmark/mlebench
  • includes Docker/config/start entrypoints for the official openai/mle-bench runner
  • adds a Python runner that discovers train/test/sample submission files, runs Plexe, shapes predictions into /home/submission/submission.csv, and writes debug metadata/logs
  • adds unit tests for dataset discovery, submission shaping, and existing submission copy behavior
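
The discovery step described above could look roughly like the sketch below. It is not the code in run_mlebench.py: the function name `discover_dataset_files`, the filename keywords, and the supported suffixes are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumption: only these tabular formats are considered.
TABULAR_SUFFIXES = {".csv", ".parquet"}


def discover_dataset_files(data_dir: Path) -> Dict[str, Optional[Path]]:
    """Return the first train, test, and sample-submission file found by name.

    Hypothetical helper: matching is done on lowercase filename keywords.
    """
    found: Dict[str, Optional[Path]] = {
        "train": None,
        "test": None,
        "sample_submission": None,
    }
    # Sort for a deterministic choice when several files match.
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in TABULAR_SUFFIXES:
            continue
        name = path.stem.lower()
        # Check the sample submission first so it is not classified by a
        # looser keyword.
        if "sample" in name and "submission" in name:
            key = "sample_submission"
        elif "train" in name:
            key = "train"
        elif "test" in name:
            key = "test"
        else:
            continue
        if found[key] is None:
            found[key] = path
    return found
```

A caller would typically fail early if `train` or `test` comes back as `None`, since nothing useful can be submitted without them.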

Why

This follows up on #124. The issue asks for a way to run Plexe on MLE-bench Lite with 10 iterations and record results. This PR does not claim benchmark results; it adds the runnable adapter needed to produce those results in the official MLE-bench harness without mixing private Kaggle/API execution state into the code change.

Testing

  • python3 -m pytest tests/unit/benchmark/test_mlebench_adapter.py -q
  • python3 -m py_compile tests/benchmark/mlebench/plexe/run_mlebench.py
  • python3 -m black tests/benchmark/mlebench/plexe/run_mlebench.py tests/unit/benchmark/test_mlebench_adapter.py

Disclosure

I used AI assistance while preparing this contribution and reviewed the resulting code before opening the PR.

@greptile-apps
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR adds a self-contained MLE-bench agent adapter under tests/benchmark/mlebench — a Dockerfile, entrypoint script, Python runner, and unit tests — so Plexe can be evaluated against the official OpenAI MLE-bench harness without mixing benchmark execution state into the main codebase.

  • config.yaml registers the agent but references plexe/start.sh and plexe/Dockerfile while those files sit at the directory root, so MLE-bench will fail to locate them when the adapter is placed at agents/plexe/ in an MLE-bench checkout.
  • run_mlebench.py handles dataset discovery, model training via plexe.main.main, and submission shaping; the coerce_predictions_to_submission helper slices predictions to len(test_df) as an upper bound but has no guard for the reverse case, causing a ValueError when the model returns fewer rows than the test set.

Confidence Score: 3/5

Two defects need fixing before the adapter can run end-to-end: a wrong path prefix in config.yaml that prevents MLE-bench from even starting the container, and a missing lower-bound guard in submission shaping that crashes the run when prediction count is less than test row count.

The config.yaml start/dockerfile path mismatch means the MLE-bench harness will fail to launch any container at all — every attempted benchmark run would error out before Plexe executes. The submission-shaping length mismatch is a second independent crash path that surfaces once the container does run. Both are on the primary execution path of the adapter.

tests/benchmark/mlebench/config.yaml and the coerce_predictions_to_submission function in tests/benchmark/mlebench/plexe/run_mlebench.py

Important Files Changed

| Filename | Overview |
|---|---|
| tests/benchmark/mlebench/config.yaml | Agent config references plexe/start.sh and plexe/Dockerfile, but those files are at the directory root; the extra plexe/ prefix will cause MLE-bench to fail to locate either file. |
| tests/benchmark/mlebench/plexe/run_mlebench.py | Core adapter logic: dataset discovery, predictor loading, and submission shaping are all present and well-structured, but coerce_predictions_to_submission will raise a ValueError when the model returns fewer predictions than test rows (no lower-bound length guard). |
| tests/benchmark/mlebench/Dockerfile | Installs plexe with extras into the agent conda env; COPY . ${AGENT_DIR} correctly places plexe/run_mlebench.py at the path start.sh expects. |
| tests/benchmark/mlebench/start.sh | Correctly activates the conda env, runs ${AGENT_DIR}/plexe/run_mlebench.py, and calls the validate script; the previously-reported path bug is now fixed. |
| tests/unit/benchmark/test_mlebench_adapter.py | Unit tests cover dataset discovery, submission column shaping, submission copy, entrypoint path correctness, and model-type validation; all look correct. |
| tests/benchmark/mlebench/README.md | Clear usage documentation covering build, run, and grading steps; environment variable defaults are accurately described. |
| tests/benchmark/mlebench/plexe/__init__.py | Trivial module marker; no issues. |

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
tests/benchmark/mlebench/config.yaml:1-3
The `start` and `dockerfile` paths use a `plexe/` prefix that doesn't correspond to the actual file layout. `start.sh` and `Dockerfile` both live at the root of this directory (i.e., alongside `config.yaml`), not inside a nested `plexe/` sub-directory. When this config is placed at `agents/plexe/config.yaml` inside an MLE-bench checkout, the harness will look for `agents/plexe/plexe/start.sh` and `agents/plexe/plexe/Dockerfile`, neither of which exists, causing every container launch to fail before Plexe runs.

```suggestion
plexe:
  start: start.sh
  dockerfile: Dockerfile
```
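
To make the path mismatch concrete, here is a small sketch of the join described above. It assumes the harness resolves `start` and `dockerfile` relative to the agent directory, as the comment states; `resolve_agent_file` is a hypothetical helper, not an MLE-bench function.

```python
from pathlib import PurePosixPath


def resolve_agent_file(agent_dir: str, configured_path: str) -> str:
    """Join a config-relative path onto the agent directory.

    Assumption: this mirrors how the harness locates files; it is only an
    illustration of the path arithmetic.
    """
    return str(PurePosixPath(agent_dir) / configured_path)


# With the extra prefix, the resolved path gains a second plexe/ segment.
print(resolve_agent_file("agents/plexe", "plexe/start.sh"))  # agents/plexe/plexe/start.sh

# With the suggested config, the path points at the file next to config.yaml.
print(resolve_agent_file("agents/plexe", "start.sh"))  # agents/plexe/start.sh
```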

### Issue 2 of 2
tests/benchmark/mlebench/plexe/run_mlebench.py:249-255
**Prediction/test length mismatch raises ValueError**

`submission` is built with `index=range(len(test_df))`, but the prediction array is sliced to `[: len(test_df)]` — only an upper-bound guard. When the model returns fewer predictions than there are test rows (e.g., silent row-dropping in a preprocessing step), assigning a shorter numpy array to the fixed-length DataFrame column raises `ValueError: Length of values does not match length of index`, causing the entire run to fail and produce no submission. Consider padding with `NaN` (or the sample submission's default value) to fill the gap, so a partial prediction can still be written out.

Reviews (2): Last reviewed commit: "fix(benchmark): harden MLE-bench adapter..."

Comment thread tests/benchmark/mlebench/start.sh Outdated
Comment thread tests/benchmark/mlebench/plexe/run_mlebench.py Outdated
Signed-off-by: Ritwij Aryan Parmar <ritwij.aryan.parmar@gmail.com>
@RitwijParmar
Author

RitwijParmar commented May 28, 2026

Thanks for catching this. The entrypoint path could indeed have made the adapter fail before Plexe ran.

I pushed 6eb529b with the path fixed, clearer handling for a missing or unsupported model_type, and two small regression tests around those cases.

I re-ran the adapter unit tests locally (6 passed) plus the runner compile and shell syntax checks.

Comment on lines +249 to +255
) -> None:
    """Create `submission.csv` by running the packaged Plexe predictor on test rows."""

    package_dir = work_dir / "model"
    predictor = load_predictor(package_dir)
    test_df = read_tabular_sample(test_dataset)
    id_column = infer_id_column(sample_submission, test_dataset)

P1 Prediction/test length mismatch raises ValueError

submission is built with index=range(len(test_df)), but the prediction array is sliced to [: len(test_df)] — only an upper-bound guard. When the model returns fewer predictions than there are test rows (e.g., silent row-dropping in a preprocessing step), assigning a shorter numpy array to the fixed-length DataFrame column raises ValueError: Length of values does not match length of index, causing the entire run to fail and produce no submission. Consider padding with NaN (or the sample submission's default value) to fill the gap, so a partial prediction can still be written out.
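
One possible shape for the suggested guard is sketched below. It is not the adapter's code: `fit_predictions_to_length` is a hypothetical helper, and NaN is only an assumed default fill value. It truncates predictions that are too long and pads those that are too short, so the result always has one entry per test row and can be assigned to a fixed-length column.

```python
import math
from typing import Any, List, Sequence


def fit_predictions_to_length(
    predictions: Sequence[Any],
    n_rows: int,
    fill_value: Any = math.nan,
) -> List[Any]:
    """Truncate or pad predictions so the result has exactly n_rows entries.

    Hypothetical helper: fill_value could instead be the default taken from
    the sample submission.
    """
    # Upper-bound guard: drop any surplus predictions.
    values = list(predictions)[:n_rows]
    # Lower-bound guard: pad the tail when predictions are missing.
    values.extend([fill_value] * (n_rows - len(values)))
    return values


# Two predictions for four test rows: the last two entries are padded.
padded = fit_predictions_to_length([0.2, 0.9], n_rows=4)
print(len(padded))  # 4
```

Padding at the tail assumes the missing rows are the last ones. If rows can be dropped from the middle, the predictions would need to be aligned on the id column instead, and logging the size of the gap would keep the shortfall visible.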

