Add MLE-bench agent adapter by RitwijParmar · Pull Request #191 · plexe-ai/plexe

RitwijParmar · 2026-05-28T16:09:28Z

Summary

- Adds a self-contained MLE-bench agent adapter under tests/benchmark/mlebench.
- Includes Docker/config/start entrypoints for the official openai/mle-bench runner.
- Adds a Python runner that discovers train/test/sample submission files, runs Plexe, shapes predictions into /home/submission/submission.csv, and writes debug metadata/logs.
- Adds unit tests for dataset discovery, submission shaping, and existing submission copy behavior.
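
As an illustration of the dataset discovery step described above, here is a minimal sketch. The function name `discover_dataset_files` and the file-name heuristics are assumptions made for this example; the runner's actual discovery logic is not shown on this page and may differ.

```python
from pathlib import Path
from typing import Dict, Optional


def discover_dataset_files(data_dir: Path) -> Dict[str, Optional[Path]]:
    """Guess the train, test, and sample-submission CSVs from their file names.

    Hypothetical helper for illustration only, not the adapter's real code.
    """
    found: Dict[str, Optional[Path]] = {
        "train": None,
        "test": None,
        "sample_submission": None,
    }
    # Sort so the result is deterministic across file systems.
    for path in sorted(data_dir.rglob("*.csv")):
        name = path.stem.lower()
        if "sample" in name and "submission" in name:
            role = "sample_submission"
        elif "train" in name:
            role = "train"
        elif "test" in name:
            role = "test"
        else:
            continue  # unrelated CSV, ignore it
        # Keep the first match for each role.
        if found[role] is None:
            found[role] = path
    return found
```

A caller would likely treat a missing `train` or `test` entry as a hard error and a missing `sample_submission` as a case needing a fallback column layout, but that policy is also an assumption here.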

Why

This follows up on #124. The issue asks for a way to run Plexe on MLE-bench Lite with 10 iterations and record results. This PR does not claim benchmark results; it adds the runnable adapter needed to produce those results in the official MLE-bench harness without mixing private Kaggle/API execution state into the code change.

Testing

python3 -m pytest tests/unit/benchmark/test_mlebench_adapter.py -q
python3 -m py_compile tests/benchmark/mlebench/plexe/run_mlebench.py
python3 -m black tests/benchmark/mlebench/plexe/run_mlebench.py tests/unit/benchmark/test_mlebench_adapter.py

Disclosure

I used AI assistance while preparing this contribution and reviewed the resulting code before opening the PR.

greptile-apps · 2026-05-28T16:12:01Z

Greptile Summary

This PR adds a self-contained MLE-bench agent adapter under tests/benchmark/mlebench — a Dockerfile, entrypoint script, Python runner, and unit tests — so Plexe can be evaluated against the official OpenAI MLE-bench harness without mixing benchmark execution state into the main codebase.

- config.yaml registers the agent but references plexe/start.sh and plexe/Dockerfile while those files sit at the directory root, so MLE-bench will fail to locate them when the adapter is placed at agents/plexe/ in an MLE-bench checkout.
- run_mlebench.py handles dataset discovery, model training via plexe.main.main, and submission shaping; the coerce_predictions_to_submission helper slices predictions to len(test_df) as an upper bound but has no guard for the reverse case, causing a ValueError when the model returns fewer rows than the test set.

Confidence Score: 3/5

Two defects need fixing before the adapter can run end-to-end: a wrong path prefix in config.yaml that prevents MLE-bench from even starting the container, and a missing lower-bound guard in submission shaping that crashes the run when prediction count is less than test row count.

The config.yaml start/dockerfile path mismatch means the MLE-bench harness will fail to launch any container at all — every attempted benchmark run would error out before Plexe executes. The submission-shaping length mismatch is a second independent crash path that surfaces once the container does run. Both are on the primary execution path of the adapter.

Files needing attention: tests/benchmark/mlebench/config.yaml and the coerce_predictions_to_submission function in tests/benchmark/mlebench/plexe/run_mlebench.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmark/mlebench/config.yaml | Agent config references `plexe/start.sh` and `plexe/Dockerfile`, but those files are at the directory root, so the extra `plexe/` prefix will cause MLE-bench to fail to locate either file. |
| tests/benchmark/mlebench/plexe/run_mlebench.py | Core adapter logic: dataset discovery, predictor loading, and submission shaping are all present and well-structured, but `coerce_predictions_to_submission` will raise a ValueError when the model returns fewer predictions than test rows (no lower-bound length guard). |
| tests/benchmark/mlebench/Dockerfile | Installs plexe with extras into the `agent` conda env; `COPY . ${AGENT_DIR}` correctly places `plexe/run_mlebench.py` at the path `start.sh` expects. |
| tests/benchmark/mlebench/start.sh | Correctly activates the conda env, runs `${AGENT_DIR}/plexe/run_mlebench.py`, and calls the validate script; the previously-reported path bug is now fixed. |
| tests/unit/benchmark/test_mlebench_adapter.py | Unit tests cover dataset discovery, submission column shaping, submission copy, entrypoint path correctness, and model-type validation; all look correct. |
| tests/benchmark/mlebench/README.md | Clear usage documentation covering build, run, and grading steps; environment variable defaults are accurately described. |
| tests/benchmark/mlebench/plexe/`__init__.py` | Trivial module marker; no issues. |

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
tests/benchmark/mlebench/config.yaml:1-3
The `start` and `dockerfile` paths use a `plexe/` prefix that doesn't correspond to the actual file layout. `start.sh` and `Dockerfile` both live at the root of this directory (i.e., alongside `config.yaml`), not inside a nested `plexe/` sub-directory. When this config is placed at `agents/plexe/config.yaml` inside an MLE-bench checkout, the harness will look for `agents/plexe/plexe/start.sh` and `agents/plexe/plexe/Dockerfile`, neither of which exists, causing every container launch to fail before Plexe runs.

```suggestion
plexe:
  start: start.sh
  dockerfile: Dockerfile
```
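
To check a layout against a config entry before launching a container, a small pre-flight test can help. The sketch below assumes, as the comment above does, that `start` and `dockerfile` are resolved relative to one base directory; which directory the harness actually uses should be confirmed against the MLE-bench checkout in use. `missing_agent_files` is a hypothetical helper, not part of MLE-bench or Plexe.

```python
from pathlib import Path
from typing import Dict, List


def missing_agent_files(base_dir: Path, entry: Dict[str, str]) -> List[str]:
    """Return the config keys whose paths do not exist under base_dir.

    `entry` mirrors one agent block from config.yaml, e.g.
    {"start": "start.sh", "dockerfile": "Dockerfile"}.
    """
    return [
        key
        for key in ("start", "dockerfile")
        if not (base_dir / entry[key]).is_file()
    ]
```

Running this once with the agent directory and once with its parent as `base_dir` shows which of the two path styles (`start.sh` or `plexe/start.sh`) matches the files on disk.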

### Issue 2 of 2
tests/benchmark/mlebench/plexe/run_mlebench.py:249-255
**Prediction/test length mismatch raises ValueError**

`submission` is built with `index=range(len(test_df))`, but the prediction array is sliced to `[: len(test_df)]` — only an upper-bound guard. When the model returns fewer predictions than there are test rows (e.g., silent row-dropping in a preprocessing step), assigning a shorter numpy array to the fixed-length DataFrame column raises `ValueError: Length of values does not match length of index`, causing the entire run to fail and produce no submission. Consider padding with `NaN` (or the sample submission's default value) to fill the gap, so a partial prediction can still be written out.

Reviews (2): Last reviewed commit: "fix(benchmark): harden MLE-bench adapter..."

Signed-off-by: Ritwij Aryan Parmar <ritwij.aryan.parmar@gmail.com>

RitwijParmar · 2026-05-28T16:24:04Z

Thanks for catching this. The entrypoint path could indeed have made the adapter fail before Plexe could run.

I pushed 6eb529b with the path fixed, clearer handling for a missing or unsupported model_type, and two small regression tests around those cases.

I re-ran the adapter unit tests locally (6 passed) plus the runner compile and shell syntax checks.

greptile-apps · 2026-05-28T16:28:09Z

+) -> None:
+    """Create `submission.csv` by running the packaged Plexe predictor on test rows."""
+
+    package_dir = work_dir / "model"
+    predictor = load_predictor(package_dir)
+    test_df = read_tabular_sample(test_dataset)
+    id_column = infer_id_column(sample_submission, test_dataset)


Prediction/test length mismatch raises ValueError

submission is built with index=range(len(test_df)), but the prediction array is sliced to [: len(test_df)] — only an upper-bound guard. When the model returns fewer predictions than there are test rows (e.g., silent row-dropping in a preprocessing step), assigning a shorter numpy array to the fixed-length DataFrame column raises ValueError: Length of values does not match length of index, causing the entire run to fail and produce no submission. Consider padding with NaN (or the sample submission's default value) to fill the gap, so a partial prediction can still be written out.
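
One possible shape for the suggested guard is sketched below. `fit_predictions_to_length` is a hypothetical helper name, and the sketch assumes 1-D predictions; it is not the code in `run_mlebench.py`. It truncates extra predictions and pads missing ones so the assignment to a fixed-length DataFrame no longer fails.

```python
import numpy as np
import pandas as pd


def fit_predictions_to_length(predictions, n_rows: int, fill_value=np.nan) -> pd.Series:
    """Truncate or pad 1-D predictions so the result has exactly n_rows entries."""
    # Upper bound: drop any predictions beyond the number of test rows.
    values = np.asarray(predictions).ravel()[:n_rows]
    # Lower bound: reindex onto 0..n_rows-1, filling the missing tail.
    return pd.Series(values).reindex(range(n_rows), fill_value=fill_value)


# Usage: three test rows, but the model returned only two predictions.
test_df = pd.DataFrame({"id": [10, 11, 12]})
submission = pd.DataFrame({"id": test_df["id"].to_numpy()}, index=range(len(test_df)))
submission["target"] = fit_predictions_to_length([0.2, 0.9], len(test_df)).to_numpy()
```

Padding keeps the run alive, but a grader may reject or penalize `NaN` rows, so passing a default taken from the sample submission as `fill_value` and logging the shortfall may be the safer choice.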

feat(benchmark): add MLE-bench adapter

9b39643

greptile-apps Bot reviewed May 28, 2026

Comment thread tests/benchmark/mlebench/start.sh Outdated

Comment thread tests/benchmark/mlebench/plexe/run_mlebench.py Outdated

fix(benchmark): harden MLE-bench adapter startup

6eb529b

Signed-off-by: Ritwij Aryan Parmar <ritwij.aryan.parmar@gmail.com>

greptile-apps Bot reviewed May 28, 2026

RitwijParmar wants to merge 2 commits into plexe-ai:main from RitwijParmar:codex/mlebench-agent-adapter
