Add MLE-bench agent adapter #191
Greptile Summary
This PR adds a self-contained MLE-bench agent adapter under
Confidence Score: 3/5
Two defects need fixing before the adapter can run end-to-end: a wrong path prefix in config.yaml that prevents MLE-bench from even starting the container, and a missing lower-bound guard in submission shaping that crashes the run when the prediction count is less than the test row count.
The config.yaml start/dockerfile path mismatch means the MLE-bench harness will fail to launch any container at all, so every attempted benchmark run would error out before Plexe executes. The submission-shaping length mismatch is a second, independent crash path that surfaces once the container does run. Both are on the primary execution path of the adapter.
Files needing attention: tests/benchmark/mlebench/config.yaml and the coerce_predictions_to_submission function in tests/benchmark/mlebench/plexe/run_mlebench.py
| Filename | Overview |
|---|---|
| tests/benchmark/mlebench/config.yaml | Agent config references plexe/start.sh and plexe/Dockerfile, but those files are at the directory root — the extra plexe/ prefix will cause MLE-bench to fail to locate either file. |
| tests/benchmark/mlebench/plexe/run_mlebench.py | Core adapter logic: dataset discovery, predictor loading, and submission shaping are all present and well-structured, but coerce_predictions_to_submission will raise a ValueError when the model returns fewer predictions than test rows (no lower-bound length guard). |
| tests/benchmark/mlebench/Dockerfile | Installs plexe with extras into the agent conda env; COPY . ${AGENT_DIR} correctly places plexe/run_mlebench.py at the path start.sh expects. |
| tests/benchmark/mlebench/start.sh | Correctly activates the conda env, runs ${AGENT_DIR}/plexe/run_mlebench.py, and calls the validate script; the previously-reported path bug is now fixed. |
| tests/unit/benchmark/test_mlebench_adapter.py | Unit tests cover dataset discovery, submission column shaping, submission copy, entrypoint path correctness, and model-type validation; all look correct. |
| tests/benchmark/mlebench/README.md | Clear usage documentation covering build, run, and grading steps; environment variable defaults are accurately described. |
| tests/benchmark/mlebench/plexe/__init__.py | Trivial module marker; no issues. |
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
tests/benchmark/mlebench/config.yaml:1-3
The `start` and `dockerfile` paths use a `plexe/` prefix that doesn't correspond to the actual file layout. `start.sh` and `Dockerfile` both live at the root of this directory (i.e., alongside `config.yaml`), not inside a nested `plexe/` sub-directory. When this config is placed at `agents/plexe/config.yaml` inside an MLE-bench checkout, the harness will look for `agents/plexe/plexe/start.sh` and `agents/plexe/plexe/Dockerfile`, neither of which exists, causing every container launch to fail before Plexe runs.
```suggestion
plexe:
start: start.sh
dockerfile: Dockerfile
```
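To see why the extra prefix breaks the lookup, here is a minimal sketch. It assumes, as the issue text describes, that the harness resolves `start` and `dockerfile` relative to the directory holding `config.yaml`; that rule has not been checked against the MLE-bench source, and `resolve_agent_file` is a hypothetical helper, not part of MLE-bench or Plexe.

```python
from pathlib import PurePosixPath


def resolve_agent_file(config_path: str, relative: str) -> PurePosixPath:
    """Join a path from config.yaml onto the config's own directory.

    Assumption: this mirrors how the harness locates agent files.
    """
    return PurePosixPath(config_path).parent / relative


config = "agents/plexe/config.yaml"

# Current config: the extra "plexe/" prefix yields a doubly nested path.
broken = resolve_agent_file(config, "plexe/start.sh")

# Suggested config: the file sits next to config.yaml.
fixed = resolve_agent_file(config, "start.sh")

print(broken)
print(fixed)
```

Under that assumption, the first path points into a `plexe/plexe/` directory that the described layout does not contain, while the second lands next to `config.yaml`.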
### Issue 2 of 2
tests/benchmark/mlebench/plexe/run_mlebench.py:249-255
**Prediction/test length mismatch raises ValueError**
`submission` is built with `index=range(len(test_df))`, but the prediction array is sliced to `[: len(test_df)]` — only an upper-bound guard. When the model returns fewer predictions than there are test rows (e.g., silent row-dropping in a preprocessing step), assigning a shorter numpy array to the fixed-length DataFrame column raises `ValueError: Length of values does not match length of index`, causing the entire run to fail and produce no submission. Consider padding with `NaN` (or the sample submission's default value) to fill the gap, so a partial prediction can still be written out.
Reviews (2): Last reviewed commit: "fix(benchmark): harden MLE-bench adapter..."
Signed-off-by: Ritwij Aryan Parmar <ritwij.aryan.parmar@gmail.com>
Thanks for catching this. The entrypoint path could indeed have made the adapter fail before Plexe could run. I pushed 6eb529b with the path fixed, clearer handling for a missing or unsupported I re-ran the adapter unit tests locally (6 passed) plus the runner compile and shell syntax checks.
    ) -> None:
        """Create `submission.csv` by running the packaged Plexe predictor on test rows."""

        package_dir = work_dir / "model"
        predictor = load_predictor(package_dir)
        test_df = read_tabular_sample(test_dataset)
        id_column = infer_id_column(sample_submission, test_dataset)
**Prediction/test length mismatch raises ValueError**
`submission` is built with `index=range(len(test_df))`, but the prediction array is sliced to `[: len(test_df)]`, which is only an upper-bound guard. When the model returns fewer predictions than there are test rows (e.g., silent row-dropping in a preprocessing step), assigning a shorter numpy array to the fixed-length DataFrame column raises `ValueError: Length of values does not match length of index`, causing the entire run to fail and produce no submission. Consider padding with `NaN` (or the sample submission's default value) to fill the gap, so a partial prediction can still be written out.
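One possible shape for the padding fix is sketched below. It is not the adapter's actual code: `fit_predictions_to_length`, the `target` column name, and the toy data are all hypothetical, and the real `coerce_predictions_to_submission` may organise this differently. The idea is to truncate as before, then reindex to the test length so that missing trailing rows receive the fill value.

```python
import numpy as np
import pandas as pd


def fit_predictions_to_length(predictions, n_rows, fill_value=np.nan):
    """Return exactly n_rows values: truncate extras, pad any shortfall.

    Hypothetical helper; fill_value could instead be the sample
    submission's default value for the target column.
    """
    # Upper-bound guard: drop any predictions beyond the test rows.
    values = pd.Series(np.asarray(predictions).ravel()[:n_rows])
    # Lower-bound guard: reindexing adds the missing positions with fill_value.
    return values.reindex(range(n_rows), fill_value=fill_value).to_numpy()


# Toy data: four test rows but only two predictions.
test_df = pd.DataFrame({"id": [10, 11, 12, 13]})
short_predictions = np.array([0.2, 0.9])

submission = pd.DataFrame({"id": test_df["id"]})
submission["target"] = fit_predictions_to_length(short_predictions, len(test_df))
print(submission)
```

Padding keeps the run alive and still writes a file, but it also hides the row loss, so logging the shortfall next to the existing debug metadata would probably be worth doing whenever padding happens.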
Summary
tests/benchmark/mlebench openai/mle-bench runner /home/submission/submission.csv, and writes debug metadata/logs
Why
This follows up on #124. The issue asks for a way to run Plexe on MLE-bench Lite with 10 iterations and record results. This PR does not claim benchmark results; it adds the runnable adapter needed to produce those results in the official MLE-bench harness without mixing private Kaggle/API execution state into the code change.
Testing
- `python3 -m pytest tests/unit/benchmark/test_mlebench_adapter.py -q`
- `python3 -m py_compile tests/benchmark/mlebench/plexe/run_mlebench.py`
- `python3 -m black tests/benchmark/mlebench/plexe/run_mlebench.py tests/unit/benchmark/test_mlebench_adapter.py`
Disclosure
I used AI assistance while preparing this contribution and reviewed the resulting code before opening the PR.