data-eng: eval.sh wrapper to keep /opt/env on PATH for agent self-evals #7
Draft
shauryr wants to merge 1 commit into
The codex CLI runs every shell command via `bash -lc "..."` (login shell), which sources /etc/profile + ~/.bashrc and overwrites PATH — stripping the apptainer-injected /opt/env/local/bin entry where the bind-mounted `vllm` CLI lives. As a result agents see `vllm: command not found` and inspect_ai can't spawn a local server. Observed in both V1 (6h) and V2 (12h clean-slate) data-eng pilots: agents rediscover the issue and manually prefix commands with `PATH=/opt/env/local/bin:$PATH ...`, burning ~5 min of exploration each time. This adds a small `eval.sh` wrapper that re-asserts PATH and execs `python3 evaluate.py "$@"`. Copied into the task workspace only when POST_TRAIN_BENCH_PROMPT=data_eng_prompt, so default-prompt runs are unchanged. The data_eng_prompt.txt update to actually USE the wrapper lives in feature/v2-discipline (separate PR).
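The PATH reset described above can be checked from inside a container with a small diagnostic. This is a minimal sketch, not part of the PR: `path_has` and `check_login_path` are hypothetical helper names, and what the login shell reports depends entirely on the container's own `/etc/profile` and dotfiles.

```shell
# Return 0 if directory $1 appears as an entry in the colon-separated list $2.
path_has() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether directory $1 survives in PATH for a plain shell (bash -c)
# and for a login shell (bash -lc). Hypothetical helper for illustration.
check_login_path() {
    local dir="$1" plain login
    plain="$(PATH="$dir:$PATH" bash -c 'printf %s "$PATH"')"
    login="$(PATH="$dir:$PATH" bash -lc 'printf %s "$PATH"')"
    path_has "$dir" "$plain" && echo "bash -c : kept $dir" || echo "bash -c : lost $dir"
    path_has "$dir" "$login" && echo "bash -lc: kept $dir" || echo "bash -lc: lost $dir"
}
```

Running `check_login_path /opt/env/local/bin` prints one line per shell mode; a `lost` line for `bash -lc` would match the behavior reported here.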
Problem
Inside the apptainer container for the data-eng prompt, `run_task.sh` passes `PATH="/opt/env/local/bin:/opt/env/bin:..."` through `apptainer exec --env` so the bind-mounted Python env's CLIs (including `vllm`) are reachable. The `vllm` binary itself is present at `/opt/env/local/bin/vllm`.

However, codex CLI runs every shell command via `bash -lc "..."` (login shell). The login flag sources `/etc/profile` + `~/.bashrc`, which overwrites PATH with the container's defaults and strips out `/opt/env/local/bin`. So the agent sees `vllm: command not found` even though the binary is mounted and executable.

Why it matters
The V2 discipline framework (separate, unmerged branch `feature/v2-discipline`) requires every experiment to self-eval via `evaluate.py --json-output-file experiments/exp_<N>/eval_result.json`. `inspect_ai` spawns a local `vllm` server. If `vllm` isn't on PATH, the eval fails, the agent can't fill in `## Outcome: eval_after: <X>`, and `publish_experiment.py` refuses the row.

Observed in both V1 (6h) and V2 (12h clean-slate) data-eng pilots: agents rediscover the issue and manually prefix with `PATH=/opt/env/local/bin:$PATH python3 evaluate.py ...`, burning ~5 min of exploration each time.

Fix
A small `eval.sh` wrapper added to `src/eval/general/` that re-asserts the bind-mounted PATH before exec'ing `evaluate.py`.

`run_task.sh` copies the wrapper into `${JOB_DIR}/task/` only when `POST_TRAIN_BENCH_PROMPT=data_eng_prompt`, so default-prompt runs are unchanged.

Agents call `bash eval.sh ...` instead of `python3 evaluate.py ...`.

Scope
- `src/eval/general/eval.sh` (executable)
- `src/run_task.sh`: +6 lines inside the existing data-eng-gated `cp` block

Diff: 2 files, +24 insertions.
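For orientation, the two pieces in scope could look roughly like the sketch below. This is not the diff from this PR: the function names `prepend_path`, `run_eval`, and `install_eval_wrapper` are hypothetical, and the idempotent prepend (skip a directory already on PATH) is an assumption about how "re-asserting" PATH might be done.

```shell
# Prepend each directory to PATH unless it is already an entry.
# Arguments are processed in order, so the last one ends up first on PATH.
prepend_path() {
    local dir
    for dir in "$@"; do
        case ":${PATH}:" in
            *":${dir}:"*) ;;                  # already present, leave PATH alone
            *) PATH="${dir}:${PATH}" ;;
        esac
    done
    export PATH
}

# Wrapper body (assumed shape): restore the bind-mounted dirs, then hand off.
# exec replaces the shell, so evaluate.py's exit code becomes the wrapper's.
run_eval() {
    prepend_path /opt/env/bin /opt/env/local/bin
    exec python3 evaluate.py "$@"
}

# Gated copy (assumed shape of the run_task.sh addition).
# $1 = path to the wrapper source, $2 = job directory (JOB_DIR in the PR).
install_eval_wrapper() {
    local src="$1" job_dir="$2"
    if [ "${POST_TRAIN_BENCH_PROMPT:-}" = "data_eng_prompt" ]; then
        cp "$src" "${job_dir}/task/eval.sh"
        chmod +x "${job_dir}/task/eval.sh"
    fi
}
```

Under this sketch, an `eval.sh` file would end with `run_eval "$@"`, and `run_task.sh` would call `install_eval_wrapper` next to its existing data-eng `cp` lines. Skipping directories already on PATH keeps repeated calls from growing the variable, at the cost of not moving an existing entry to the front.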
Out of scope
The `data_eng_prompt.txt` update telling the agent to USE the wrapper lives in the V2 branch and will land with that PR. This PR ships only the infrastructure piece so it's usable independently and can land before V2 is ready.

Draft
Marked draft because: