
data-eng: eval.sh wrapper to keep /opt/env on PATH for agent self-evals#7

Draft
shauryr wants to merge 1 commit into main from feature/eval-wrapper

Collaborator

@shauryr shauryr commented May 29, 2026

Problem

Inside the apptainer container for the data-eng prompt, run_task.sh passes PATH="/opt/env/local/bin:/opt/env/bin:..." through apptainer exec --env so the bind-mounted Python env's CLIs (including vllm) are reachable. The vllm binary itself is present at /opt/env/local/bin/vllm.

However, the codex CLI runs every shell command via bash -lc "..." (login shell). A login shell sources /etc/profile and then the user's login startup file (~/.bash_profile, ~/.bash_login, or ~/.profile, which commonly sources ~/.bashrc in turn). These reset PATH to the container's defaults and drop /opt/env/local/bin, so the agent sees vllm: command not found even though the binary is mounted and executable.
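The PATH reset described above can be checked from inside the container by comparing a plain shell with a login shell. The following is a minimal sketch, not part of this PR: `path_contains` and `report_path` are hypothetical helpers, and whether the login shell actually drops the entry depends on the container's /etc/profile and login startup files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if directory $1 is an entry of the
# colon-separated list $2 (defaults to the current PATH when $2 is unset).
path_contains() {
  local dir="$1" list="${2-$PATH}"
  case ":${list}:" in
    *":${dir}:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: start a child bash with the given flags and an
# injected /opt/env/local/bin entry, then report whether the entry survived.
# Assumes the startup files print nothing to stdout.
report_path() {
  local label="$1" flags="$2" seen
  seen="$(PATH="/opt/env/local/bin:${PATH}" bash "${flags}" 'printf %s "$PATH"' 2>/dev/null)"
  if path_contains /opt/env/local/bin "${seen}"; then
    echo "${label}: /opt/env/local/bin kept"
  else
    echo "${label}: /opt/env/local/bin dropped"
  fi
}

report_path "non-login (bash -c)" -c
report_path "login (bash -lc)" -lc
```

If the login line reports "dropped" while the non-login line reports "kept", the startup files are the ones rewriting PATH.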

Why it matters

The V2 discipline framework (separate, unmerged branch feature/v2-discipline) requires every experiment to self-eval via evaluate.py --json-output-file experiments/exp_<N>/eval_result.json. inspect_ai spawns a local vllm server. If vllm isn't on PATH, the eval fails, the agent can't fill in ## Outcome: eval_after: <X>, and publish_experiment.py refuses the row.
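One way to surface this failure early, rather than partway through an eval, is a preflight check that the required CLIs resolve on PATH. This is a sketch under assumptions: `require_cmds` is a hypothetical helper and is not part of evaluate.py, inspect_ai, or this PR.

```shell
#!/usr/bin/env bash
# Hypothetical preflight: succeed only if every named command resolves on PATH.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "${cmd}" >/dev/null 2>&1; then
      echo "preflight: '${cmd}' not found on PATH" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example: check before launching the self-eval.
if require_cmds python3 vllm; then
  echo "preflight ok"
else
  echo "preflight failed; PATH=${PATH}" >&2
fi
```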

Observed in both V1 (6h) and V2 (12h clean-slate) data-eng pilots: agents rediscover the issue and manually prefix with PATH=/opt/env/local/bin:$PATH python3 evaluate.py ..., burning ~5 min of exploration each time.

Fix

This PR adds a small eval.sh wrapper to src/eval/general/ that re-asserts the bind-mounted PATH before exec'ing evaluate.py:

export PATH="/opt/env/local/bin:/opt/env/bin:${PATH}"
exec python3 /home/ben/task/evaluate.py "$@"

run_task.sh copies the wrapper into ${JOB_DIR}/task/ only when POST_TRAIN_BENCH_PROMPT=data_eng_prompt, so default-prompt runs are unchanged.
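The gating described above could look roughly like the following. This is an illustrative sketch only: `copy_eval_wrapper` and its arguments are hypothetical, and the actual block in run_task.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the gating described above.
# Usage: copy_eval_wrapper <dir containing eval.sh> <job dir>
copy_eval_wrapper() {
  local src_dir="$1" job_dir="$2"
  # Only data-eng runs get the wrapper; other prompts are left untouched.
  if [ "${POST_TRAIN_BENCH_PROMPT:-}" != "data_eng_prompt" ]; then
    return 0
  fi
  mkdir -p "${job_dir}/task"
  cp "${src_dir}/eval.sh" "${job_dir}/task/eval.sh"
  chmod +x "${job_dir}/task/eval.sh"
}
```

Returning early for every other prompt value keeps the default-prompt workspace identical to what it was before the wrapper existed.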

Agents call bash eval.sh ... instead of python3 evaluate.py ....
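If the wrapper is ever invoked from a shell where PATH is already correct, an unconditional prepend adds duplicate entries. A possible variant, shown here as a sketch and not as the wrapper shipped in this PR, prepends each directory only when it is missing; `prepend_path` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
prepend_path() {
  local dir="$1"
  case ":${PATH}:" in
    *":${dir}:"*) ;;                      # already present, leave PATH alone
    *) PATH="${dir}${PATH:+:${PATH}}" ;;  # otherwise put it first
  esac
}

# Prepend in reverse so /opt/env/local/bin ends up ahead of /opt/env/bin.
prepend_path /opt/env/bin
prepend_path /opt/env/local/bin
export PATH
echo "${PATH}"
```

One trade-off: a directory that is already on PATH but late in the list is left where it is, whereas the unconditional prepend always puts it first. Where lookup order matters more than a tidy PATH, the unconditional form is the safer choice.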

Scope

  • New file: src/eval/general/eval.sh (executable)
  • src/run_task.sh: +6 lines inside the existing data-eng-gated cp block

Diff: 2 files, +24 insertions.

Out of scope

The data_eng_prompt.txt update telling the agent to USE the wrapper lives in the V2 branch and will land with that PR. This PR ships only the infrastructure piece so it's usable independently and can land before V2 is ready.

Draft

Marked draft because:

  • V1 pilots already work around the issue manually, so this is a quality-of-life fix, not a blocker.
  • The V2 branch (which would actually exercise the wrapper at scale via the locked Step 7 self-eval) is still under review/iteration.

The codex CLI runs every shell command via `bash -lc "..."` (login
shell), which sources /etc/profile + ~/.bashrc and overwrites PATH —
stripping the apptainer-injected /opt/env/local/bin entry where the
bind-mounted `vllm` CLI lives. As a result agents see
`vllm: command not found` and inspect_ai can't spawn a local server.

Observed in both V1 (6h) and V2 (12h clean-slate) data-eng pilots:
agents rediscover the issue and manually prefix commands with
`PATH=/opt/env/local/bin:$PATH ...`, burning ~5 min of exploration
each time.

This adds a small `eval.sh` wrapper that re-asserts PATH and execs
`python3 evaluate.py "$@"`. Copied into the task workspace only when
POST_TRAIN_BENCH_PROMPT=data_eng_prompt, so default-prompt runs are
unchanged. The data_eng_prompt.txt update to actually USE the wrapper
lives in feature/v2-discipline (separate PR).
