feat(kubeflow): add setup_commands run once per pod before the job by ko3n1g · Pull Request #545 · NVIDIA-NeMo/Run

ko3n1g · 2026-06-12T12:20:13Z

Claude summary

Background / motivation

When running synced code via the Kubeflow workdir_pvc data-mover (no image rebuild), the container image may be missing a dependency (e.g. when using a broken release-candidate image). Until now there was no hook to run a command once per pod before the job.

What changed

New KubeflowExecutor.setup_commands: list[str] — shell commands rendered into the generated launch.sh between the /nemo_run symlink and the training command.

Details

launch.sh runs once per pod, before torchrun spawns the per-GPU ranks, so each command executes exactly once per node (not once per rank). The script runs under set -e, so a failing setup command aborts the pod. setup_commands is empty by default, which leaves existing launch scripts unchanged.
Typical use: setup_commands=["uv pip install nvidia-resiliency-ext==0.6.0"] to patch a missing dep into the container venv without rebuilding the image.

# rendered launch.sh (excerpt)
ln -sfn <code_dir> /nemo_run
echo "Running setup commands..."
uv pip install nvidia-resiliency-ext==0.6.0
echo "Starting training command..."
...
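The rendering order described above can be sketched in plain Python. This is only an illustration: the real template is Jinja2, and `render_launch_script`, its parameters, and the decision to emit both echo lines only when setup commands are present are assumptions made for this sketch, not code from the PR.

```python
from __future__ import annotations


def render_launch_script(
    code_dir: str,
    training_command: str,
    setup_commands: list[str] | None = None,
) -> str:
    """Build a launch.sh-like script (hypothetical stand-in for the Jinja2 template)."""
    lines = [
        "#!/bin/bash",
        "set -e",  # any failing command, including a setup command, aborts the script
        f"ln -sfn {code_dir} /nemo_run",
    ]
    if setup_commands:
        # Setup commands sit between the symlink and the training command.
        lines.append('echo "Running setup commands..."')
        lines.extend(setup_commands)
        lines.append('echo "Starting training command..."')
    # With no setup commands, nothing extra is emitted (assumed behavior).
    lines.append(training_command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_launch_script(
            "/workdir/code",  # hypothetical code_dir
            "torchrun train.py",  # hypothetical training command
            ["uv pip install nvidia-resiliency-ext==0.6.0"],
        )
    )
```

Because the script is executed once per pod and torchrun is only started by the final line, anything placed in `setup_commands` runs before any rank exists.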

Tested

Jinja2 render verified for both populated and empty setup_commands (valid bash either way). End-to-end validation pending via a Megatron-Bridge K8s job that sets it.
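One way such a "valid bash either way" check could be automated is to pass the rendered text through `bash -n`, which parses a script without executing it. The helper below is a sketch under that assumption; `bash_syntax_ok` is a hypothetical name and the PR does not say how the render was verified.

```python
import shutil
import subprocess
from typing import Optional


def bash_syntax_ok(script: str) -> Optional[bool]:
    """Return True/False from `bash -n`, or None if bash is not installed."""
    bash = shutil.which("bash")
    if bash is None:
        return None
    # -n makes bash read and parse the commands from stdin without running them.
    result = subprocess.run([bash, "-n"], input=script, text=True, capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    populated = 'set -e\necho "Running setup commands..."\nuv pip install nvidia-resiliency-ext==0.6.0\n'
    # An empty compound command is the kind of breakage an empty list could cause.
    broken = "if true; then\nfi\n"
    print(bash_syntax_ok(populated), bash_syntax_ok(broken))
```

The empty case matters because a template that wraps the commands in a block such as `if ...; then ... fi` would produce a syntax error when the list is empty.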

KubeflowExecutor.setup_commands is a list of shell commands rendered into the generated launch.sh between the /nemo_run symlink and the training command. launch.sh runs once per pod before torchrun spawns the per-GPU ranks, so each setup command executes exactly once per node (not per rank), under errexit.

Use case: install a dependency missing from the container image into the container venv (e.g. a broken release-candidate image) without rebuilding, when running synced code via the workdir data-mover.

Signed-off-by: oliver könig <okoenig@nvidia.com>

On Kubernetes the training pods write every recipe output to the shared workdir PVC, including the PyTorch profiler chrome trace and CUDA memory snapshot, which land under /nemo_run (the PVC code_dir). The launcher that collects artifacts and parses logs only sees the local job_dir, so without a copy-back those outputs are stranded on the PVC and never reach CI artifacts.

Override KubeflowExecutor.cleanup() to reuse the existing pull_results() data-mover, mirroring code_dir back to job_dir before teardown. Best-effort: a failed pull never breaks cleanup. Gated on workdir_pvc, so non-PVC (slurm/local) runs are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
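The best-effort, gated copy-back described in this commit can be sketched as follows. `SketchExecutor` is a hypothetical stand-in rather than the real KubeflowExecutor, and the argument-free signatures of `cleanup()`, `pull_results()`, and `teardown()` are assumptions made to keep the example self-contained.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class SketchExecutor:
    """Hypothetical stand-in for KubeflowExecutor; not the real class."""

    workdir_pvc: Optional[str] = None

    def pull_results(self) -> None:
        """Placeholder for the data-mover that mirrors code_dir back to job_dir."""
        raise NotImplementedError

    def teardown(self) -> None:
        """Placeholder for the pre-existing teardown logic."""

    def cleanup(self) -> None:
        # Gate on workdir_pvc so non-PVC runs skip the copy-back entirely.
        if self.workdir_pvc:
            try:
                self.pull_results()
            except Exception:
                # Best-effort: log the failure and carry on with teardown.
                logger.warning("pull_results failed; continuing cleanup", exc_info=True)
        self.teardown()
```

Catching a broad `Exception` is deliberate in this sketch: the copy-back only serves artifact collection, so a failure there should not prevent the pods from being torn down.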

ko3n1g deployed to public June 13, 2026 17:04 — with GitHub Actions Active

ko3n1g wants to merge 2 commits into main from ko3n1g/feat/kubeflow-setup-commands
