feat(kubeflow): add setup_commands run once per pod before the job#545

Draft
ko3n1g wants to merge 2 commits into main from ko3n1g/feat/kubeflow-setup-commands

@ko3n1g ko3n1g commented Jun 12, 2026

Contributor
Claude summary

Background / motivation

  • When running synced code via the Kubeflow workdir_pvc data-mover (no image rebuild), a container may be missing a dependency (e.g. a broken release-candidate image). There was no hook to run a command once per pod before the job.

What changed

  • New KubeflowExecutor.setup_commands: list[str] — shell commands rendered into the generated launch.sh between the /nemo_run symlink and the training command.

Details

  • launch.sh runs once per pod, before torchrun spawns the per-GPU ranks, so each command executes exactly once per node (not once per rank). The script runs under set -e, so a failing command aborts the pod. setup_commands is empty by default, which leaves existing launch scripts unchanged.
  • Typical use: setup_commands=["uv pip install nvidia-resiliency-ext==0.6.0"] to patch a missing dep into the container venv without rebuilding the image.
```bash
# rendered launch.sh (excerpt)
ln -sfn <code_dir> /nemo_run
echo "Running setup commands..."
uv pip install nvidia-resiliency-ext==0.6.0
echo "Starting training command..."
...
```
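
The rendering order described above can be sketched in plain Python. This is an illustration only, not the PR's Jinja2 template: `render_launch_script`, its parameters, and the exact lines it emits are hypothetical, and the sketch assumes the "Running setup commands..." banner is emitted only when the list is non-empty.

```python
from typing import Sequence


def render_launch_script(
    code_dir: str,
    training_command: str,
    setup_commands: Sequence[str] = (),
) -> str:
    """Hypothetical stand-in for the launch.sh render (not the real template)."""
    lines = [
        "#!/bin/bash",
        "set -e  # a failing setup command aborts the pod",
        f"ln -sfn {code_dir} /nemo_run",
    ]
    if setup_commands:
        # Runs once per pod, before the training command spawns per-GPU ranks.
        lines.append('echo "Running setup commands..."')
        lines.extend(setup_commands)
    lines.append('echo "Starting training command..."')
    lines.append(training_command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_launch_script(
            code_dir="/workdir/code",  # assumed path, for illustration
            training_command="torchrun train.py",  # assumed command
            setup_commands=["uv pip install nvidia-resiliency-ext==0.6.0"],
        )
    )
```

With the default empty `setup_commands`, the sketch emits no setup section at all, which mirrors the "no change to existing launch scripts" behavior stated above.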

Tested

  • Jinja2 render verified for both populated and empty setup_commands (valid bash either way). End-to-end validation pending via a Megatron-Bridge K8s job that sets it.

KubeflowExecutor.setup_commands is a list of shell commands rendered into the
generated launch.sh between the /nemo_run symlink and the training command.
launch.sh runs once per pod before torchrun spawns the per-GPU ranks, so each
setup command executes exactly once per node (not per rank), under errexit.

Use case: install a dependency missing from the container image into the
container venv (e.g. a broken release-candidate image) without rebuilding,
when running synced code via the workdir data-mover.

Signed-off-by: oliver könig <okoenig@nvidia.com>

On Kubernetes the training pods write every recipe output to the shared
workdir PVC, including the PyTorch profiler chrome trace and CUDA memory
snapshot which land under /nemo_run (the PVC code_dir). The launcher that
collects artifacts and parses logs only sees the local job_dir, so without
a copy-back those outputs are stranded on the PVC and never reach CI
artifacts.

Override KubeflowExecutor.cleanup() to reuse the existing pull_results()
data-mover, mirroring code_dir back to job_dir before teardown. Best-effort:
a failed pull never breaks cleanup. Gated on workdir_pvc, so non-PVC
(slurm/local) runs are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
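
The best-effort copy-back described in the commit message above could look roughly like the following. This is a minimal sketch under stated assumptions: `SketchExecutor`, `teardown()`, and the `events` list are hypothetical stand-ins, and the real `KubeflowExecutor.cleanup()` and `pull_results()` signatures may differ.

```python
import logging
from dataclasses import dataclass, field
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class SketchExecutor:
    """Hypothetical stand-in; not the real KubeflowExecutor."""

    workdir_pvc: Optional[str] = None
    events: List[str] = field(default_factory=list)

    def pull_results(self) -> None:
        # Placeholder for the data-mover that mirrors code_dir back to job_dir.
        self.events.append("pull")

    def teardown(self) -> None:
        # Placeholder for whatever teardown the base cleanup performs.
        self.events.append("teardown")

    def cleanup(self) -> None:
        # Gated on workdir_pvc: non-PVC runs skip the copy-back entirely.
        if self.workdir_pvc:
            try:
                self.pull_results()
            except Exception:
                # Best-effort: log and continue so teardown always runs.
                logger.warning("pull_results failed; continuing cleanup", exc_info=True)
        self.teardown()
```

Catching a broad `Exception` is deliberate in this sketch: the stated goal is that a failed pull never breaks cleanup, so any error from the copy-back is logged and teardown still proceeds.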