feat(kubeflow): add setup_commands run once per pod before the job #545
Draft
ko3n1g wants to merge 2 commits into
KubeflowExecutor.setup_commands is a list of shell commands rendered into the generated launch.sh between the /nemo_run symlink and the training command. launch.sh runs once per pod before torchrun spawns the per-GPU ranks, so each setup command executes exactly once per node (not per rank), under errexit.

Use case: install a dependency missing from the container image into the container venv (e.g. a broken release-candidate image) without rebuilding, when running synced code via the workdir data-mover.

Signed-off-by: oliver könig <okoenig@nvidia.com>
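The ordering described in this commit message can be sketched as follows. This is only an illustration, not the PR's actual template: `render_launch_script`, the `code_dir` default, and the exact `ln -sfn` symlink line are assumptions made for the example. Only the ordering (errexit, then the `/nemo_run` symlink, then the setup commands, then the training command) comes from the description.

```python
from typing import Sequence


def render_launch_script(
    training_command: str,
    setup_commands: Sequence[str] = (),
    code_dir: str = "/workdir/code",  # hypothetical PVC path, not from the PR
) -> str:
    """Render a launch.sh-like script (hypothetical helper for illustration)."""
    lines = [
        "#!/bin/bash",
        "set -e",  # errexit: a failing setup command aborts the script
        f"ln -sfn {code_dir} /nemo_run",  # assumed form of the /nemo_run symlink
    ]
    # Setup commands sit between the symlink and the training command, so they
    # run once per pod, before torchrun spawns the per-GPU ranks.
    lines.extend(setup_commands)
    lines.append(training_command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_launch_script(
            "torchrun train.py",
            ["uv pip install nvidia-resiliency-ext==0.6.0"],
        )
    )
```

With an empty `setup_commands`, the sketch emits the same script as it would without the feature, which matches the "empty by default" behaviour the PR describes.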
On Kubernetes the training pods write every recipe output to the shared workdir PVC, including the PyTorch profiler chrome trace and CUDA memory snapshot, which land under /nemo_run (the PVC code_dir). The launcher that collects artifacts and parses logs only sees the local job_dir, so without a copy-back those outputs are stranded on the PVC and never reach CI artifacts.

Override KubeflowExecutor.cleanup() to reuse the existing pull_results() data-mover, mirroring code_dir back to job_dir before teardown. Best-effort: a failed pull never breaks cleanup. Gated on workdir_pvc, so non-PVC (slurm/local) runs are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
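The control flow of the cleanup override described above might look roughly like this. `SketchExecutor` is a hypothetical stand-in, not the real `KubeflowExecutor`: its fields, the `_teardown` helper, and the argument-free `cleanup()` signature are assumptions, and `pull_results()` is stubbed out here instead of running a data-mover.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class SketchExecutor:
    """Hypothetical stand-in for the executor; only the control flow matters."""

    job_dir: str
    workdir_pvc: Optional[str] = None
    pulled: bool = False
    torn_down: bool = False

    def pull_results(self) -> None:
        # Stub: the real data-mover would mirror code_dir back to job_dir.
        self.pulled = True

    def _teardown(self) -> None:
        # Stub for whatever teardown the executor normally performs.
        self.torn_down = True

    def cleanup(self) -> None:
        # Gated on workdir_pvc: non-PVC runs skip the copy-back entirely.
        if self.workdir_pvc:
            try:
                self.pull_results()
            except Exception:
                # Best-effort: log the failure and continue with teardown.
                logger.warning("pull_results failed; continuing", exc_info=True)
        self._teardown()
```

Catching a broad `Exception` is deliberate in this sketch: the commit states that a failed pull must never break cleanup, so teardown runs regardless of how the copy-back fails.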
Claude summary
Background / motivation
When running synced code via the `workdir_pvc` data-mover (no image rebuild), a container may be missing a dependency (e.g. a broken release-candidate image). There was no hook to run a command once per pod before the job.
What changed
`KubeflowExecutor.setup_commands: list[str]` holds shell commands rendered into the generated `launch.sh` between the `/nemo_run` symlink and the training command.
Details
`launch.sh` runs once per pod, before `torchrun` spawns the per-GPU ranks, so each command executes exactly once per node (not once per rank) and under `set -e` (a failure aborts the pod). Empty by default, so existing launch scripts are unchanged.
Example: `setup_commands=["uv pip install nvidia-resiliency-ext==0.6.0"]` patches a missing dependency into the container venv without rebuilding the image.
Tested
`setup_commands` (valid bash either way). End-to-end validation is pending via a Megatron-Bridge K8s job that sets it.
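One way to check that a rendered script is "valid bash" is a parse-only pass with `bash -n`. The PR does not say how its tests validate the script, so the helper below (`bash_syntax_ok`, a hypothetical name) is just a sketch of that idea. It returns `None` when no `bash` binary is available on the machine.

```python
import shutil
import subprocess
from typing import Optional


def bash_syntax_ok(script: str) -> Optional[bool]:
    """Parse `script` with `bash -n` (no execution); None if bash is missing."""
    bash = shutil.which("bash")
    if bash is None:
        return None
    # With -n, bash reads the script from stdin and only checks its syntax.
    result = subprocess.run(
        [bash, "-n"], input=script, text=True, capture_output=True
    )
    return result.returncode == 0


if __name__ == "__main__":
    without_setup = "set -e\ntorchrun train.py\n"
    with_setup = (
        "set -e\nuv pip install nvidia-resiliency-ext==0.6.0\ntorchrun train.py\n"
    )
    print(bash_syntax_ok(without_setup), bash_syntax_ok(with_setup))
```

A syntax check like this only catches malformed shell. It does not show that a setup command succeeds inside the container, which is what the pending end-to-end job would cover.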