docs: update KubeflowExecutor guide and add e2e example #496

Merged
ko3n1g merged 4 commits into main from ko3n1g/docs/kubeflow-executor on Apr 22, 2026

ko3n1g commented on Apr 22, 2026

Claude summary

Summary

  • Removes the outdated PyTorchJob (Training Operator v1) docs and job_kind parameter — KubeflowExecutor only supports TrainJob (trainer.kubeflow.org/v1alpha1)
  • Replaces the two-snippet example with a single realistic configuration drawn from local/real_trainjob.py
  • Adds an Advanced options table for less-obvious parameters (nprocs_per_node, extra_resource_requests/limits, pod_spec_overrides, container_kwargs, workdir_local_path)
  • Adds examples/kubeflow/hello_kubeflow.py — a self-contained, runnable end-to-end script with:
    • CLI flags for namespace, image, node count, and PVC name
    • Full executor setup: dshm volume, secret injection via env_list, workdir PVC sync
    • SIGINT/SIGTERM handlers registered before job submission that call executor.cancel(wait=True) for graceful cleanup
    • Live log streaming via run.Experiment with tail_logs=True
  • Updates the guide to link to the new example
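
For reviewers, the signal-handling pattern described above can be sketched roughly as follows. This is a minimal illustration, not the contents of examples/kubeflow/hello_kubeflow.py: `StubExecutor` and `install_cancel_handlers` are hypothetical names, and the only behavior taken from this PR is that the handlers are registered before submission and call `cancel(wait=True)`.

```python
import signal
import sys


class StubExecutor:
    """Hypothetical stand-in for KubeflowExecutor; only cancel() is modeled."""

    def __init__(self):
        self.cancel_calls = []

    def cancel(self, wait=False):
        # A real executor would tear down the TrainJob; this stub only
        # records the call so the pattern can be exercised locally.
        self.cancel_calls.append(wait)


def install_cancel_handlers(executor):
    """Register SIGINT/SIGTERM handlers that cancel the job, then exit."""

    def _handler(signum, frame):
        executor.cancel(wait=True)
        # 128 + signal number is the conventional shell exit status
        # for a process terminated by that signal.
        sys.exit(128 + signum)

    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _handler)
```

Calling `install_cancel_handlers(executor)` before submitting the job means a Ctrl-C that arrives during or after submission still reaches `cancel()`. Registering the handlers only after submission would leave a window in which an interrupt orphans the job.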

Test plan

  • Docs render correctly (MyST/Sphinx preview)
  • python examples/kubeflow/hello_kubeflow.py --help exits cleanly (no import errors)
  • Signal handler calls executor.cancel() on Ctrl-C before the job finishes
  • Example parameters match the current KubeflowExecutor dataclass fields
  • No references to job_kind or PyTorchJob remain in the section
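
The last two checks in the test plan could be automated along these lines. This is a sketch under assumptions: `StubExecutor` is a hypothetical dataclass standing in for KubeflowExecutor (the field names come from the Advanced options list above, while their types and defaults are guesses), and `unknown_fields` and `find_stale_terms` are illustrative helper names rather than anything shipped in this PR.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class StubExecutor:
    """Hypothetical stand-in; the real KubeflowExecutor has more fields."""

    nprocs_per_node: int = 1                               # assumed type/default
    workdir_local_path: Optional[str] = None               # assumed type/default
    container_kwargs: dict = field(default_factory=dict)   # assumed type/default


def unknown_fields(cls, kwargs):
    """Return the kwargs keys that are not dataclass fields of cls, sorted."""
    known = {f.name for f in fields(cls)}
    return sorted(set(kwargs) - known)


def find_stale_terms(text, terms=("job_kind", "PyTorchJob")):
    """Return (line_number, term) pairs for each removed term still present."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for term in terms:
            if term in line:
                hits.append((lineno, term))
    return hits
```

Running `unknown_fields` against the keyword arguments used in the guide's snippet would flag any parameter that no longer exists on the executor, and `find_stale_terms` applied to the KubeflowExecutor section of docs/guides/execution.md should return an empty list once the cleanup is complete.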

Remove the outdated PyTorchJob (v1) example and job_kind parameter
references — KubeflowExecutor now only supports TrainJob
(trainer.kubeflow.org/v1alpha1). Replace the two-snippet example with
a single, comprehensive configuration derived from the real-world
local/real_trainjob.py, covering env_list, tolerations, volumes,
workdir_pvc, and image_pull_secrets. Add an advanced-options table for
less-common parameters (nprocs_per_node, extra_resource_requests,
pod_spec_overrides, container_kwargs, workdir_local_path).

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Add examples/kubeflow/hello_kubeflow.py — a self-contained script that
shows the complete executor setup (Torchrun launcher, env_list, tolerations,
dshm volume, PVC workdir sync) with CLI flags for namespace, image, node
count, and PVC name. Update docs/guides/execution.md to link to the new
example after the configuration snippet.

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
ko3n1g changed the title from "docs: update KubeflowExecutor guide to TrainJob v2 only" to "docs: update KubeflowExecutor guide and add e2e example" on Apr 22, 2026
…example

Register SIGINT/SIGTERM handlers before job submission so Ctrl-C or pod
eviction triggers executor.cancel(wait=True). Switch from run.run() to
run.Experiment so tail_logs=True can be passed to exp.run(), streaming
pod logs back to the terminal.

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
ko3n1g merged commit a8425c9 into main on Apr 22, 2026
24 checks passed