perf: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM) #8613
Open
ganeshkumarashok wants to merge 1 commit into
… image cleanup, defer DCGM)

Three low-risk CSE-time optimizations for GPU nodes, none of which change the default driver install behavior:

1. Skip the redundant `ctr image pull` in configGPUDrivers() when the driver image is already present locally. The image is normally pre-pulled into the VHD, so the boot-time pull was paying a wasted manifest/layer round trip to MCR; we still pull as a fallback when the image is genuinely missing.
2. Drop `--sync` from the post-install `ctr images rm` so containerd garbage collection happens asynchronously instead of blocking provisioning. The image reference is still removed to reclaim disk.
3. Start nvidia-dcgm and nvidia-dcgm-exporter with systemctlEnableAndStartNoBlock and treat a slow/failed start as non-fatal. These are telemetry only and do not gate GPU workload scheduling. The nvidia-device-plugin start stays blocking and fatal because it gates the node advertising GPUs to the scheduler.

Adds shellspec coverage for startNvidiaManagedExpServices asserting the device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and do not fail provisioning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR optimizes Ubuntu GPU node provisioning in the Linux CSE path by avoiding redundant container image pulls, reducing post-install blocking work, and moving non-critical DCGM telemetry startup off the provisioning critical path.
Changes:
- Skip pulling the NVIDIA driver container image when it’s already present in containerd.
- Make driver image cleanup non-blocking by dropping `--sync` from `ctr images rm`.
- Start `nvidia-dcgm` and `nvidia-dcgm-exporter` asynchronously and treat enqueue failures as non-fatal, while keeping `nvidia-device-plugin` blocking/fatal.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_config.sh | Adds a local-image presence check before pulling, makes image cleanup async, and defers DCGM/exporter startup off the critical path with non-fatal handling. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec coverage ensuring device-plugin remains blocking while DCGM/exporter are started via the no-block path and failures don’t fail provisioning. |

    ctr -n k8s.io image pull $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG
    # The driver image is normally pre-pulled into the VHD; only hit the registry when it is
    # actually missing so provisioning doesn't pay a redundant manifest/layer round trip.
    if ! ctr -n k8s.io images ls -q | grep -qx "$NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG"; then
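
One thing worth noting about the presence check in the snippet above: `grep -qx` treats the image reference as a regular expression, so each `.` in a registry hostname matches any character. A minimal sketch of a fixed-string variant follows. The helper name `image_in_list` is hypothetical and not part of the PR; it reads the reference list from stdin so it can be exercised without containerd.

```shell
# Hypothetical helper (an assumption for illustration, not code from the PR).
# Succeeds when the exact image reference in $1 appears as a whole line on stdin.
#   -F  match the pattern as a fixed string, not a regular expression
#   -x  require the whole line to match
#   -q  print nothing; report the result through the exit status only
image_in_list() {
    grep -qxF -- "$1"
}

# Illustrative wiring with the variables from the snippet above (not executed here):
#   ref="${NVIDIA_DRIVER_IMAGE}:${NVIDIA_DRIVER_IMAGE_TAG}"
#   if ! ctr -n k8s.io images ls -q | image_in_list "$ref"; then
#       ctr -n k8s.io image pull "$ref"
#   fi
```

With the regex form, a reference list containing a near-miss name that differs only where the pattern has a `.` would count as present; the fixed-string form rejects it. Whether that matters in practice depends on which images can appear in the VHD.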
Summary
Three small, independent CSE-time optimizations that trim the GPU node provisioning critical path on Ubuntu. None change the default driver install behavior; each is low-risk and self-contained.
1. Skip the redundant driver image pull (`configGPUDrivers`)
The `aks-gpu-cuda` image is normally pre-pulled into the VHD, but `configGPUDrivers()` unconditionally ran `ctr image pull` at boot, a wasted manifest/layer round trip to MCR (and exposure to MCR throttling). Now we only pull when the image is genuinely absent locally; otherwise we go straight to `ctr run`.
2. Async image cleanup (drop `--sync`)
The post-install `ctr images rm --sync` blocked CSE waiting for containerd GC to finish. Dropping `--sync` removes the image reference immediately and lets GC reclaim space asynchronously: same disk outcome, no blocking.
3. Defer DCGM telemetry off the critical path
`nvidia-dcgm` and `nvidia-dcgm-exporter` are monitoring only and don't gate GPU workload scheduling, yet they were started with the blocking `systemctlEnableAndStart` and hard-exited CSE on a slow/failed start. They now use `systemctlEnableAndStartNoBlock` and are non-fatal. The `nvidia-device-plugin` start stays blocking and fatal because it gates the node advertising GPUs to the scheduler.
Tests
- `startNvidiaManagedExpServices`: asserts device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and don't fail provisioning. (7/7 GPU-service examples pass.)
- `go test ./pkg/agent/...`: pass.
- `make generate`: no snapshot diffs.
Risk / behavior notes
Coordination
Touches `configGPUDrivers()`, which the held PR #8612 (prebuild GPU kernel module) also edits; these two will need a light rebase against each other whenever both land. This PR is independent and can merge on its own.
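
For readers skimming the DCGM change in item 3, the control flow can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `cse_config.sh`: `start_blocking` and `start_noblock` are hypothetical stand-ins for `systemctlEnableAndStart` and `systemctlEnableAndStartNoBlock` (whose real signatures are not shown on this page), and `start_gpu_services_sketch` is a made-up name. The stubs only print what they would do, and the no-block stub deliberately fails to show that the failure is tolerated.

```shell
# Hypothetical stand-ins so the sketch runs without systemd.
start_blocking() { echo "blocking:$1"; }
start_noblock()  { echo "noblock:$1"; return 1; }   # simulate a failed enqueue

start_gpu_services_sketch() {
    # The device plugin gates the node advertising GPUs, so a failed start is fatal.
    start_blocking nvidia-device-plugin || return 1

    # Telemetry only: enqueue the start and keep going even if the enqueue fails.
    for svc in nvidia-dcgm nvidia-dcgm-exporter; do
        start_noblock "$svc" || echo "warn:$svc failed to start, continuing"
    done
    return 0
}
```

Calling `start_gpu_services_sketch` with these stubs returns success even though both telemetry starts fail. Redefining `start_blocking` to return non-zero makes the function fail before any telemetry service is touched.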