refactor: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM) #8615
Open
ganeshkumarashok wants to merge 3 commits into
… image cleanup, defer DCGM)

Three low-risk CSE-time optimizations for GPU nodes, none of which change the default driver install behavior:

1. Skip the redundant `ctr image pull` in configGPUDrivers() when the driver image is already present locally. The image is normally pre-pulled into the VHD, so the boot-time pull was paying a wasted manifest/layer round trip to MCR; we still pull as a fallback when the image is genuinely missing.
2. Drop `--sync` from the post-install `ctr images rm` so containerd garbage collection happens asynchronously instead of blocking provisioning. The image reference is still removed to reclaim disk.
3. Start nvidia-dcgm and nvidia-dcgm-exporter with systemctlEnableAndStartNoBlock and treat a slow/failed start as non-fatal. These are telemetry only and do not gate GPU workload scheduling. The nvidia-device-plugin start stays blocking and fatal because it gates the node advertising GPUs to the scheduler.

Adds shellspec coverage for startNvidiaManagedExpServices asserting the device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and do not fail provisioning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR optimizes the Ubuntu GPU provisioning path in cse_config.sh by avoiding unnecessary container registry pulls, removing a synchronous containerd GC wait, and moving DCGM telemetry service startup off the CSE critical path (non-fatal, non-blocking).
Changes:
- Skip `ctr image pull` for the NVIDIA driver image when it already exists locally.
- Remove `--sync` from `ctr images rm` to avoid blocking on containerd garbage collection.
- Start `nvidia-dcgm` and `nvidia-dcgm-exporter` via `systemctlEnableAndStartNoBlock` and treat failures as non-fatal; keep `nvidia-device-plugin` blocking/fatal.
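As a rough illustration of the third change, the sketch below separates a blocking, fatal start from non-blocking, non-fatal ones. It is not the code from `cse_config.sh`: the two `systemctl*` functions are stubs that only record calls, and their argument shape (unit name plus a timeout) is an assumption. `start_gpu_services_sketch`, `CALLS`, `FAIL_UNITS`, and `unit_fails` are hypothetical names, and `return 1` stands in for whatever exit path the real script uses.

```shell
#!/usr/bin/env bash
# Hypothetical bookkeeping so the sketch runs without systemd.
CALLS=""        # records which helper was used for which unit
FAIL_UNITS=""   # space-separated units the stubs should report as failed

unit_fails() {
    case " $FAIL_UNITS " in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

# Stubs for the real helpers; the (unit, timeout) arguments are assumed.
systemctlEnableAndStart()        { CALLS="$CALLS block:$1";   ! unit_fails "$1"; }
systemctlEnableAndStartNoBlock() { CALLS="$CALLS noblock:$1"; ! unit_fails "$1"; }

start_gpu_services_sketch() {
    # Gates the node advertising GPUs to the scheduler: blocking, failure is fatal.
    systemctlEnableAndStart nvidia-device-plugin 30 || return 1

    # Telemetry only: enqueue the start and keep going even if the call fails.
    systemctlEnableAndStartNoBlock nvidia-dcgm 30 ||
        echo "warning: nvidia-dcgm did not start" >&2
    systemctlEnableAndStartNoBlock nvidia-dcgm-exporter 30 ||
        echo "warning: nvidia-dcgm-exporter did not start" >&2
    return 0
}
```

With `FAIL_UNITS="nvidia-dcgm nvidia-dcgm-exporter"` the function still succeeds; with `FAIL_UNITS="nvidia-device-plugin"` it fails before either DCGM unit is touched.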
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_config.sh | Adds conditional image pull, async image removal, and defers DCGM/DCGM-exporter startup off the provisioning critical path. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec coverage validating device-plugin remains blocking while DCGM services are enqueued non-blocking and are non-fatal on failure. |
Address Copilot review: `grep -qx` treats the image ref as a regex, so dots in mcr.microsoft.com/... match any char and can yield false positives that skip a genuinely-needed pull. Use `-qxF` for an exact whole-line match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
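The false positive described in this commit can be reproduced with plain `grep`. In the sketch below, the image ref and the look-alike listing entry are made up for illustration (they are not the real driver image), and `regex_match` / `fixed_match` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical ref, and a listing entry that differs only where the ref has a dot.
ref='mcr.microsoft.com/example/driver:1.0'
listing='mcrXmicrosoft.com/example/driver:1.0'

# -x: whole-line match, but the pattern is still a regex, so "." matches "X".
regex_match() { printf '%s\n' "$listing" | grep -qx -- "$ref"; }

# -F: treat the pattern as a fixed string, so only an identical line matches.
fixed_match() { printf '%s\n' "$listing" | grep -qxF -- "$ref"; }

if regex_match; then echo "grep -qx:  matched (false positive)"; fi
if ! fixed_match; then echo "grep -qxF: no match"; fi
```

Here `grep -qx` reports the image as present even though only the look-alike entry exists, which is the case that would wrongly skip the pull.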
surajssd reviewed on Jun 1, 2026
Per review feedback: instead of text-matching `ctr images ls -q` output with grep, query containerd directly with its exact-name filter (`images ls -q name==<ref>`) and test for empty output. ctr has no JSON/format output for images ls, but the native filter is the structured equivalent and avoids any text-matching pitfalls entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +1009 to +1011

    if [ -z "$(ctr -n k8s.io images ls -q "name==${NVIDIA_DRIVER_IMAGE}:${NVIDIA_DRIVER_IMAGE_TAG}")" ]; then
        ctr -n k8s.io image pull $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG
    fi
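The control flow of the quoted check can be exercised without containerd. The sketch below replaces `ctr` with a stub shell function that understands only the two invocations used here; `LOCAL_IMAGES`, `PULL_LOG`, and `pull_if_missing` are hypothetical names, and the stub's filter handling is a simplification of the real `name==` filter.

```shell
#!/usr/bin/env bash
LOCAL_IMAGES=""   # hypothetical: refs the stub treats as present locally
PULL_LOG=""       # hypothetical: refs the stub was asked to pull

# Stub for the real ctr binary. Handles only:
#   ctr -n <ns> images ls -q name==<ref>
#   ctr -n <ns> image pull <ref>
ctr() {
    case "$3 $4" in
        "images ls")
            want="${6#name==}"
            for img in $LOCAL_IMAGES; do
                if [ "$img" = "$want" ]; then printf '%s\n' "$img"; fi
            done
            ;;
        "image pull")
            PULL_LOG="$PULL_LOG $5"
            LOCAL_IMAGES="$LOCAL_IMAGES $5"
            ;;
    esac
    return 0
}

# Pull only when the exact-name query returns nothing.
pull_if_missing() {
    if [ -z "$(ctr -n k8s.io images ls -q "name==$1")" ]; then
        ctr -n k8s.io image pull "$1"
    fi
}

pull_if_missing "registry.example/driver:1.0"   # absent: the stub records one pull
pull_if_missing "registry.example/driver:1.0"   # present: no second pull
```

After both calls `PULL_LOG` holds a single entry, which is the behavior the change is after: the fallback pull still happens when the image is missing, and it is skipped once the image exists.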
Summary
Three small, independent, low-risk optimizations that remove redundant waits from the GPU node provisioning critical path on Ubuntu. None of these touch the dominant cost: the per-node DKMS driver compile plus `cp -a /opt/gpu` inside `configGPUDrivers`, which is ~95% of GPU CSE time. The intent here is hygiene and tail-latency protection, not the headline speedup. The large win (moving the driver build off the boot path / pre-laying the driver tree on the VHD) is tracked separately.

Expected impact: low single-digit seconds off the median for a default GPU node, plus meaningful p99/throttling-tail protection. The DCGM change (#3) only affects the managed-GPU experience path.
1. Skip the redundant driver image pull (`configGPUDrivers`)

The `aks-gpu-cuda` image is pre-fetched into the VHD at build time (`install-dependencies.sh` → `image-fetcher`), so it is already present locally at boot. `configGPUDrivers()` nonetheless ran `ctr image pull` unconditionally: a wasted manifest/layer round trip to MCR and exposure to MCR throttling. Now we only pull when the image is genuinely absent locally; otherwise we go straight to `ctr run`. Median savings are small; the real value is avoiding a multi-second-to-minutes stall when MCR is throttling.

2. Async image cleanup (drop `--sync`)

The post-install `ctr images rm --sync` blocked CSE waiting for containerd GC to finish before returning. Dropping `--sync` removes the image reference immediately and lets GC reclaim space asynchronously: same disk outcome, no blocking. Small, on-path saving.

3. Defer DCGM telemetry off the critical path
`nvidia-dcgm` and `nvidia-dcgm-exporter` are monitoring only and don't gate GPU workload scheduling, yet they were started with the blocking `systemctlEnableAndStart` and hard-exited CSE on a slow/failed start (up to a 30s timeout each). They now use `systemctlEnableAndStartNoBlock` and are non-fatal. The `nvidia-device-plugin` start stays blocking and fatal because it gates the node advertising GPUs to the scheduler. Note: this path only runs when the managed-GPU experience is enabled, so it does not affect default GPU nodes today.

Tests

- `startNvidiaManagedExpServices`: asserts device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and don't fail provisioning. (7/7 GPU-service examples pass.)
- `go test ./pkg/agent/...`: pass.
- `make generate`: no snapshot diffs.

Risk / behavior notes

- […] (`grep -qxF`) to avoid regex false positives from dots in the image ref.
- Because `systemctlEnableAndStartNoBlock` returns before the service actually starts, the `dcgm-exporter=enabled` node label is now a feature-intent label rather than a "metrics confirmed up" signal. Reviewers who want DCGM failures to remain fatal should flag it.

Coordination
Touches `configGPUDrivers()`, which the held PR #8612 (prebuild GPU kernel module) also edits; these two will need a light rebase against each other whenever both land. This PR is independent and can merge on its own.

Re-opened from the same repo (was #8613 from a fork; AgentBaker requires same-repo branches for CI/secrets).