refactor: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM) by ganeshkumarashok · Pull Request #8615 · Azure/AgentBaker

ganeshkumarashok · 2026-06-01T17:21:30Z

Summary

Three small, independent, low-risk optimizations that remove redundant waits from the GPU node provisioning critical path on Ubuntu. None of these touch the dominant cost — the per-node DKMS driver compile + cp -a /opt/gpu inside configGPUDrivers, which is ~95% of GPU CSE time. The intent here is hygiene and tail-latency protection, not the headline speedup. The large win (moving the driver build off the boot path / pre-laying the driver tree on the VHD) is tracked separately.

Expected impact: low single-digit seconds off the median for a default GPU node, plus meaningful p99/throttling-tail protection. The DCGM change (#3) only affects the managed-GPU experience path.

1. Skip the redundant driver image pull (`configGPUDrivers`)

The aks-gpu-cuda image is pre-fetched into the VHD at build time (install-dependencies.sh → image-fetcher), so it is already present locally at boot. configGPUDrivers() nonetheless ran ctr image pull unconditionally — a wasted manifest/layer round trip to MCR and exposure to MCR throttling. Now we only pull when the image is genuinely absent locally; otherwise we go straight to ctr run. Median savings are small; the real value is avoiding a multi-second-to-minutes stall when MCR is throttling.
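The check-then-pull logic can be sketched roughly as below. This assumes the exact-name `ctr images ls` filter that the later review round settled on; `ensure_driver_image` is a hypothetical wrapper name used only for illustration, not a function in cse_config.sh.

```shell
# Hypothetical helper: pull the image only when containerd does not already have it.
# Usage: ensure_driver_image <image> <tag>
ensure_driver_image() {
    ref="$1:$2"
    # `images ls -q name==<ref>` prints the ref when it is present locally
    # and prints nothing otherwise.
    if [ -z "$(ctr -n k8s.io images ls -q "name==${ref}")" ]; then
        # Fallback for VHDs that were not pre-baked with the image.
        ctr -n k8s.io image pull "${ref}"
    fi
}
```

When the image is already present, the function returns without contacting the registry at all, which is the case that matters under MCR throttling.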

2. Async image cleanup (drop `--sync`)

The post-install ctr images rm --sync blocked CSE waiting for containerd GC to finish before returning. Dropping --sync removes the image reference immediately and lets GC reclaim space asynchronously — same disk outcome, no blocking. Small, on-path saving.
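A minimal sketch of the cleanup call after the change. The wrapper name `remove_driver_image` is hypothetical, and treating a failed removal as a warning is an assumption of this sketch rather than something the PR states.

```shell
# Hypothetical helper: drop the image reference without waiting for GC.
# Usage: remove_driver_image <image> <tag>
remove_driver_image() {
    ref="$1:$2"
    # No --sync: the reference is deleted now, and containerd reclaims the
    # unreferenced content asynchronously through garbage collection.
    # Assumption: a failed removal is only logged, not treated as fatal.
    ctr -n k8s.io images rm "${ref}" || echo "warning: could not remove ${ref}" >&2
}
```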

3. Defer DCGM telemetry off the critical path

nvidia-dcgm and nvidia-dcgm-exporter are monitoring only and don't gate GPU workload scheduling, yet they were started with the blocking systemctlEnableAndStart and hard-exited CSE on a slow/failed start (up to a 30s timeout each). They now use systemctlEnableAndStartNoBlock and are non-fatal. The nvidia-device-plugin start stays blocking and fatal because it gates the node advertising GPUs to the scheduler. Note: this path only runs when the managed-GPU experience is enabled, so it does not affect default GPU nodes today.
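The blocking versus non-blocking split might look roughly like this. `start_blocking`, `start_no_block`, and `start_gpu_services` are hypothetical stand-ins written directly against `systemctl`; the real systemctlEnableAndStart and systemctlEnableAndStartNoBlock helpers may add retries and timeouts that this sketch leaves out.

```shell
# Hypothetical stand-ins for the blocking and non-blocking start helpers.
start_blocking() {
    systemctl enable "$1" && systemctl start "$1"
}
start_no_block() {
    # --no-block enqueues the start job and returns without waiting for it.
    systemctl enable "$1" && systemctl start --no-block "$1"
}

start_gpu_services() {
    # Gates GPU advertisement to the scheduler: stay blocking and fail hard.
    start_blocking nvidia-device-plugin || return 1

    # Telemetry only: enqueue and continue, logging instead of failing.
    for svc in nvidia-dcgm nvidia-dcgm-exporter; do
        start_no_block "$svc" || echo "warning: could not enqueue $svc" >&2
    done
    return 0
}
```

Because the telemetry services are only enqueued, a zero exit status here says nothing about whether DCGM actually came up, which matches the caveat about the dcgm-exporter label in the risk notes.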

Tests

New shellspec coverage for startNvidiaManagedExpServices: asserts device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and don't fail provisioning. (7/7 GPU-service examples pass.)
go test ./pkg/agent/... — pass.
make generate — no snapshot diffs.

Risk / behavior notes

#1 keeps a full fallback pull when the image is missing, so there is no regression on non-prebaked images. The image-existence check uses fixed-string whole-line matching (grep -qxF) to avoid regex false positives from dots in the image ref.
#2 still removes the image; only the synchronous wait is gone. Worst case is slightly delayed disk reclamation on very tight disks.
#3 makes DCGM/exporter start non-fatal. This is the intended trade-off (telemetry should never block or fail node provisioning). Because systemctlEnableAndStartNoBlock returns before the service actually starts, the dcgm-exporter=enabled node label is now a feature-intent label rather than a "metrics confirmed up" signal. Reviewers who want DCGM failures to remain fatal should flag it.

Coordination

Touches configGPUDrivers(), which the held PR #8612 (prebuild GPU kernel module) also edits — these two will need a light rebase against each other whenever both land. This PR is independent and can merge on its own.

Re-opened from the same repo (was #8613 from a fork; AgentBaker requires same-repo branches for CI/secrets).

… image cleanup, defer DCGM)

Three low-risk CSE-time optimizations for GPU nodes, none of which change the default driver install behavior:

1. Skip the redundant `ctr image pull` in configGPUDrivers() when the driver image is already present locally. The image is normally pre-pulled into the VHD, so the boot-time pull was paying a wasted manifest/layer round trip to MCR; we still pull as a fallback when the image is genuinely missing.
2. Drop `--sync` from the post-install `ctr images rm` so containerd garbage collection happens asynchronously instead of blocking provisioning. The image reference is still removed to reclaim disk.
3. Start nvidia-dcgm and nvidia-dcgm-exporter with systemctlEnableAndStartNoBlock and treat a slow/failed start as non-fatal. These are telemetry only and do not gate GPU workload scheduling. The nvidia-device-plugin start stays blocking and fatal because it gates the node advertising GPUs to the scheduler.

Adds shellspec coverage for startNvidiaManagedExpServices asserting the device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and do not fail provisioning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR optimizes the Ubuntu GPU provisioning path in cse_config.sh by avoiding unnecessary container registry pulls, removing a synchronous containerd GC wait, and moving DCGM telemetry service startup off the CSE critical path (non-fatal, non-blocking).

Changes:

Skip ctr image pull for the NVIDIA driver image when it already exists locally.
Remove --sync from ctr images rm to avoid blocking on containerd garbage collection.
Start nvidia-dcgm and nvidia-dcgm-exporter via systemctlEnableAndStartNoBlock and treat failures as non-fatal; keep nvidia-device-plugin blocking/fatal.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_config.sh | Adds conditional image pull, async image removal, and defers DCGM/DCGM-exporter startup off the provisioning critical path. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec coverage validating device-plugin remains blocking while DCGM services are enqueued non-blocking and are non-fatal on failure. |

Address Copilot review: grep -qx treats the image ref as a regex, so dots in mcr.microsoft.com/... match any char and can yield false positives that skip a genuinely-needed pull. Use -qxF for an exact whole-line match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
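The false positive described in this commit can be reproduced with a small experiment. The image ref and the look-alike line below are hypothetical values chosen for illustration, not real AgentBaker image names.

```shell
# Hypothetical image ref, plus a look-alike line that differs only where the
# ref has dots.
ref='registry.example.com/gpu-driver:1.0'
lookalike='registryXexampleXcom/gpu-driver:1X0'

# -x anchors the whole line, but the pattern is still a regex: '.' matches 'X'.
if printf '%s\n' "$lookalike" | grep -qx "$ref"; then
    regex_match=yes
else
    regex_match=no
fi

# -F treats the pattern as a fixed string, so only the exact ref matches.
if printf '%s\n' "$lookalike" | grep -qxF "$ref"; then
    fixed_match=yes
else
    fixed_match=no
fi

echo "regex match on look-alike: $regex_match"
echo "fixed-string match on look-alike: $fixed_match"
```

With `-qx` the look-alike line is reported as a match, so a pull that is actually needed would be skipped; with `-qxF` it is not.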

Per review feedback: instead of text-matching `ctr images ls -q` output with grep, query containerd directly with its exact-name filter (`images ls -q name==<ref>`) and test for empty output. ctr has no JSON/format output for images ls, but the native filter is the structured equivalent and avoids any text-matching pitfalls entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

+        if [ -z "$(ctr -n k8s.io images ls -q "name==${NVIDIA_DRIVER_IMAGE}:${NVIDIA_DRIVER_IMAGE_TAG}")" ]; then
+            ctr -n k8s.io image pull $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG
+        fi


Copilot AI review requested due to automatic review settings June 1, 2026 17:21

ganeshkumarashok requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners June 1, 2026 17:21

ganeshkumarashok mentioned this pull request Jun 1, 2026

perf: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM) #8613

Closed

Copilot started reviewing on behalf of ganeshkumarashok June 1, 2026 17:21

Copilot AI reviewed Jun 1, 2026

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh Outdated

ganeshkumarashok changed the title ~~perf: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM)~~ refactor: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM) Jun 1, 2026

surajssd reviewed Jun 1, 2026

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh Outdated

Copilot AI review requested due to automatic review settings June 1, 2026 18:27

Copilot started reviewing on behalf of ganeshkumarashok June 1, 2026 18:27

Copilot AI reviewed Jun 1, 2026

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh

Comment on lines +1009 to +1011

        if [ -z "$(ctr -n k8s.io images ls -q "name==${NVIDIA_DRIVER_IMAGE}:${NVIDIA_DRIVER_IMAGE_TAG}")" ]; then
            ctr -n k8s.io image pull $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG
        fi

ganeshkumarashok wants to merge 3 commits into main from gpu-provisioning-boot-path


3 participants
