refactor: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM) #8615

Open
ganeshkumarashok wants to merge 3 commits into main from gpu-provisioning-boot-path


@ganeshkumarashok ganeshkumarashok commented Jun 1, 2026

Summary

Three small, independent, low-risk optimizations that remove redundant waits from the GPU node provisioning critical path on Ubuntu. None of these touch the dominant cost — the per-node DKMS driver compile + cp -a /opt/gpu inside configGPUDrivers, which is ~95% of GPU CSE time. The intent here is hygiene and tail-latency protection, not the headline speedup. The large win (moving the driver build off the boot path / pre-laying the driver tree on the VHD) is tracked separately.

Expected impact: low single-digit seconds off the median for a default GPU node, plus meaningful p99/throttling-tail protection. The DCGM change (#3) only affects the managed-GPU experience path.

1. Skip the redundant driver image pull (configGPUDrivers)

The aks-gpu-cuda image is pre-fetched into the VHD at build time (install-dependencies.sh → image-fetcher), so it is already present locally at boot. configGPUDrivers() nonetheless ran ctr image pull unconditionally, which cost a wasted manifest/layer round trip to MCR and exposed provisioning to MCR throttling. Now we pull only when the image is genuinely absent locally; otherwise we go straight to ctr run. Median savings are small; the real value is avoiding a stall of several seconds to minutes when MCR is throttling.
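The check-then-pull flow described above can be sketched roughly as follows. This is a minimal illustration rather than the code in cse_config.sh: the helper name `ensure_driver_image` is hypothetical, and it assumes `ctr` is on the PATH and that the image lives in the `k8s.io` namespace.

```shell
# Sketch only: ensure_driver_image is a hypothetical helper name,
# not a function from cse_config.sh.
ensure_driver_image() {
    local ref="$1"

    # Ask containerd for an exact-name match. Empty output means the
    # image is not present locally.
    if [ -z "$(ctr -n k8s.io images ls -q "name==${ref}")" ]; then
        # Fallback path: the image is missing (for example on a
        # non-prebaked VHD), so pay for the pull.
        ctr -n k8s.io image pull "${ref}"
    fi
}

# Example call (variable names taken from the PR diff):
#   ensure_driver_image "${NVIDIA_DRIVER_IMAGE}:${NVIDIA_DRIVER_IMAGE_TAG}"
```

The function returns the exit status of the pull when one is needed, so a caller that treats a failed pull as fatal can keep doing so.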

2. Async image cleanup (drop --sync)

The post-install ctr images rm --sync blocked CSE waiting for containerd GC to finish before returning. Dropping --sync removes the image reference immediately and lets GC reclaim space asynchronously — same disk outcome, no blocking. Small, on-path saving.
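As a rough before/after illustration of this change: the helper name `remove_driver_image` is hypothetical, and the sketch assumes the same `k8s.io` namespace used elsewhere in this PR.

```shell
# Sketch only: remove_driver_image is a hypothetical helper name.
remove_driver_image() {
    local ref="$1"

    # Before: `ctr -n k8s.io images rm --sync "${ref}"` returned only
    # after containerd had finished garbage collection.
    # After: drop the image reference and return immediately; containerd
    # reclaims the unreferenced content on a later GC pass.
    ctr -n k8s.io images rm "${ref}"
}
```

The disk outcome is the same in both cases; only the point at which the caller regains control differs.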

3. Defer DCGM telemetry off the critical path

nvidia-dcgm and nvidia-dcgm-exporter are monitoring only and don't gate GPU workload scheduling, yet they were started with the blocking systemctlEnableAndStart and hard-exited CSE on a slow/failed start (up to a 30s timeout each). They now use systemctlEnableAndStartNoBlock and are non-fatal. The nvidia-device-plugin start stays blocking and fatal because it gates the node advertising GPUs to the scheduler. Note: this path only runs when the managed-GPU experience is enabled, so it does not affect default GPU nodes today.
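The blocking versus non-blocking split can be sketched with plain systemctl calls. This is an approximation under stated assumptions: the real code uses the AgentBaker helpers systemctlEnableAndStart and systemctlEnableAndStartNoBlock, whose signatures and retry behavior are not reproduced here, and `start_gpu_services` is a hypothetical name.

```shell
# Sketch only: plain systemctl stands in for the AgentBaker helpers,
# and start_gpu_services is a hypothetical function name.
start_gpu_services() {
    local svc

    # The device plugin gates GPU advertisement to the scheduler,
    # so wait for it and fail hard if it does not start.
    systemctl enable nvidia-device-plugin || return 1
    systemctl start nvidia-device-plugin || return 1

    # Telemetry only: enqueue the start job with --no-block and move on.
    # A failure here is logged but never fails provisioning.
    for svc in nvidia-dcgm nvidia-dcgm-exporter; do
        if ! systemctl enable "${svc}" || ! systemctl start --no-block "${svc}"; then
            echo "warning: could not enqueue ${svc}; continuing" >&2
        fi
    done
    return 0
}
```

Because `--no-block` returns as soon as the job is queued, a zero exit status here says nothing about whether the DCGM services actually came up.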

Tests

  • New shellspec coverage for startNvidiaManagedExpServices: asserts device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and don't fail provisioning. (7/7 GPU-service examples pass.)
  • go test ./pkg/agent/... — pass.
  • make generate — no snapshot diffs.

Risk / behavior notes

  • #1 keeps a full fallback pull when the image is missing, so there is no regression on non-prebaked images. The image-existence check uses fixed-string whole-line matching (grep -qxF) to avoid regex false positives from dots in the image ref.
  • #2 still removes the image; only the synchronous wait is gone. Worst case is slightly delayed disk reclamation on very tight disks.
  • #3 makes DCGM/exporter start non-fatal. This is the intended trade-off (telemetry should never block or fail node provisioning). Because systemctlEnableAndStartNoBlock returns before the service actually starts, the dcgm-exporter=enabled node label is now a feature-intent label rather than a "metrics confirmed up" signal. Reviewers who want DCGM failures to remain fatal should flag it.
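The regex false positive mentioned for the image-existence check is easy to reproduce. The snippet below is a small demonstration with made-up image refs (both are hypothetical): it compares `grep -qx` and `grep -qxF` against a line that differs from the wanted ref only where a dot would be.

```shell
# Hypothetical image refs, for illustration only.
wanted='mcr.microsoft.com/example/driver:1.0'
listed='mcrXmicrosoft.com/example/driver:1.0'   # differs at the first dot

# -x alone: the pattern is a regex, so "." matches "X" and the check
# wrongly reports the image as present.
if printf '%s\n' "$listed" | grep -qx "$wanted"; then
    regex_match=yes
else
    regex_match=no
fi

# -xF: fixed-string, whole-line comparison; the lookalike no longer matches.
if printf '%s\n' "$listed" | grep -qxF "$wanted"; then
    fixed_match=yes
else
    fixed_match=no
fi

echo "regex: ${regex_match}, fixed: ${fixed_match}"
# prints "regex: yes, fixed: no"
```

A false "present" result matters here because it would skip a pull that is actually needed.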

Coordination

Touches configGPUDrivers(), which the held PR #8612 (prebuild GPU kernel module) also edits — these two will need a light rebase against each other whenever both land. This PR is independent and can merge on its own.


Re-opened from the same repo (was #8613 from a fork; AgentBaker requires same-repo branches for CI/secrets).

… image cleanup, defer DCGM)

Three low-risk CSE-time optimizations for GPU nodes, none of which change the
default driver install behavior:

1. Skip the redundant `ctr image pull` in configGPUDrivers() when the driver
   image is already present locally. The image is normally pre-pulled into the
   VHD, so the boot-time pull was paying a wasted manifest/layer round trip to
   MCR; we still pull as a fallback when the image is genuinely missing.

2. Drop `--sync` from the post-install `ctr images rm` so containerd garbage
   collection happens asynchronously instead of blocking provisioning. The
   image reference is still removed to reclaim disk.

3. Start nvidia-dcgm and nvidia-dcgm-exporter with
   systemctlEnableAndStartNoBlock and treat a slow/failed start as non-fatal.
   These are telemetry only and do not gate GPU workload scheduling. The
   nvidia-device-plugin start stays blocking and fatal because it gates the
   node advertising GPUs to the scheduler.

Adds shellspec coverage for startNvidiaManagedExpServices asserting the
device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the
critical path and do not fail provisioning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

This PR optimizes the Ubuntu GPU provisioning path in cse_config.sh by avoiding unnecessary container registry pulls, removing a synchronous containerd GC wait, and moving DCGM telemetry service startup off the CSE critical path (non-fatal, non-blocking).

Changes:

  • Skip ctr image pull for the NVIDIA driver image when it already exists locally.
  • Remove --sync from ctr images rm to avoid blocking on containerd garbage collection.
  • Start nvidia-dcgm and nvidia-dcgm-exporter via systemctlEnableAndStartNoBlock and treat failures as non-fatal; keep nvidia-device-plugin blocking/fatal.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_config.sh | Adds conditional image pull, async image removal, and defers DCGM/DCGM-exporter startup off the provisioning critical path. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec coverage validating device-plugin remains blocking while DCGM services are enqueued non-blocking and are non-fatal on failure. |

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh Outdated
Address Copilot review: grep -qx treats the image ref as a regex, so dots
in mcr.microsoft.com/... match any char and can yield false positives that
skip a genuinely-needed pull. Use -qxF for an exact whole-line match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ganeshkumarashok changed the title from "perf: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM)" to "refactor: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM)" on Jun 1, 2026
Comment thread parts/linux/cloud-init/artifacts/cse_config.sh Outdated
Per review feedback: instead of text-matching `ctr images ls -q` output with
grep, query containerd directly with its exact-name filter
(`images ls -q name==<ref>`) and test for empty output. ctr has no JSON/format
output for images ls, but the native filter is the structured equivalent and
avoids any text-matching pitfalls entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 1, 2026 18:27

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines +1009 to +1011
if [ -z "$(ctr -n k8s.io images ls -q "name==${NVIDIA_DRIVER_IMAGE}:${NVIDIA_DRIVER_IMAGE_TAG}")" ]; then
    ctr -n k8s.io image pull $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG
fi