Bake nouveau blacklist into Ubuntu VHD to cut GPU node boot time #8614

Draft
ganeshkumarashok wants to merge 1 commit into Azure:main from ganeshkumarashok:gpu-bake-nouveau-vhd

@ganeshkumarashok

What

Bake the nouveau blacklist + initramfs into the shared Ubuntu amd64 VHD at build time and write a kernel-gated marker, so aks-gpu/install.sh can skip its per-boot update-initramfs -u (~10-30s on GPU node boot).

Added to the existing Ubuntu amd64 GPU block in install-dependencies.sh (which already pre-pulls the aks-gpu-cuda image):

  • write /etc/modprobe.d/blacklist-nouveau.conf (same 2 lines aks-gpu installs),
  • update-initramfs -u,
  • write /opt/azure/aks-gpu/nouveau-blacklist-marker containing kernel=$(uname -r).
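
The three build-time steps above could look roughly like the following. This is a minimal sketch, not the code in install-dependencies.sh: the function name `bake_nouveau_blacklist` and the `ROOT` and `UPDATE_INITRAMFS_CMD` overrides are hypothetical (added so the sketch can be exercised without touching a real system), and the blacklist content is assumed to be the conventional two-line nouveau blacklist.

```shell
# Hypothetical helper illustrating the build-time steps; not the PR's actual code.
bake_nouveau_blacklist() {
  # ROOT is an illustrative override; empty means the real filesystem.
  local root="${ROOT:-}"
  local conf="${root}/etc/modprobe.d/blacklist-nouveau.conf"
  local marker="${root}/opt/azure/aks-gpu/nouveau-blacklist-marker"

  mkdir -p "${conf%/*}" "${marker%/*}" || return 1

  # Assumed content: the conventional two-line nouveau blacklist.
  printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$conf" || return 1

  # Rebuild the initramfs so the blacklist is in effect from first boot.
  # UPDATE_INITRAMFS_CMD is an illustrative override for dry runs.
  "${UPDATE_INITRAMFS_CMD:-update-initramfs}" -u || return 1

  # Write the marker last, so it exists only if the rebuild succeeded.
  printf 'kernel=%s\n' "$(uname -r)" > "$marker"
}
```

Writing the marker after the rebuild means a failed `update-initramfs -u` leaves no marker behind, so a boot-time consumer would fall back to rebuilding.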

This block runs after the kernel purge, reinstall, and reboot into the final shipped kernel (logged near the top of the script), so uname -r reports the kernel the node will boot and the baked initramfs matches it.

Why

Reduces GPU node provisioning time: it removes a deterministic per-boot cost and ensures nouveau is blacklisted from first boot.

Safety on the shared VHD

The Ubuntu VHD is shared with non-GPU nodes. Blacklisting nouveau is safe there: AKS Ubuntu node images have no functional dependency on nouveau, while GPU nodes require it to be disabled before the proprietary driver loads. This mirrors the existing NVIDIA GB VHD path, which already bakes the same blacklist + initramfs.

Cross-repo dependency

The boot-time skip lives in Azure/aks-gpu PR #161. The skip triggers only when the marker kernel matches the running kernel AND the on-disk blacklist content matches the image's copy. Old aks-gpu images simply ignore the marker and rebuild as before. This PR stays in draft pending a VHD build validation.
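
The two-condition gate described here might be sketched as follows. This is an illustration only and not the code in Azure/aks-gpu PR #161: the function name `can_skip_initramfs_rebuild` and its argument order are hypothetical, and the marker is assumed to hold a single `kernel=<version>` line.

```shell
# Hypothetical gate; returns 0 only when the per-boot rebuild can be skipped.
can_skip_initramfs_rebuild() {
  local marker="$1"       # e.g. /opt/azure/aks-gpu/nouveau-blacklist-marker
  local on_disk="$2"      # e.g. /etc/modprobe.d/blacklist-nouveau.conf
  local image_copy="$3"   # e.g. /opt/gpu/blacklist-nouveau.conf
  local kernel="${4:-$(uname -r)}"

  # Condition 1: the marker exists and records exactly the running kernel.
  # -x matches the whole line, so a kernel that merely shares a prefix fails.
  [ -f "$marker" ] || return 1
  grep -qxF "kernel=${kernel}" "$marker" || return 1

  # Condition 2: the baked blacklist is byte-identical to the image's copy.
  cmp -s "$on_disk" "$image_copy"
}
```

A caller would run `update-initramfs -u` whenever the function returns non-zero, so a missing marker, a kernel change, or any drift in the blacklist content falls back to the existing rebuild path.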

Validation

  • bash -n passes; shellcheck reports no new findings on the added lines.
  • Confirmed the baked blacklist-nouveau.conf is byte-identical (44 bytes) to aks-gpu's /opt/gpu/blacklist-nouveau.conf so the cmp fast-path gate engages.
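
The byte-identity check can be reproduced with something like the sketch below. The helper name `blacklists_match` is hypothetical, and the sample content in the usage note is an assumption: it is the conventional two-line nouveau blacklist, which happens to be 44 bytes.

```shell
# Hypothetical check: reports whether two blacklist files are byte-identical.
blacklists_match() {
  local baked="$1" reference="$2"
  local size
  # wc -c pads its output on some platforms, so strip the whitespace.
  size="$(wc -c < "$baked" | tr -d '[:space:]')"
  if cmp -s "$baked" "$reference"; then
    echo "identical (${size} bytes)"
  else
    echo "differ"
    return 1
  fi
}
```

For example, after writing `printf 'blacklist nouveau\noptions nouveau modeset=0\n'` to two files, `blacklists_match` prints `identical (44 bytes)`. Even a single trailing space in one copy makes `cmp` report a difference, which is why the check is byte-exact.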

GPU nodes pay a ~10-30s `update-initramfs -u` at every boot when aks-gpu's
install.sh blacklists nouveau. Bake the blacklist + initramfs into the shared
Ubuntu amd64 VHD at build time (after the final kernel is in place) and write a
kernel-gated marker so aks-gpu/install.sh can skip the per-boot rebuild. Safe on
the shared VHD: AKS Ubuntu node images have no functional dependency on nouveau,
and GPU nodes require it disabled before the proprietary driver loads. Mirrors
the existing NVIDIA GB VHD path. Pairs with the matching aks-gpu change; older
images simply ignore the marker and rebuild as before.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

This PR updates the Ubuntu amd64 VHD build path to pre-bake the nouveau blacklist into the VHD’s initramfs and write a kernel-scoped marker file, enabling GPU node boot-time logic (in aks-gpu) to skip a costly per-boot update-initramfs -u.

Changes:

  • Write /etc/modprobe.d/blacklist-nouveau.conf on Ubuntu amd64 during VHD build (same blacklist content used by aks-gpu).
  • Run update-initramfs -u at build time so the blacklist is present from first boot.
  • Write /opt/azure/aks-gpu/nouveau-blacklist-marker containing kernel=$(uname -r) and add a VHD build log entry.
