Bake nouveau blacklist into Ubuntu VHD to cut GPU node boot time #8614
Draft
ganeshkumarashok wants to merge 1 commit into
GPU nodes pay a ~10-30s `update-initramfs -u` at every boot when aks-gpu's install.sh blacklists nouveau. Bake the blacklist + initramfs into the shared Ubuntu amd64 VHD at build time (after the final kernel is in place) and write a kernel-gated marker so aks-gpu/install.sh can skip the per-boot rebuild. Safe on the shared VHD: AKS Ubuntu node images have no functional dependency on nouveau, and GPU nodes require it disabled before the proprietary driver loads. Mirrors the existing NVIDIA GB VHD path. Pairs with the matching aks-gpu change; older images simply ignore the marker and rebuild as before.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the Ubuntu amd64 VHD build path to pre-bake the nouveau blacklist into the VHD’s initramfs and write a kernel-scoped marker file, enabling GPU node boot-time logic (in aks-gpu) to skip a costly per-boot update-initramfs -u.
Changes:
- Write `/etc/modprobe.d/blacklist-nouveau.conf` on Ubuntu amd64 during VHD build (same blacklist content used by aks-gpu).
- Run `update-initramfs -u` at build time so the blacklist is present from first boot.
- Write `/opt/azure/aks-gpu/nouveau-blacklist-marker` containing `kernel=$(uname -r)` and add a VHD build log entry.
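The three build-time steps listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function name `bake_nouveau_blacklist`, its arguments, and the injectable rebuild command are hypothetical and exist only so the sketch can run against a scratch directory. The two-line blacklist content is an assumption (the conventional nouveau blacklist, which happens to be 44 bytes); the PR only says it matches what aks-gpu installs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: the name and arguments are illustrative, not the PR's code.
#   $1  root directory ("/" on a real VHD build, a scratch dir when experimenting)
#   $2  kernel version to record (defaults to the running kernel)
#   $3  initramfs rebuild command (defaults to `update-initramfs -u`)
bake_nouveau_blacklist() {
  local root="${1:-/}"
  local kernel="${2:-$(uname -r)}"
  local rebuild_cmd="${3:-update-initramfs -u}"

  local conf="${root%/}/etc/modprobe.d/blacklist-nouveau.conf"
  local marker="${root%/}/opt/azure/aks-gpu/nouveau-blacklist-marker"

  mkdir -p "$(dirname "$conf")" "$(dirname "$marker")"

  # ASSUMPTION: conventional two-line nouveau blacklist; the real content
  # must be byte-identical to the copy shipped by aks-gpu.
  printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$conf"

  # Rebuild the initramfs so the blacklist is in effect from first boot.
  # Intentionally unquoted so the command and its flags are split into words.
  $rebuild_cmd

  # Record the kernel the baked initramfs was built for.
  printf 'kernel=%s\n' "$kernel" > "$marker"
}
```

On a build host this would be invoked as `bake_nouveau_blacklist /` with root privileges, after the final kernel is running; passing a scratch directory and `true` as the rebuild command allows a dry run without touching the system.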
What
Bake the nouveau blacklist + initramfs into the shared Ubuntu amd64 VHD at build time and write a kernel-gated marker, so `aks-gpu/install.sh` can skip its per-boot `update-initramfs -u` (~10-30s on GPU node boot).
Added to the existing Ubuntu amd64 GPU block in `install-dependencies.sh` (which already pre-pulls the aks-gpu-cuda image):
- `/etc/modprobe.d/blacklist-nouveau.conf` (same 2 lines aks-gpu installs)
- `update-initramfs -u`
- `/opt/azure/aks-gpu/nouveau-blacklist-marker` containing `kernel=$(uname -r)`
Runs after the kernel purge/reinstall/reboot to the final shipped kernel (logged near the top of the script), so `uname -r` is the node's boot kernel and the baked initramfs matches it.
Why
GPU provisioning-time reduction. Removes a deterministic per-boot cost and makes nouveau blacklisted from first boot.
Safety on the shared VHD
The Ubuntu VHD is shared with non-GPU nodes. Blacklisting nouveau is safe there: AKS Ubuntu node images have no functional dependency on nouveau, while GPU nodes require it disabled before the proprietary driver loads. This mirrors the existing NVIDIA GB VHD path, which already bakes the same blacklist + initramfs.
Cross-repo dependency
The boot-time skip lives in Azure/aks-gpu PR #161. The skip only triggers when the marker kernel matches AND the on-disk blacklist content matches the image's copy, so old aks-gpu images simply ignore the marker and rebuild as before. Draft pending a VHD build validation.
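To illustrate the two-part gate described above (marker kernel matches AND blacklist content matches), here is a minimal sketch of what a boot-time check might look like. It is not the code from the aks-gpu PR: the function name `can_skip_initramfs_rebuild` and its argument order are hypothetical, and the real implementation may differ in how it parses the marker.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: returns 0 when the per-boot initramfs rebuild can be skipped.
#   $1  path to the marker written at VHD build time
#   $2  blacklist file baked into the VHD
#   $3  blacklist file shipped inside the aks-gpu image
#   $4  running kernel (defaults to `uname -r`)
can_skip_initramfs_rebuild() {
  local marker="$1" on_disk_conf="$2" image_conf="$3"
  local running_kernel="${4:-$(uname -r)}"

  # No marker (for example, a VHD built before this change): rebuild as before.
  [ -f "$marker" ] || return 1

  # The marker must name exactly the kernel that is running now.
  grep -qxF "kernel=${running_kernel}" "$marker" || return 1

  # The baked blacklist must be byte-identical to the image's copy.
  cmp -s "$on_disk_conf" "$image_conf"
}
```

A caller would fall back to the existing behavior whenever the check fails, for example `can_skip_initramfs_rebuild "$marker" "$baked" "$shipped" || update-initramfs -u`, so any mismatch costs only the rebuild time that nodes already pay today.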
Validation
`bash -n` clean; `shellcheck` introduces no new findings on the added lines. `blacklist-nouveau.conf` is byte-identical (44 bytes) to aks-gpu's `/opt/gpu/blacklist-nouveau.conf` so the `cmp` fast-path gate engages.