Skip to content

Comments

Enable GPU operator to install GRID driver on Azure NV instances#6

Open
vinser52 wants to merge 10 commits intodatadogfrom
sergei.vinogradov/nvidia-grid-driver
Open

Enable GPU operator to install GRID driver on Azure NV instances#6
vinser52 wants to merge 10 commits intodatadogfrom
sergei.vinogradov/nvidia-grid-driver

Conversation

@vinser52
Copy link

@vinser52 vinser52 commented Feb 20, 2026

Description

Extend GPU operator to install Azure GRID driver on the corresponding Azure NV instances.

Testing

  • On arbok.us3.staging.dog cluster to make sure NV instances can be managed by the GPU operator
  • On charcadet.us1.staging.dog to make sure it doesn't break existing functionality of the GPU operator

Breadcrumbs

Jira: NODES-487

COPY ubuntu22.04/precompiled/nvidia-driver /opt/nvidia-driver/bin/nvidia-driver
COPY nvidia-driver-wrapper.sh /usr/local/bin/nvidia-driver

ADD download_azure_grid_driver.sh /tmp

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but for consistency reasons could you use COPY (also it is officially recommended to use COPY)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

DEP_PACKAGES=$(apt-rdepends $BASE_PACKAGES_NAMES | grep -v "^ " | grep -v "^debconf-2.0$" | grep -v "^linux-image-unsigned-") && \
apt-get install -y --download-only --no-install-recommends --reinstall $BASE_PACKAGES $DEP_PACKAGES

# Remove cuda repository before downloading dkms to avoid version conflicts

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you gather all the build required steps in a single block and make them only run on Azure?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

echo "Available versions: $AVAILABLE_VERSIONS"
}

get_grid_azure_url() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to support all those versions, especially since they are hardcoded anyway. Only keeping 1 (the latest) per driver branch would shorten the script a little bit

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

NVIDIA_UVM_MODULE_PARAMS=()
NVIDIA_MODESET_MODULE_PARAMS=()
NVIDIA_PEERMEM_MODULE_PARAMS=()
TARGETARCH=${TARGETARCH:?"Missing TARGETARCH env"}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more or less the upstream nvidia-driver script. For the sake of keeping it easy to rebase, could you please reduce to a bare minimum (a line of script import) all changes that are related to Azure specificities and put everything you add in a separate script?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we introduce a separate script file we need to copy it into docker image. If we only update the precompiled.Dockerfile to copy the new script file with GRID-related functions, the ubuntu22.04/Dockerfile and the ubuntu22.04/precompiled/Dockerfile will be invalid. Is it ok?

Copy link
Author

@vinser52 vinser52 Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach could be to modify the nvidia-driver-wrapper.sh and make a check if GRID driver is required there. Something like:

// nvidia-driver-wrapper.sh
...

if _is_grid_driver_required; then
      /opt/nvidia-driver/bin/nvidia-grid-driver $@
  else
      /opt/nvidia-driver/bin/nvidia-driver $@
  fi
...

In that case /opt/nvidia-driver/bin/nvidia-grid-driver will source functions from the /opt/nvidia-driver/bin/nvidia-driver. But it might be overcomplicated because the /opt/nvidia-driver/bin/nvidia-grid-driver will duplicate some functionality of the /opt/nvidia-driver/bin/nvidia-driver

exit 1
fi

# Updating gridd.conf

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a link to the doc here because it's not obvious why we are doing this here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# CUDA repo has dkms 1:3.3.0 but Ubuntu has 2.8.7 - we need Ubuntu version for runtime
# Note: We remove repo files but don't run apt-get update to preserve package cache
# for runtime installation of precompiled driver packages
RUN rm -f /etc/apt/sources.list.d/cuda*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know removing the /etc/apt/sources.list.d/cuda* file has been an issue in some cases where we could not find some packages. Can you try doing apt install nvlsm for instance?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this line at all. looks like it works with dkms 3.3.0 from cuda repo.

@vinser52 vinser52 requested a review from just1not2 February 23, 2026 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants