Enable GPU operator to install GRID driver on Azure NV instances #6
precompiled.Dockerfile
Outdated
COPY ubuntu22.04/precompiled/nvidia-driver /opt/nvidia-driver/bin/nvidia-driver
COPY nvidia-driver-wrapper.sh /usr/local/bin/nvidia-driver

ADD download_azure_grid_driver.sh /tmp
Nit: for consistency, could you use COPY here instead of ADD? COPY is also what Docker officially recommends for copying local files.
precompiled.Dockerfile
Outdated
DEP_PACKAGES=$(apt-rdepends $BASE_PACKAGES_NAMES | grep -v "^ " | grep -v "^debconf-2.0$" | grep -v "^linux-image-unsigned-") && \
apt-get install -y --download-only --no-install-recommends --reinstall $BASE_PACKAGES $DEP_PACKAGES

# Remove cuda repository before downloading dkms to avoid version conflicts
Could you gather all the steps required for the build into a single block and make them run only on Azure?
    echo "Available versions: $AVAILABLE_VERSIONS"
}

get_grid_azure_url() {
I don't think we need to support all those versions, especially since they are hardcoded anyway. Keeping only one (the latest) per driver branch would shorten the script a bit.
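Keeping a single pinned version per driver branch could look roughly like the sketch below. The function names, branch labels, version strings, and the `example.invalid` URL are all placeholders chosen for illustration; they are not the values used in this PR or real Azure GRID driver locations.

```shell
# Sketch only: branch labels, versions and the URL are placeholders.

# Print the single pinned version for a driver branch, or fail if unknown.
latest_grid_version_for_branch() {
    case "$1" in
        branch-a) echo "1.2.3" ;;   # placeholder version
        branch-b) echo "4.5.6" ;;   # placeholder version
        *)
            echo "Unsupported driver branch: $1" >&2
            return 1
            ;;
    esac
}

# Build the download URL for a branch from its pinned version.
grid_url_for_branch() {
    local version
    version=$(latest_grid_version_for_branch "$1") || return 1
    echo "https://example.invalid/grid/${version}/driver.run"
}
```

With one entry per branch, adding a new driver release means changing a single line, and an unknown branch fails loudly instead of falling back silently.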
NVIDIA_UVM_MODULE_PARAMS=()
NVIDIA_MODESET_MODULE_PARAMS=()
NVIDIA_PEERMEM_MODULE_PARAMS=()
TARGETARCH=${TARGETARCH:?"Missing TARGETARCH env"}
This is more or less the upstream nvidia-driver script. To keep it easy to rebase, could you please reduce all Azure-specific changes to a bare minimum (a single line importing a script) and put everything you add in a separate script?
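One way to keep the change to the upstream script that small is a guarded `source` of an optional helper file. This is a minimal sketch: the function name `load_azure_grid_lib`, the `AZURE_GRID_LIB` variable, and the default path are hypothetical, not taken from the PR.

```shell
# Sketch only: the helper path and names are hypothetical.
# Source the Azure helpers if the file exists; do nothing otherwise,
# so images that do not ship the file keep working unchanged.
load_azure_grid_lib() {
    local lib="${AZURE_GRID_LIB:-/opt/nvidia-driver/bin/azure-grid.sh}"
    if [ -f "$lib" ]; then
        # shellcheck disable=SC1090
        . "$lib"
    fi
}
```

The upstream script would then only need a single `load_azure_grid_lib` call (or the equivalent guarded one-liner), and every Azure-specific function would live in the separate file.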
If we introduce a separate script file, we need to copy it into the Docker image. If we only update precompiled.Dockerfile to copy the new script with the GRID-related functions, ubuntu22.04/Dockerfile and ubuntu22.04/precompiled/Dockerfile will become invalid. Is that OK?
Another approach could be to modify nvidia-driver-wrapper.sh and check there whether the GRID driver is required. Something like:
# nvidia-driver-wrapper.sh
...
if _is_grid_driver_required; then
    /opt/nvidia-driver/bin/nvidia-grid-driver "$@"
else
    /opt/nvidia-driver/bin/nvidia-driver "$@"
fi
...
In that case, /opt/nvidia-driver/bin/nvidia-grid-driver would source functions from /opt/nvidia-driver/bin/nvidia-driver. But that might be overcomplicated, because /opt/nvidia-driver/bin/nvidia-grid-driver would duplicate some functionality of /opt/nvidia-driver/bin/nvidia-driver.
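A slightly fuller sketch of that dispatch is shown here. The body of `_is_grid_driver_required` is a placeholder (the real condition is not shown in this thread), and the `GRID_DRIVER_REQUIRED` variable and `select_driver_entrypoint` helper are hypothetical names.

```shell
# Sketch only: the detection logic below is a placeholder.
# The real check (for example, detecting an Azure NV instance) is not shown in the thread.
_is_grid_driver_required() {
    [ "${GRID_DRIVER_REQUIRED:-false}" = "true" ]
}

# Print the entrypoint the wrapper should hand over to.
select_driver_entrypoint() {
    if _is_grid_driver_required; then
        echo /opt/nvidia-driver/bin/nvidia-grid-driver
    else
        echo /opt/nvidia-driver/bin/nvidia-driver
    fi
}

# In the wrapper, the final line would replace the shell with the chosen
# script and pass the original arguments through unchanged:
#   exec "$(select_driver_entrypoint)" "$@"
```

Quoting `"$@"` keeps arguments that contain spaces intact, and `exec` avoids leaving an extra shell process as the parent of the driver script.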
    exit 1
fi

# Updating gridd.conf
Maybe add a link to the doc here, because it's not obvious why we are doing this.
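For context, gridd.conf holds `Key=Value` lines, so an update step typically replaces an existing key or appends a missing one. The helper below is a minimal sketch of such an idempotent update; the name `set_gridd_option` is hypothetical, and which options this PR actually sets is not shown here.

```shell
# Sketch only: set KEY=VALUE in a gridd.conf-style file.
# Replaces an existing KEY= line, or appends one if the key is absent.
# Assumes simple values that contain no '|' or '&' characters.
set_gridd_option() {
    local file="$1" key="$2" value="$3"
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
            mv "${file}.tmp" "$file"
    else
        echo "${key}=${value}" >> "$file"
    fi
}
```

Calling it twice with the same arguments leaves a single line for that key, which matters when the driver container restarts and runs the same setup again.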
precompiled.Dockerfile
Outdated
# CUDA repo has dkms 1:3.3.0 but Ubuntu has 2.8.7 - we need Ubuntu version for runtime
# Note: We remove repo files but don't run apt-get update to preserve package cache
# for runtime installation of precompiled driver packages
RUN rm -f /etc/apt/sources.list.d/cuda*
I know that removing the /etc/apt/sources.list.d/cuda* files has been an issue in some cases where we could not find some packages. Can you try running apt install nvlsm, for instance?
I removed this line entirely. It looks like it works with dkms 3.3.0 from the CUDA repo.
Description
Extend the GPU operator to install the Azure GRID driver on the corresponding Azure NV instances.
Testing
Breadcrumbs
Jira: NODES-487