Description
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Fedora CoreOS 34
- Kernel Version: 5.14.14-200
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OCP (OKD, actually)
- GPU Operator Version: 24.3.0
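Most of the node-level details above can be read straight from the cluster; a minimal sketch, assuming `kubectl` access (the operator pod name is taken from the pod listing further below):

```
# OS image, kernel version and container runtime of each node
kubectl get nodes -o wide

# GPU Operator version, via the operator pod's image tag
kubectl get pod -n OPERATOR_NAMESPACE gpu-operator-6bf994488-rrlfg \
  -o jsonpath='{.spec.containers[0].image}'
```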
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
Every time the nvidia-driver-daemonset pod restarts, it installs the driver all over again, even when the kernel modules are already loaded.
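One way to confirm that the kernel modules really are still loaded when the pod restarts, as a sketch (assumes a shell on the GPU node, e.g. via `oc debug node/<node-name>`):

```
# List the NVIDIA kernel modules currently loaded on the node
lsmod | grep -E '^nvidia'
```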
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
Just kill the `nvidia-driver-daemonset` pod and it will trigger a full driver reinstall (a sketch of the command follows).
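A sketch of that step, using the pod name from the listing below; the DaemonSet recreates the pod immediately, and the new pod runs the full install again:

```
# Delete the current driver pod (names/namespace as in this report)
kubectl delete pod -n OPERATOR_NAMESPACE nvidia-driver-daemonset-cmzkt

# Watch the replacement pod start up and re-run the driver install
kubectl get pods -n OPERATOR_NAMESPACE -w
```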
4. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE`

```
NAME                                        READY   STATUS      RESTARTS   AGE
console-plugin-nvidia-gpu-bcc995d4d-rz6n8   1/1     Running     0          21h
gpu-feature-discovery-dpk9r                 2/2     Running     0          4m21s
gpu-operator-6bf994488-rrlfg                1/1     Running     0          21h
nvidia-container-toolkit-daemonset-47qsz    1/1     Running     0          4m22s
nvidia-cuda-validator-9lmpr                 0/1     Completed   0          62s
nvidia-dcgm-exporter-dgm5z                  1/1     Running     0          4m21s
nvidia-device-plugin-daemonset-sbrrz        2/2     Running     0          4m21s
nvidia-driver-daemonset-cmzkt               1/1     Running     0          5m
nvidia-operator-validator-p78gj             1/1     Running     0          4m22s
```
- kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`

```
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                           AGE
gpu-feature-discovery                     1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                        57d
nvidia-container-toolkit-daemonset        1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                            38d
nvidia-dcgm-exporter                      1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                57d
nvidia-device-plugin-daemonset            1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                57d
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true    57d
nvidia-driver-daemonset                   1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                                       38d
nvidia-mig-manager                        0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                  57d
nvidia-operator-validator                 1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true                           57d
```
- If a pod/ds is in an error state or pending state: `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
- If a pod/ds is in an error state or pending state: `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
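For this issue specifically, the driver container log of the replacement pod is probably the most useful one; a sketch (pod name and namespace are the ones used elsewhere in this report):

```
# Current driver container log (shows the reinstall being performed)
kubectl logs -n OPERATOR_NAMESPACE nvidia-driver-daemonset-cmzkt -c nvidia-driver-ctr

# Previous instance, if the container itself restarted
kubectl logs -n OPERATOR_NAMESPACE nvidia-driver-daemonset-cmzkt -c nvidia-driver-ctr --previous
```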
- Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:13:00.0 Off | 0 |
| N/A 52C P0 29W / 70W | 14226MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2087110 C python3 7112MiB |
| 0 N/A N/A 2088265 C python3 7112MiB |
+-----------------------------------------------------------------------------------------+
```
- containerd logs: `journalctl -u containerd > containerd.log`
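Since this cluster runs CRI-O rather than containerd, the equivalent collection (assuming the standard `crio` systemd unit) would be:

```
journalctl -u crio > crio.log
```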
Collecting full debug bundle (optional):

```
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
```
NOTE: please refer to the must-gather script for the debug data it collects.
This bundle can be submitted to us via email: operator_feedback@nvidia.com