Description
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Fedora CoreOS 34
- Kernel Version: 5.14.14-200
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OCP (OKD, actually)
- GPU Operator Version: 24.3.0
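Most of the node-level details above can be read straight from the cluster; a minimal sketch, assuming `kubectl` access (the operator pod name is taken from the pod listing further below):

```
# OS image, kernel version and container runtime of each node
kubectl get nodes -o wide

# GPU Operator version, via the operator pod's image tag
kubectl get pod -n OPERATOR_NAMESPACE gpu-operator-6bf994488-rrlfg \
  -o jsonpath='{.spec.containers[0].image}'
```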
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
Every time the nvidia-driver-daemonset pod restarts, it installs the driver all over again, even when the kernel modules are already loaded.
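One way to confirm that the kernel modules really are still loaded when the pod restarts, as a sketch (assumes a shell on the GPU node, e.g. via `oc debug node/<node-name>`):

```
# List the NVIDIA kernel modules currently loaded on the node
lsmod | grep -E '^nvidia'
```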
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
Just kill the `nvidia-driver-daemonset` pod and it will trigger a full driver reinstall (a sketch of the command follows).
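A sketch of that step, using the pod name from the listing below; the DaemonSet recreates the pod immediately, and the new pod runs the full install again:

```
# Delete the current driver pod (names/namespace as in this report)
kubectl delete pod -n OPERATOR_NAMESPACE nvidia-driver-daemonset-cmzkt

# Watch the replacement pod start up and re-run the driver install
kubectl get pods -n OPERATOR_NAMESPACE -w
```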
4. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE`

```
NAME                                        READY   STATUS      RESTARTS   AGE
console-plugin-nvidia-gpu-bcc995d4d-rz6n8   1/1     Running     0          21h
gpu-feature-discovery-dpk9r                 2/2     Running     0          4m21s
gpu-operator-6bf994488-rrlfg                1/1     Running     0          21h
nvidia-container-toolkit-daemonset-47qsz    1/1     Running     0          4m22s
nvidia-cuda-validator-9lmpr                 0/1     Completed   0          62s
nvidia-dcgm-exporter-dgm5z                  1/1     Running     0          4m21s
nvidia-device-plugin-daemonset-sbrrz        2/2     Running     0          4m21s
nvidia-driver-daemonset-cmzkt               1/1     Running     0          5m
nvidia-operator-validator-p78gj             1/1     Running     0          4m22s
```
- kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`

```
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                           AGE
gpu-feature-discovery                     1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                        57d
nvidia-container-toolkit-daemonset        1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                            38d
nvidia-dcgm-exporter                      1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                57d
nvidia-device-plugin-daemonset            1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                57d
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true    57d
nvidia-driver-daemonset                   1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                                       38d
nvidia-mig-manager                        0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                  57d
nvidia-operator-validator                 1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true                           57d
```
- If a pod/ds is in an error state or pending state: `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
- If a pod/ds is in an error state or pending state: `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
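For this issue specifically, the driver container log of the replacement pod is probably the most useful one; a sketch (pod name and namespace are the ones used elsewhere in this report):

```
# Current driver container log (shows the reinstall being performed)
kubectl logs -n OPERATOR_NAMESPACE nvidia-driver-daemonset-cmzkt -c nvidia-driver-ctr

# Previous instance, if the container itself restarted
kubectl logs -n OPERATOR_NAMESPACE nvidia-driver-daemonset-cmzkt -c nvidia-driver-ctr --previous
```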
- Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:13:00.0 Off | 0 |
| N/A 52C P0 29W / 70W | 14226MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2087110 C python3 7112MiB |
| 0 N/A N/A 2088265 C python3 7112MiB |
+-----------------------------------------------------------------------------------------+
```
- containerd logs: `journalctl -u containerd > containerd.log`
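Since this cluster runs CRI-O rather than containerd, the equivalent collection (assuming the standard `crio` systemd unit) would be:

```
journalctl -u crio > crio.log
```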
Collecting full debug bundle (optional):

```
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
```
NOTE: please refer to the must-gather script for the debug data it collects.
This bundle can be submitted to us via email: operator_feedback@nvidia.com