## Step 2: OCI GPU Data Plane Plugin installation on GPU Nodes
**NOTE**: Instructions for running the data plane plugin as a Kubernetes-native DaemonSet on [AMD and Nvidia nodes can be found here](./oci-scanner-plugin-helm/README.md). Supported GPUs are: MI300X, MI355X, A10, H100, and B200.
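As a minimal sketch of that deployment path (the release name and namespace below are illustrative assumptions, not values prescribed by this guide; see the linked chart README for the authoritative options):

```bash
# Minimal sketch: install the scanner plugin DaemonSet from the linked Helm chart directory.
# Release name and namespace are illustrative; adjust values per the chart's README.
helm install oci-scanner-plugin ./oci-scanner-plugin-helm \
  --namespace oci-gpu-scanner \
  --create-namespace
```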
1. **Navigate to Dashboards**: Go to the dashboard section of the OCI GPU Scanner Portal
2. **Go to the OCI GPU Scanner Install Script tab**:
# Installing and Using OKE Node Problem Detector (NPD) DaemonSet with OCI GPU Scanner Service
OKE NPD is an extension of https://github.com/kubernetes/node-problem-detector that processes GPU health check failures reported by the GPU Scanner service and sets conditions on the affected nodes. This enables proactive monitoring of GPU node health and early detection of issues.
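Once NPD sets a condition on an affected node, it can be inspected with standard kubectl queries. A minimal sketch follows; the node name is a placeholder, and the actual condition types depend on the NPD configuration rather than on anything stated here:

```bash
# Show the Conditions block for a specific GPU node (replace the placeholder).
kubectl describe node <gpu-node-name> | grep -A 10 "Conditions:"

# List every node with its condition types, to spot GPU health conditions set by NPD.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[*].type}{"\n"}{end}'
```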
**IMPORTANT**: The Node Problem Detector will only work on GPU nodes that are labeled with `oci.oraclecloud.com/oke-node-problem-detector-enabled="true"`. NPD only starts processing GPU health check events when drhpc is running on the nodes, so ensure drhpc is enabled when you install the Helm chart.
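For example, the label can be applied and verified with kubectl; the node name below is a placeholder:

```bash
# Enable NPD processing on a GPU node by applying the required label.
kubectl label node <gpu-node-name> oci.oraclecloud.com/oke-node-problem-detector-enabled="true"

# Confirm which nodes carry the label.
kubectl get nodes -l oci.oraclecloud.com/oke-node-problem-detector-enabled=true
```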