
Commit c2550ff

Adding NPD and updating readmes

Signed-off-by: Ritika Gupta <rtkgupta203@gmail.com>

1 parent 35a54fd

File tree: 4 files changed, +706 −4 lines


GETTING_STARTED_HELM_DEPLOY.md (1 addition, 1 deletion)

````diff
@@ -232,7 +232,7 @@ helm install lens ./helm -n lens \
 
 ## Step 2: OCI GPU Data Plane Plugin installation on GPU Nodes
 
-**NOTE**: Instructions for running the data plane plugin as a Kubernetes-native DaemonSet on [AMD MI300X nodes can be found here](./oci-scanner-plugin-amd-helm/README.md). The Nvidia DaemonSet offering is coming soon. Issue#22
+**NOTE**: Instructions for running the data plane plugin as a Kubernetes-native DaemonSet on [AMD and Nvidia nodes can be found here](./oci-scanner-plugin-helm/README.md). Supported GPUs: MI300x, MI355x, A10, H100, and B200.
 
 1. **Navigate to Dashboards**: Go to the dashboard section of the OCI GPU Scanner Portal
 2. **Go to Tab - OCI GPU Scanner Install Script**:
````
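For orientation, installing the data plane plugin chart that the new note links to looks roughly like the following. This is a minimal sketch pieced together from the README excerpt further down; running it from the repo root and the `--create-namespace` flag are assumptions, not shown in this commit.

```bash
# Hypothetical install of the data plane plugin chart from the repo root;
# release name, values file, and namespace follow the README excerpt below
helm install oci-gpu-scanner-plugin ./oci-scanner-plugin-helm \
  -f ./oci-scanner-plugin-helm/values.yaml \
  -n oci-gpu-scanner-plugin --create-namespace
```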

oci-scanner-plugin-helm/README.md (37 additions, 1 deletion)
````diff
@@ -12,6 +12,7 @@ Multi-vendor GPU monitoring and health check solution for OCI compute instances
 - **Pod Node Mapper**: Pod-to-node relationship tracking
 - **Health Check**: GPU performance testing (optional)
 - **DRHPC**: Distributed diagnostic monitoring for both AMD and NVIDIA
+- **Node Problem Detector**: GPU health monitoring via DRHPC integration (requires labeling)
 
 ## Configuration
 
````
````diff
@@ -27,6 +28,11 @@ helm install oci-gpu-scanner-plugin . -f values.yaml -n oci-gpu-scanner-plugin \
 helm install oci-gpu-scanner-plugin ./oci-scanner-plugin-amd-helm \
   --set healthCheck.enabled=true
 
+# Enable Node Problem Detector (requires node labeling and drhpc to be enabled; see below)
+helm upgrade oci-gpu-scanner-plugin . \
+  --set nodeProblemDetector.enabled=true \
+  --set drhpc.enabled=true
+
 # Uninstall
 helm uninstall oci-gpu-scanner-plugin -n oci-gpu-scanner-plugin
 ```
````
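Once the upgrade has run, helm itself can confirm that both toggles took effect, and a local dry-run render previews what the flags add. This is a sketch using standard helm subcommands; the `grep` summary is just one convenient way to skim the output.

```bash
# Show the user-supplied values on the live release; both flags should appear
helm get values oci-gpu-scanner-plugin -n oci-gpu-scanner-plugin

# Render the chart locally (nothing is applied) to preview the extra
# resources that enabling nodeProblemDetector and drhpc brings in
helm template oci-gpu-scanner-plugin . \
  --set nodeProblemDetector.enabled=true \
  --set drhpc.enabled=true | grep '^kind:' | sort | uniq -c
```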
````diff
@@ -36,4 +42,34 @@ helm uninstall oci-gpu-scanner-plugin -n oci-gpu-scanner-plugin
 - Kubernetes cluster with AMD / Nvidia GPU nodes
 - Prometheus Push Gateway accessible from cluster
 - AMD GPU drivers installed on nodes
-- Nvidia GPU Drivers installed on the nodes
+- Nvidia GPU Drivers installed on the nodes
+
+## Node Problem Detector Setup
+
+**IMPORTANT**: The Node Problem Detector only works on GPU nodes labeled with `oci.oraclecloud.com/oke-node-problem-detector-enabled="true"`. It reads its GPU health metrics from DRHPC, so make sure DRHPC is enabled when deploying.
+
+Before enabling NPD, label your GPU nodes:
+
+```bash
+# Label individual nodes
+kubectl label nodes <node-name> oci.oraclecloud.com/oke-node-problem-detector-enabled=true
+
+# Label all AMD GPU nodes
+kubectl label nodes --selector=amd.com/gpu=true oci.oraclecloud.com/oke-node-problem-detector-enabled=true
+
+# Label all NVIDIA GPU nodes
+kubectl label nodes --selector=nvidia.com/gpu=true oci.oraclecloud.com/oke-node-problem-detector-enabled=true
+
+# Verify labels
+kubectl get nodes --show-labels | grep oke-node-problem-detector-enabled
+```
+
+Then enable NPD:
+
+```bash
+helm upgrade oci-gpu-scanner-plugin . \
+  --set nodeProblemDetector.enabled=true \
+  --set drhpc.enabled=true
+```
+
+**Note**: NPD requires DRHPC to be enabled and running to provide GPU health check data.
````
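After NPD is running, its findings should surface on the nodes themselves. A quick verification pass could look like the sketch below; the exact pod names and the condition types NPD reports are set by the chart's configuration and are not shown in this commit.

```bash
# Check that the plugin pods landed on the labeled GPU nodes
kubectl get pods -n oci-gpu-scanner-plugin -o wide

# NPD publishes problems as node conditions and events; condition names
# depend on the chart's NPD config, so inspect a labeled node directly
kubectl describe node <node-name> | grep -A 12 'Conditions:'
kubectl get events --field-selector involvedObject.kind=Node
```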
