
Commit 5db4366

Merge pull request #34 from oracle-quickstart/rg_plugin_helm
Node problem detector integration into helm charts
2 parents 8c21166 + f7142f6 commit 5db4366

File tree

5 files changed: +145, -73 lines


GETTING_STARTED_HELM_DEPLOY.md

Lines changed: 1 addition & 1 deletion
@@ -232,7 +232,7 @@ helm install lens ./helm -n lens \
 
 ## Step 2: OCI GPU Data Plane Plugin installation on GPU Nodes
 
-**NOTE**: Instructions for running the data plane plugin as a Kubernetes-native daemon set on [AMD MI300X nodes can be found here](./oci-scanner-plugin-amd-helm/README.md). An Nvidia daemon-set offering is coming soon. Issue#22
+**NOTE**: Instructions for running the data plane plugin as a Kubernetes-native daemon set on [AMD and Nvidia nodes can be found here](./oci-scanner-plugin-helm/README.md). Supported GPUs: MI300X, MI355X, A10, H100, and B200.
 
 1. **Navigate to Dashboards**: Go to the dashboard section of the OCI GPU Scanner Portal
 2. **Go to Tab - OCI GPU Scanner Install Script**:

OKE_NPD_DEPLOY.md

Lines changed: 0 additions & 39 deletions
This file was deleted.

oci-scanner-plugin-helm/README.md

Lines changed: 48 additions & 1 deletion
@@ -12,6 +12,7 @@ Multi-vendor GPU monitoring and health check solution for OCI compute instances
 - **Pod Node Mapper**: Pod-to-node relationship tracking
 - **Health Check**: GPU performance testing (optional)
 - **DRHPC**: Distributed diagnostic monitoring for both AMD and NVIDIA
+- **Node Problem Detector**: GPU health monitoring via DRHPC integration (requires node labeling)
 
 ## Configuration
 
@@ -27,6 +28,11 @@ helm install oci-gpu-scanner-plugin . -f values.yaml -n oci-gpu-scanner-plugin \
 helm install oci-gpu-scanner-plugin ./oci-scanner-plugin-amd-helm \
   --set healthCheck.enabled=true
 
+# Enable Node Problem Detector (requires node labeling and drhpc enabled; see below)
+helm upgrade oci-gpu-scanner-plugin . \
+  --set nodeProblemDetector.enabled=true \
+  --set drhpc.enabled=true
+
 # Uninstall
 helm uninstall oci-gpu-scanner-plugin -n oci-gpu-scanner-plugin
 ```
@@ -36,4 +42,45 @@ helm uninstall oci-gpu-scanner-plugin -n oci-gpu-scanner-plugin
 - Kubernetes cluster with AMD / Nvidia GPU nodes
 - Prometheus Push Gateway accessible from cluster
 - AMD GPU drivers installed on nodes
-- Nvidia GPU Drivers installed on the nodes
+- Nvidia GPU drivers installed on the nodes
+
+# Installing and Using the OKE Node Problem Detector (NPD) DaemonSet with the OCI GPU Scanner Service
+
+OKE NPD is an extension of https://github.com/kubernetes/node-problem-detector that processes GPU health check failures reported by the GPU Scanner service and sets conditions on the affected nodes. This enables proactive monitoring of GPU node health and early detection of issues.
+
+**IMPORTANT**: The Node Problem Detector only works on GPU nodes labeled with `oci.oraclecloud.com/oke-node-problem-detector-enabled="true"`. NPD only starts processing GPU health check events once DRHPC is running on the nodes, so make sure DRHPC is enabled when you install the helm chart.
+
+Before enabling NPD, label your GPU nodes:
+
+```bash
+# Label individual nodes
+kubectl label nodes <node-name> oci.oraclecloud.com/oke-node-problem-detector-enabled=true
+
+# Label all AMD GPU nodes
+kubectl label nodes --selector=amd.com/gpu=true oci.oraclecloud.com/oke-node-problem-detector-enabled=true
+
+# Label all NVIDIA GPU nodes
+kubectl label nodes --selector=nvidia.com/gpu=true oci.oraclecloud.com/oke-node-problem-detector-enabled=true
+
+# Verify labels
+kubectl get nodes --show-labels | grep oke-node-problem-detector-enabled
+```
+
+Then enable NPD:
+
+```bash
+helm upgrade oci-gpu-scanner-plugin . \
+  --set nodeProblemDetector.enabled=true \
+  --set drhpc.enabled=true
+```
+
+**Note**: NPD requires DRHPC to be enabled and running to provide GPU health check data.
+
+Verify that the NPD DaemonSet has been installed and is running:
+
+```bash
+kubectl get pods -l app=oke-node-problem-detector -o wide -n kube-system
+```
+
+The output should show `oke-node-problem-detector` pods in the Running state on all targeted GPU nodes.
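
Once labeled nodes are reporting, the conditions NPD sets can be inspected on the node objects themselves. A minimal verification sketch: `<node-name>` is a placeholder, and the GPU-specific condition types come from the `dr-hpc.json` custom plugin monitor, which is not shown in this diff, so no particular condition name is assumed here.

```bash
# Sketch: show the Conditions block NPD maintains on a labeled GPU node.
# <node-name> is a placeholder; GPU-specific condition types are defined by
# the dr-hpc.json custom plugin monitor and may vary between deployments.
kubectl describe node <node-name> | grep -A 15 "Conditions:"

# Alternatively, list every condition type, status, and reason via jsonpath.
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
```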

existing_cluster_deploy/oke-node-problem-detector.yaml renamed to oci-scanner-plugin-helm/templates/node-problem-detector.yaml

Lines changed: 32 additions & 30 deletions
@@ -1,10 +1,16 @@
+{{- if .Values.nodeProblemDetector.enabled }}
+---
 apiVersion: apps/v1
 kind: DaemonSet
 metadata:
   name: oke-node-problem-detector
-  namespace: kube-system
+  namespace: {{ .Values.nodeProblemDetector.namespace | default "kube-system" }}
   labels:
     app: oke-node-problem-detector
+    component: gpu-monitoring
+    {{- with .Values.global.labels }}
+    {{- toYaml . | nindent 4 }}
+    {{- end }}
 spec:
   selector:
     matchLabels:
@@ -13,9 +19,14 @@ spec:
     metadata:
       labels:
         app: oke-node-problem-detector
+        component: gpu-monitoring
     spec:
       nodeSelector:
+        {{- if .Values.nodeProblemDetector.nodeSelector }}
+        {{- toYaml .Values.nodeProblemDetector.nodeSelector | nindent 8 }}
+        {{- else }}
         oci.oraclecloud.com/oke-node-problem-detector-enabled: "true"
+        {{- end }}
       affinity:
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
@@ -30,7 +41,9 @@ spec:
         - /node-problem-detector --logtostderr --prometheus-port=${PROMETHEUS_PORT}
           --prometheus-address 0.0.0.0 --config.system-log-monitor=/config/kernel-monitor.json,/config/readonly-monitor.json
           --config.custom-plugin-monitor=/node-problem-detector-custom-check/imds_reachability.json
+          {{- if .Values.nodeProblemDetector.enableGpuChecks }}
           --config.custom-plugin-monitor=/node-problem-detector-gpu-check/dr-hpc.json
+          {{- end }}
           --enable-k8s-exporter=true
         command:
         - /bin/sh
@@ -42,21 +55,16 @@ spec:
                   apiVersion: v1
                   fieldPath: spec.nodeName
             - name: PROMETHEUS_PORT
-              value: "20257"
-          image: phx.ocir.io/idnlixcmffxd/oke-public-node-problem-detector:v0.8.20.7@sha256:399b506dbfa5c33e60a247d0d3199f025d242b7a7480c956446e70eaa090c599
-          imagePullPolicy: Always
+              value: {{ .Values.nodeProblemDetector.prometheusPort | default "20257" | quote }}
+          image: {{ .Values.nodeProblemDetector.image.repository }}:{{ .Values.nodeProblemDetector.image.tag }}@{{ .Values.nodeProblemDetector.image.sha256 }}
+          imagePullPolicy: {{ .Values.nodeProblemDetector.image.pullPolicy | default "Always" }}
           name: oke-node-problem-detector
           ports:
-          - containerPort: 20257
+          - containerPort: {{ .Values.nodeProblemDetector.prometheusPort | default 20257 }}
             name: metrics
            protocol: TCP
          resources:
-            limits:
-              cpu: 10m
-              memory: 80Mi
-            requests:
-              cpu: 10m
-              memory: 80Mi
+            {{- toYaml .Values.nodeProblemDetector.resources | nindent 12 }}
          securityContext:
            privileged: true
          volumeMounts:
@@ -74,27 +82,17 @@ spec:
          - mountPath: /node-problem-detector-custom-check
            name: node-problem-detector-custom-check
            readOnly: true
+          {{- if .Values.nodeProblemDetector.enableGpuChecks }}
          - mountPath: /node-problem-detector-gpu-check
            name: node-problem-detector-gpu-check
            readOnly: true
+          {{- end }}
       serviceAccountName: oke-node-problem-detector-sa
       tolerations:
-      - key: CriticalAddonsOnly
-        operator: Exists
-      - key: oci.oraclecloud.com/oke-is-preemptible
-        operator: Exists
-      - effect: NoSchedule
-        key: nvidia.com/gpu
-        operator: Exists
-      - effect: NoSchedule
-        key: amd.com/gpu
-        operator: Exists
-      - effect: NoSchedule
-        key: oci.oraclecloud.com/node-auto-repair-scheduled
-        operator: Exists
+        {{- toYaml .Values.nodeProblemDetector.tolerations | nindent 8 }}
       volumes:
       - hostPath:
-          path: /home/ubuntu/oci-dr-hpc-v2/
+          path: {{ .Values.nodeProblemDetector.drhpcResultsPath | default "/var/lib/oci-dr-hpc-v2" }}
         name: log
       - hostPath:
          path: /dev/kmsg
@@ -109,17 +107,19 @@ spec:
          defaultMode: 493
          name: node-problem-detector-custom-check
        name: node-problem-detector-custom-check
+      {{- if .Values.nodeProblemDetector.enableGpuChecks }}
      - configMap:
          defaultMode: 493
          name: node-problem-detector-gpu-check
        name: node-problem-detector-gpu-check
+      {{- end }}
 
 ---
 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: node-problem-detector-custom-check
-  namespace: kube-system
+  namespace: {{ .Values.nodeProblemDetector.namespace | default "kube-system" }}
 data:
   imds_reachability.sh: |
     #!/bin/bash
@@ -147,7 +147,6 @@ data:
       exit 1
     fi
 
-
   imds_reachability.json: |
     {
       "plugin": "custom",
@@ -177,12 +176,13 @@ data:
      ]
    }
 
+{{- if .Values.nodeProblemDetector.enableGpuChecks }}
 ---
 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: node-problem-detector-gpu-check
-  namespace: kube-system
+  namespace: {{ .Values.nodeProblemDetector.namespace | default "kube-system" }}
 data:
   dr_hpc_check.sh: |
     #!/bin/bash
@@ -579,13 +579,14 @@ data:
        }
      ]
    }
+{{- end }}
 
 ---
 apiVersion: v1
 kind: ServiceAccount
 metadata:
   name: oke-node-problem-detector-sa
-  namespace: kube-system
+  namespace: {{ .Values.nodeProblemDetector.namespace | default "kube-system" }}
 
 ---
 apiVersion: rbac.authorization.k8s.io/v1
@@ -599,4 +600,5 @@ roleRef:
 subjects:
 - kind: ServiceAccount
   name: oke-node-problem-detector-sa
-  namespace: kube-system
+  namespace: {{ .Values.nodeProblemDetector.namespace | default "kube-system" }}
+{{- end }}
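
To sanity-check the conditional blocks above without touching a cluster, the manifest can be rendered locally. A sketch, assuming it is run from the oci-scanner-plugin-helm chart directory; the release name is illustrative:

```bash
# Render only the NPD manifest with the feature flags enabled and inspect the output.
helm template oci-gpu-scanner-plugin . \
  --set nodeProblemDetector.enabled=true \
  --set nodeProblemDetector.enableGpuChecks=true \
  --show-only templates/node-problem-detector.yaml

# Basic chart validation
helm lint .
```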

oci-scanner-plugin-helm/values.yaml

Lines changed: 64 additions & 2 deletions
@@ -163,7 +163,7 @@ rbac:
 
 # Node Exporter
 nodeExporter:
-  enabled: true
+  enabled: false
 
 # Override default values for the official chart
 prometheus-node-exporter:
@@ -209,4 +209,66 @@ podNodeMapper:
       cpu: "50m"
     limits:
       memory: "256Mi"
-      cpu: "200m"
+      cpu: "200m"
+
+# Node Problem Detector
+nodeProblemDetector:
+  enabled: false
+  enableGpuChecks: true  # Enable GPU health checks via DRHPC
+  namespace: kube-system
+
+  # DRHPC results path - must match the hostPath where DRHPC writes results
+  # Default is /var/lib/oci-dr-hpc-v2 which matches drhpc.resultsHostPath
+  drhpcResultsPath: "/var/lib/oci-dr-hpc-v2"
+
+  image:
+    repository: phx.ocir.io/idnlixcmffxd/oke-public-node-problem-detector
+    tag: v0.8.20.7
+    sha256: sha256:399b506dbfa5c33e60a247d0d3199f025d242b7a7480c956446e70eaa090c599
+    pullPolicy: Always
+
+  prometheusPort: 20257
+
+  # Node selector - defaults to the OKE node-problem-detector label
+  nodeSelector:
+    oci.oraclecloud.com/oke-node-problem-detector-enabled: "true"
+
+  # Tolerations for GPU nodes
+  tolerations:
+    - key: CriticalAddonsOnly
+      operator: Exists
+    - key: oci.oraclecloud.com/oke-is-preemptible
+      operator: Exists
+    - effect: NoSchedule
+      key: nvidia.com/gpu
+      operator: Exists
+    - effect: NoSchedule
+      key: amd.com/gpu
+      operator: Exists
+    - effect: NoSchedule
+      key: oci.oraclecloud.com/node-auto-repair-scheduled
+      operator: Exists
+
+  resources:
+    requests:
+      cpu: 10m
+      memory: 80Mi
+    limits:
+      cpu: 10m
+      memory: 80Mi
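
The defaults above can also be overridden per environment with a values file instead of many --set flags. A sketch using a hypothetical override file name (my-npd-values.yaml); the keys mirror the block above, and the resource figures are illustrative, not recommendations:

```bash
# Hypothetical override file; only keys that differ from the chart defaults are set.
cat > my-npd-values.yaml <<'EOF'
nodeProblemDetector:
  enabled: true
  enableGpuChecks: true
  resources:
    requests:
      cpu: 20m        # illustrative value, not a sizing recommendation
      memory: 128Mi
    limits:
      cpu: 50m
      memory: 256Mi
drhpc:
  enabled: true
EOF

# Apply the overrides to an existing release in its namespace.
helm upgrade oci-gpu-scanner-plugin . -f my-npd-values.yaml -n oci-gpu-scanner-plugin
```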
