Commit 48e45ae

gjulianm and cswatt authored
gpu: add instructions for non-k8s setups (#32760)
* Add non-k8s instructions
* Update docker instructions
* Update setup.md
* Bump minimal version

---------

Co-authored-by: cecilia saixue wat-kim <cecilia.watt@datadoghq.com>
1 parent 5bd46a2 commit 48e45ae

1 file changed (+155 -3 lines changed)

content/en/gpu_monitoring/setup.md

Lines changed: 155 additions & 3 deletions
@@ -12,16 +12,18 @@ To begin using Datadog's GPU Monitoring, your environment must meet the followin
1212

1313
#### Minimum version requirements
1414

15-
- **Datadog Agent**: version 7.70.1
15+
- **Datadog Agent**: version 7.72.2
1616
- [**Datadog Operator**][5]: version 1.18, _or_ [**Datadog Helm chart**][6]: version 3.137.3
1717
- **Operating system**: Linux
1818
- (Optional) For advanced eBPF metrics, Linux kernel version 5.8
1919
- **NVIDIA driver**: version 450.51
2020
- **Kubernetes**: 1.22 with PodResources API active
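The kernel and NVIDIA driver minimums above can be checked on a host before installing anything. The following is a minimal sketch, not an official Datadog tool: `version_ge` is a hypothetical helper name introduced here, and it assumes `sort -V` from GNU coreutils is available. The driver check runs only when `nvidia-smi` is installed.

```shell
#!/bin/sh
# Return success when version $1 is greater than or equal to version $2.
# Assumption: `sort -V` (version sort) from GNU coreutils is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Kernel: strip the distribution suffix, for example 5.15.0-91-generic -> 5.15.0.
kernel="$(uname -r | cut -d- -f1)"
if version_ge "$kernel" "5.8"; then
  echo "kernel $kernel: OK for advanced eBPF metrics"
else
  echo "kernel $kernel: older than 5.8, advanced eBPF metrics unavailable"
fi

# NVIDIA driver: query the first GPU, but only if nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if version_ge "$driver" "450.51"; then
    echo "NVIDIA driver $driver: OK"
  else
    echo "NVIDIA driver $driver: older than 450.51"
  fi
else
  echo "nvidia-smi not found: cannot check the NVIDIA driver version"
fi
```

The comparison treats each dot-separated field as a number, so `5.15.0` correctly ranks above `5.8` even though it sorts lower as plain text.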
2121

22-
## Set up GPU Monitoring on a uniform cluster
22+
## Set up GPU Monitoring on a uniform cluster or non-Kubernetes environment
2323

24-
In a uniform cluster, all nodes have GPU devices.
24+
Follow these basic steps to set up GPU Monitoring in either of the following environments:
25+
- In a Kubernetes cluster where **all** the nodes have GPU devices
26+
- In a non-Kubernetes environment, such as Docker or non-containerized Linux
2527

2628
{{< tabs >}}
2729
{{% tab "Datadog Operator" %}}
@@ -97,6 +99,156 @@ In a uniform cluster, all nodes have GPU devices.
9799
[2]: https://github.com/DataDog/datadog-agent/releases
98100

99101
{{% /tab %}}
102+
103+
{{% tab "Docker" %}}
104+
105+
To enable GPU Monitoring in Docker without advanced eBPF metrics, use the following configuration when starting the container Agent:
106+
107+
```shell
docker run \
  --pid host \
  --gpus all \
  -e DD_GPU_ENABLED=true \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:latest
```
117+
118+
To enable advanced eBPF metrics, use the following configuration, which grants the permissions required to run eBPF programs:
119+
120+
```shell
docker run \
  --cgroupns host \
  --pid host \
  --gpus all \
  -e DD_API_KEY="<DATADOG_API_KEY>" \
  -e DD_GPU_MONITORING_ENABLED=true \
  -e DD_GPU_ENABLED=true \
  -v /:/host/root:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /lib/modules:/lib/modules:ro \
  -v /usr/src:/usr/src:ro \
  -v /var/tmp/datadog-agent/system-probe/build:/var/tmp/datadog-agent/system-probe/build \
  -v /var/tmp/datadog-agent/system-probe/kernel-headers:/var/tmp/datadog-agent/system-probe/kernel-headers \
  -v /etc/apt:/host/etc/apt:ro \
  -v /etc/yum.repos.d:/host/etc/yum.repos.d:ro \
  -v /etc/zypp:/host/etc/zypp:ro \
  -v /etc/pki:/host/etc/pki:ro \
  -v /etc/yum/vars:/host/etc/yum/vars:ro \
  -v /etc/dnf/vars:/host/etc/dnf/vars:ro \
  -v /etc/rhsm:/host/etc/rhsm:ro \
  -e HOST_ROOT=/host/root \
  --security-opt apparmor:unconfined \
  --cap-add=SYS_ADMIN \
  --cap-add=SYS_RESOURCE \
  --cap-add=SYS_PTRACE \
  --cap-add=IPC_LOCK \
  --cap-add=CHOWN \
  gcr.io/datadoghq/agent:latest
```
153+
154+
Replace `<DATADOG_API_KEY>` with your [Datadog API key][1].
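
The eBPF-enabled command above bind-mounts many host paths, and several of them (such as `/etc/apt`, `/etc/yum.repos.d`, and `/etc/zypp`) exist only on particular distributions. The following is a minimal preflight sketch, not part of the Datadog tooling: `missing_paths` is a hypothetical helper name introduced here, and it only reports which mount sources are absent on the current host.

```shell
#!/bin/sh
# Print each given path that does not exist on this host, one per line.
missing_paths() {
  for p in "$@"; do
    [ -e "$p" ] || echo "$p"
  done
}

# Host paths bind-mounted by the eBPF-enabled `docker run` command.
missing_paths \
  /var/run/docker.sock /proc /sys/fs/cgroup /sys/kernel/debug \
  /lib/modules /usr/src \
  /etc/apt /etc/yum.repos.d /etc/zypp /etc/pki \
  /etc/yum/vars /etc/dnf/vars /etc/rhsm
```

Some package-manager directories are expected to be missing on any given distribution, so the output is a list to review rather than a list of errors.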
155+
156+
[1]: https://app.datadoghq.com/organization-settings/api-keys
157+
158+
{{% /tab %}}
159+
{{% tab "Docker Compose" %}}
160+
161+
If you use `docker-compose`, make the following additions to the Datadog Agent service:
162+
163+
```yaml
version: '3'
services:
  datadog:
    image: "gcr.io/datadoghq/agent:latest"
    environment:
      - DD_GPU_ENABLED=true
      - DD_API_KEY=<DATADOG_API_KEY>
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
183+
184+
To enable advanced eBPF metrics, use the following configuration, which grants the permissions required to run eBPF programs:
185+
186+
```yaml
version: '3'
services:
  datadog:
    image: "gcr.io/datadoghq/agent:latest"
    environment:
      - DD_GPU_MONITORING_ENABLED=true # only for advanced eBPF metrics
      - DD_GPU_ENABLED=true
      - DD_API_KEY=<DATADOG_API_KEY>
      - HOST_ROOT=/host/root
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
      - /sys/kernel/debug:/sys/kernel/debug
      - /:/host/root
    cap_add:
      - SYS_ADMIN
      - SYS_RESOURCE
      - SYS_PTRACE
      - IPC_LOCK
      - CHOWN
    security_opt:
      - apparmor:unconfined
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
218+
219+
{{% /tab %}}
220+
{{% tab "Linux (non-containerized)" %}}
221+
222+
Modify your `/etc/datadog-agent/datadog.yaml` file to enable GPU monitoring:
223+
224+
```yaml
gpu:
  enabled: true
```
228+
229+
To enable advanced eBPF metrics, follow these steps:
230+
231+
1. If `/etc/datadog-agent/system-probe.yaml` does not exist, create it from `system-probe.yaml.example`:
232+
233+
```shell
sudo -u dd-agent install -m 0640 /etc/datadog-agent/system-probe.yaml.example /etc/datadog-agent/system-probe.yaml
```
236+
237+
2. Edit `/etc/datadog-agent/system-probe.yaml` and enable GPU monitoring in system-probe:
238+
239+
```yaml
gpu_monitoring:
  enabled: true
```
243+
244+
3. Restart the Datadog Agent:
245+
246+
```shell
sudo systemctl restart datadog-agent
```
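
Step 1 applies only when `system-probe.yaml` does not exist yet, but the `install` command shown there overwrites an existing file if it is run anyway. The following is a minimal sketch of a guarded version. `ensure_config` is a hypothetical helper name introduced here, and the demonstration runs against a scratch directory with placeholder file content so that nothing under `/etc` is touched.

```shell
#!/bin/sh
# Copy the example file to the target with mode 0640, but only when the
# target is absent. Usage: ensure_config <example_file> <target_file>
ensure_config() {
  if [ -e "$2" ]; then
    echo "$2 already exists, leaving it unchanged"
  else
    install -m 0640 "$1" "$2" && echo "created $2 from $1"
  fi
}

# Demonstration in a scratch directory (placeholder content, not the real example file).
demo="$(mktemp -d)"
echo "# placeholder example file" > "$demo/system-probe.yaml.example"
ensure_config "$demo/system-probe.yaml.example" "$demo/system-probe.yaml"  # creates the file
ensure_config "$demo/system-probe.yaml.example" "$demo/system-probe.yaml"  # leaves it unchanged
rm -rf "$demo"
```

On a real host, the same guard would wrap the command from step 1, using the `/etc/datadog-agent/` paths and running as the `dd-agent` user.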
249+
250+
{{% /tab %}}
251+
100252
{{< /tabs >}}
101253

102254
## Set up GPU Monitoring on a mixed cluster
