chore: add diagnostic info for nvidia-device-plugin, dcgm is active or not by sulixu · Pull Request #7858 · Azure/AgentBaker

sulixu · 2026-02-11T23:49:08Z

What this PR does / why we need it:

We observed that on some nodes these three systemd services are installed but not active. This PR adds diagnostic logging so engineers can check each service's actual status in the Kusto logs and get some insight before any support ticket is filed.

CSE logs will show:
AKS.CSE.verify.nvidia-device-plugin.active Completed: echo service nvidia-device-plugin is active
AKS.CSE.verify.nvidia-dcgm.active Completed: echo service nvidia-dcgm is active
AKS.CSE.verify.nvidia-dcgm-exporter.active Completed: echo service nvidia-dcgm-exporter is active

Or if not active:
AKS.CSE.verify.nvidia-device-plugin.inactive Completed: echo service nvidia-device-plugin is inactive. systemctl: . journal:

Which issue(s) this PR fixes:

Fixes #

Copilot

Pull request overview

This PR adds diagnostic logging for NVIDIA managed experience services (nvidia-device-plugin, nvidia-dcgm, nvidia-dcgm-exporter) to help engineers troubleshoot GPU node issues. After starting each service, the new verifyNvidiaServiceAndLog function checks if each service is active and logs the status to the events pipeline. If a service is not active, it collects diagnostic information (systemctl status + journal logs) which will be available in Kusto logs for debugging support tickets.

Changes:

Added verifyNvidiaServiceAndLog() function to verify and log NVIDIA service status
Added three verification calls after starting NVIDIA services in startNvidiaManagedExpServices()

Copilot

Pull request overview

Copilot reviewed 34 out of 66 changed files in this pull request and generated 2 comments.

Copilot · 2026-02-12T18:12:58Z

parts/linux/cloud-init/artifacts/cse_config.sh

+    # Use timeout to prevent hanging; default to "unknown" on any failure
+    status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null) || status="unknown"


systemctl is-active returns a non-zero exit code for inactive/failed services even though it prints the status string. Because the assignment is followed by || status="unknown", any non-active state will get overwritten to unknown, so you'll never emit ...inactive/...failed task names as described in the PR. Capture the output without treating the exit code as a failure (e.g., ignore the exit status inside the command substitution, and only default to unknown when the output is empty or the command times out).

Suggested change

-    # Use timeout to prevent hanging; default to "unknown" on any failure
-    status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null) || status="unknown"
+    # Use timeout to prevent hanging; capture textual status even if exit code is non-zero.
+    # Default to "unknown" only if no status output is produced (e.g., timeout or hard failure).
+    status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null || true)
+    if [ -z "$status" ]; then
+        status="unknown"
+    fi
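
The difference between the two forms can be reproduced without a real systemd unit. The following is a minimal sketch, not code from the PR: `fake_is_active` is a hypothetical stub standing in for `systemctl is-active` on a stopped unit, and its exit status of 3 is an assumption chosen for illustration (any non-zero status shows the same effect).

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `systemctl is-active <unit>` on a stopped unit:
# it prints the state and exits non-zero (3 is assumed here for illustration).
fake_is_active() {
    echo "inactive"
    return 3
}

# Original form: the non-zero exit status triggers the fallback,
# so the printed state is overwritten with "unknown".
status_buggy=$(fake_is_active 2>/dev/null) || status_buggy="unknown"

# Suggested form: ignore the exit status inside the substitution and
# fall back only when nothing was printed at all.
status_fixed=$(fake_is_active 2>/dev/null || true)
if [ -z "$status_fixed" ]; then
    status_fixed="unknown"
fi

echo "buggy=${status_buggy} fixed=${status_fixed}"
```

With the stub above, the first variable ends up as `unknown` and the second as `inactive`, which is why only the second form can produce task names such as `AKS.CSE.verify.<service>.inactive`.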

Copilot · 2026-02-12T18:12:59Z

parts/linux/cloud-init/artifacts/cse_config.sh

+    systemctl_output=$(timeout ${timeout_seconds} systemctl status "$service" --no-pager -l 2>&1) || systemctl_output="timeout or error getting status"
+    journal_output=$(timeout ${timeout_seconds} journalctl -u "$service" -n 50 --no-pager 2>&1) || journal_output="timeout or error getting journal"
+
+    # Log with status in task name and diagnostic info echoed
+    logs_to_events "AKS.CSE.verify.${service}.${status}" "echo service $service is $status. systemctl: ${systemctl_output}. journal: ${journal_output}"


The logs_to_events payload here inlines full systemctl status and journalctl output into the event Message. Those outputs can be very large and contain lots of punctuation/newlines; combined with logs_to_events using unquoted ${@} / $*, this risks excessive word-splitting/globbing, very large JSON event files, and potentially truncated/failed ingestion. Consider truncating/sanitizing the captured output (e.g., replace newlines with spaces, cap to a few KB / last N lines), or write diagnostics to a file and log only the file path plus a short summary in the event.
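
One way the truncate-and-sanitize suggestion might look is sketched below. This is only an illustration under stated assumptions: `sanitize_diag` is a hypothetical helper that does not exist in AgentBaker, and the line and byte limits are arbitrary defaults, not values from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of AgentBaker): keep the last N lines of
# captured diagnostics, flatten them to a single line, and cap the size.
sanitize_diag() {
    local text="$1"
    local max_lines="${2:-20}"    # assumed default: last 20 lines
    local max_bytes="${3:-2048}"  # assumed default: 2 KB cap
    printf '%s\n' "$text" \
        | tail -n "$max_lines" \
        | tr '\n' ' ' \
        | tr -s ' ' \
        | head -c "$max_bytes" \
        | sed 's/[[:space:]]*$//'
}

# Example input with multiple lines and repeated spaces.
raw=$'line one\nline two\nline   three'
summary=$(sanitize_diag "$raw" 2 64)
echo "$summary"
```

The flattened, capped string could then be interpolated into the event message in place of the raw `systemctl status` and `journalctl` output. It does not address the unquoted `${@}` / `$*` handling inside `logs_to_events` itself, which would still need separate attention.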

Update GPU systemd verification script

c02773a

Copilot AI review requested due to automatic review settings February 11, 2026 23:49

sulixu requested review from AbelHu, Devinwong, awesomenix, cameronmeissner, djsly, ganeshkumarashok, juan-lee, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, timmy-wright and zachary-bailey as code owners February 11, 2026 23:49

sulixu temporarily deployed to test February 11, 2026 23:49 — with GitHub Actions Inactive

Copilot started reviewing on behalf of sulixu February 11, 2026 23:49 View session

sulixu changed the title ~~add diagnostic info for nvidia-device-plugin, dcgm is active or not~~ chore: add diagnostic info for nvidia-device-plugin, dcgm is active or not Feb 11, 2026

Copilot AI reviewed Feb 11, 2026

update all the customData testdata files changed because they all inc…

f459ba9

…lude cse_config.sh

sulixu requested review from YaoC and yewmsft as code owners February 12, 2026 17:56

sulixu temporarily deployed to test February 12, 2026 17:56 — with GitHub Actions Inactive

fix: regenerate testdata after adding NVIDIA service verification log…

cfb2877

…ging The previous commit added the verifyNvidiaServiceAndLog function to cse_config.sh but the testdata snapshots were not properly regenerated.

Copilot AI review requested due to automatic review settings February 12, 2026 18:08

sulixu temporarily deployed to test February 12, 2026 18:08 — with GitHub Actions Inactive

Copilot started reviewing on behalf of sulixu February 12, 2026 18:09 View session

Copilot AI reviewed Feb 12, 2026

sulixu wants to merge 3 commits into main from gpu-systemd-verify
