chore: add diagnostic info for nvidia-device-plugin, dcgm is active or not #7858

Open
sulixu wants to merge 3 commits into main from gpu-systemd-verify

sulixu commented Feb 11, 2026

What this PR does / why we need it:

We observed that on some nodes these three systemd services are installed but not active. This PR adds diagnostic logging so that engineers can check each service's actual status in the Kusto logs and get some insight ahead of any support ticket.

CSE logs will show:
AKS.CSE.verify.nvidia-device-plugin.active Completed: echo service nvidia-device-plugin is active
AKS.CSE.verify.nvidia-dcgm.active Completed: echo service nvidia-dcgm is active
AKS.CSE.verify.nvidia-dcgm-exporter.active Completed: echo service nvidia-dcgm-exporter is active

Or if not active:
AKS.CSE.verify.nvidia-device-plugin.inactive Completed: echo service nvidia-device-plugin is inactive. systemctl: . journal:

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings February 11, 2026 23:49
@sulixu sulixu changed the title add diagnostic info for nvidia-device-plugin, dcgm is active or not chore: add diagnostic info for nvidia-device-plugin, dcgm is active or not Feb 11, 2026

Copilot AI left a comment

Pull request overview

This PR adds diagnostic logging for NVIDIA managed experience services (nvidia-device-plugin, nvidia-dcgm, nvidia-dcgm-exporter) to help engineers troubleshoot GPU node issues. After starting each service, the new verifyNvidiaServiceAndLog function checks if each service is active and logs the status to the events pipeline. If a service is not active, it collects diagnostic information (systemctl status + journal logs) which will be available in Kusto logs for debugging support tickets.

Changes:

  • Added verifyNvidiaServiceAndLog() function to verify and log NVIDIA service status
  • Added three verification calls after starting NVIDIA services in startNvidiaManagedExpServices()

…ging

The previous commit added the verifyNvidiaServiceAndLog function to
cse_config.sh but the testdata snapshots were not properly regenerated.
Copilot AI review requested due to automatic review settings February 12, 2026 18:08

Copilot AI left a comment

Pull request overview

Copilot reviewed 34 out of 66 changed files in this pull request and generated 2 comments.

Comment on lines +1320 to +1321
# Use timeout to prevent hanging; default to "unknown" on any failure
status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null) || status="unknown"

Copilot AI Feb 12, 2026

systemctl is-active returns a non-zero exit code for inactive/failed services even though it prints the status string. Because the assignment is followed by || status="unknown", any non-active state will get overwritten to unknown, so you'll never emit ...inactive/...failed task names as described in the PR. Capture the output without treating the exit code as a failure (e.g., ignore the exit status inside the command substitution, and only default to unknown when the output is empty or the command times out).

Suggested change
- # Use timeout to prevent hanging; default to "unknown" on any failure
- status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null) || status="unknown"
+ # Use timeout to prevent hanging; capture textual status even if exit code is non-zero.
+ # Default to "unknown" only if no status output is produced (e.g., timeout or hard failure).
+ status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null || true)
+ if [ -z "$status" ]; then
+     status="unknown"
+ fi
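
The difference between the two patterns can be reproduced without a GPU node. The sketch below is only an illustration: `fake_is_active` is a hypothetical stub (not part of the PR) that mimics `systemctl is-active` by printing a state and returning a non-zero exit code for anything other than "active". `status_original` and `status_suggested` are likewise made-up wrapper names around the two assignment styles discussed above.

```shell
# Hypothetical stand-in for `systemctl is-active`: prints the state (if any)
# and returns 0 only when the state is "active".
fake_is_active() {
    local state="$1"
    [ -n "$state" ] && echo "$state"
    [ "$state" = "active" ]
}

# Original pattern: any non-zero exit code overwrites the captured text.
status_original() {
    local status
    status=$(fake_is_active "$1" 2>/dev/null) || status="unknown"
    echo "$status"
}

# Suggested pattern: keep the printed text and fall back to "unknown"
# only when nothing was printed at all.
status_suggested() {
    local status
    status=$(fake_is_active "$1" 2>/dev/null || true)
    if [ -z "$status" ]; then
        status="unknown"
    fi
    echo "$status"
}

# Compare both patterns for a few simulated states (empty = no output).
for state in active inactive failed ""; do
    printf 'state=%-8s original=%-8s suggested=%s\n' \
        "${state:-<none>}" "$(status_original "$state")" "$(status_suggested "$state")"
done
```

With the stub, the original pattern reports "unknown" for both `inactive` and `failed`, while the suggested pattern preserves the printed state and reserves "unknown" for the case where no output is produced.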

Comment on lines +1332 to +1336
systemctl_output=$(timeout ${timeout_seconds} systemctl status "$service" --no-pager -l 2>&1) || systemctl_output="timeout or error getting status"
journal_output=$(timeout ${timeout_seconds} journalctl -u "$service" -n 50 --no-pager 2>&1) || journal_output="timeout or error getting journal"

# Log with status in task name and diagnostic info echoed
logs_to_events "AKS.CSE.verify.${service}.${status}" "echo service $service is $status. systemctl: ${systemctl_output}. journal: ${journal_output}"

Copilot AI Feb 12, 2026

The logs_to_events payload here inlines full systemctl status and journalctl output into the event Message. Those outputs can be very large and contain lots of punctuation/newlines; combined with logs_to_events using unquoted ${@} / $*, this risks excessive word-splitting/globbing, very large JSON event files, and potentially truncated/failed ingestion. Consider truncating/sanitizing the captured output (e.g., replace newlines with spaces, cap to a few KB / last N lines), or write diagnostics to a file and log only the file path plus a short summary in the event.
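
One possible shape for the truncation/sanitization idea is sketched below. `sanitize_diag` is a hypothetical helper name, and the 2048-byte default cap is an arbitrary assumption, not a limit taken from the events pipeline. The helper flattens newlines and tabs to single spaces and keeps only the tail of the output, on the assumption that the most recent lines are the most useful.

```shell
# Hypothetical helper (not in the PR): flatten multi-line diagnostic output
# into one line and keep at most max_bytes bytes from the end.
# Note: `tail -c` counts bytes, so a multi-byte character at the cut point
# may be split.
sanitize_diag() {
    local max_bytes="${2:-2048}"   # assumed default cap; tune as needed
    printf '%s' "$1" \
        | tr '\n\t' '  ' \
        | tr -s ' ' \
        | tail -c "$max_bytes"
}

# Example with a small multi-line sample.
sample=$(printf 'line one\nline two\n\tindented line three')
sanitize_diag "$sample"
echo
sanitize_diag "$sample" 10
echo
```

In the function under review, the captured `systemctl_output` and `journal_output` could be passed through such a helper before being interpolated into the `logs_to_events` message. This only bounds size and removes line breaks; it does not by itself address the unquoted `${@}` / `$*` expansion inside `logs_to_events`.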
