chore: add diagnostic info for nvidia-device-plugin, dcgm is active or not#7858
Pull request overview
This PR adds diagnostic logging for NVIDIA managed experience services (nvidia-device-plugin, nvidia-dcgm, nvidia-dcgm-exporter) to help engineers troubleshoot GPU node issues. After starting each service, the new verifyNvidiaServiceAndLog function checks if each service is active and logs the status to the events pipeline. If a service is not active, it collects diagnostic information (systemctl status + journal logs) which will be available in Kusto logs for debugging support tickets.
Changes:
- Added `verifyNvidiaServiceAndLog()` function to verify and log NVIDIA service status
- Added three verification calls after starting NVIDIA services in `startNvidiaManagedExpServices()`
…lude cse_config.sh
…ging The previous commit added the verifyNvidiaServiceAndLog function to cse_config.sh but the testdata snapshots were not properly regenerated.
# Use timeout to prevent hanging; default to "unknown" on any failure
status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null) || status="unknown"
systemctl is-active returns a non-zero exit code for inactive/failed services even though it prints the status string. Because the assignment is followed by || status="unknown", any non-active state will get overwritten to unknown, so you'll never emit ...inactive/...failed task names as described in the PR. Capture the output without treating the exit code as a failure (e.g., ignore the exit status inside the command substitution, and only default to unknown when the output is empty or the command times out).
Suggested change:
- # Use timeout to prevent hanging; default to "unknown" on any failure
- status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null) || status="unknown"
+ # Use timeout to prevent hanging; capture textual status even if exit code is non-zero.
+ # Default to "unknown" only if no status output is produced (e.g., timeout or hard failure).
+ status=$(timeout ${timeout_seconds} systemctl is-active "$service" 2>/dev/null || true)
+ if [ -z "$status" ]; then
+     status="unknown"
+ fi
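The difference between the two patterns can be reproduced without systemd. The sketch below is illustrative only: `fake_is_active` is a hypothetical stand-in that, like `systemctl is-active`, prints the state and exits non-zero for anything other than `active`. The two wrapper function names are also made up for this example and do not appear in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `systemctl is-active` (assumption for this demo):
# prints the state it is given and exits non-zero unless that state is "active".
fake_is_active() {
  echo "$1"
  [ "$1" = "active" ]
}

# Pattern from the original diff: a non-zero exit code replaces the printed
# state with "unknown", so "inactive" and "failed" are lost.
status_with_fallback() {
  local status
  status=$(fake_is_active "$1" 2>/dev/null) || status="unknown"
  echo "$status"
}

# Pattern from the suggestion: ignore the exit code inside the command
# substitution and fall back to "unknown" only when nothing was printed.
status_preserving_output() {
  local status
  status=$(fake_is_active "$1" 2>/dev/null || true)
  if [ -z "$status" ]; then
    status="unknown"
  fi
  echo "$status"
}

status_with_fallback inactive       # prints "unknown"
status_preserving_output inactive   # prints "inactive"
```

With the second pattern, a timeout still yields `unknown`, because a command killed by `timeout` produces no status string.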
systemctl_output=$(timeout ${timeout_seconds} systemctl status "$service" --no-pager -l 2>&1) || systemctl_output="timeout or error getting status"
journal_output=$(timeout ${timeout_seconds} journalctl -u "$service" -n 50 --no-pager 2>&1) || journal_output="timeout or error getting journal"

# Log with status in task name and diagnostic info echoed
logs_to_events "AKS.CSE.verify.${service}.${status}" "echo service $service is $status. systemctl: ${systemctl_output}. journal: ${journal_output}"
The logs_to_events payload here inlines full systemctl status and journalctl output into the event Message. Those outputs can be very large and contain lots of punctuation/newlines; combined with logs_to_events using unquoted ${@} / $*, this risks excessive word-splitting/globbing, very large JSON event files, and potentially truncated/failed ingestion. Consider truncating/sanitizing the captured output (e.g., replace newlines with spaces, cap to a few KB / last N lines), or write diagnostics to a file and log only the file path plus a short summary in the event.
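One way to act on this suggestion is to flatten and cap the captured text before it goes into the event message. The following is a minimal sketch, not code from the PR: `sanitize_diag` is a hypothetical helper, and its defaults (last 20 lines, 2048 characters) are arbitrary assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): keep only the last N lines of a
# diagnostic string, join them into a single line, and cap the length.
#   $1 = text, $2 = max lines kept from the end, $3 = max characters
sanitize_diag() {
  local text max_lines="${2:-20}" max_chars="${3:-2048}"
  # Keep the tail of the output and turn newlines into spaces.
  text=$(printf '%s\n' "$1" | tail -n "$max_lines" | tr '\n' ' ')
  # Drop the single trailing space left by the final newline.
  text="${text% }"
  # Cap the result; bash substring expansion counts characters.
  printf '%s' "${text:0:max_chars}"
}

sample=$'line one\nline two\nline three'
sanitize_diag "$sample" 2 100   # prints "line two line three"
```

The result is a single bounded line, which would keep the event payload small. Punctuation and glob characters are left untouched, so the caller would still need to quote the value wherever it is expanded.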
What this PR does / why we need it:
We observed that some nodes have these three systemd services installed but not active. This PR adds diagnostic logging so engineers can check each service's actual status in the Kusto logs and get some insight ahead of any support ticket.
CSE logs will show:
AKS.CSE.verify.nvidia-device-plugin.active Completed: echo service nvidia-device-plugin is active
AKS.CSE.verify.nvidia-dcgm.active Completed: echo service nvidia-dcgm is active
AKS.CSE.verify.nvidia-dcgm-exporter.active Completed: echo service nvidia-dcgm-exporter is active
Or if not active:
AKS.CSE.verify.nvidia-device-plugin.inactive Completed: echo service nvidia-device-plugin is inactive. systemctl: . journal:
Which issue(s) this PR fixes:
Fixes #