[WIP] Enhance health check robustness and observability #1554
base: main

Conversation
Improve the device health check system to prevent blocking, enable graceful shutdown, and provide better error categorization. These changes address stability issues in production environments with multiple GPUs and bursty XID error scenarios. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add buffered channels (64), non-blocking writes, graceful shutdown, stats collection, and automatic device recovery detection (30s). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
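The buffered, non-blocking write pattern this commit describes can be illustrated with a short self-contained sketch. The channel size (64) mirrors the commit message; eventResult and the drop counter are illustrative names, not the PR's actual identifiers, and the diff quoted later in this review uses a context-aware send rather than a default case:

package main

import "fmt"

// eventResult is an illustrative stand-in for the PR's event payload type.
type eventResult struct{ xid uint64 }

func main() {
	// Buffered channel (64) absorbs bursts of XID events so the NVML wait
	// loop is never blocked by a slow consumer.
	eventChan := make(chan eventResult, 64)
	dropped := 0

	// Simulate a burst larger than the buffer.
	for i := 0; i < 100; i++ {
		// Non-blocking write: if the buffer is full, drop and count the
		// event rather than stalling the producer goroutine.
		select {
		case eventChan <- eventResult{xid: uint64(i)}:
		default:
			dropped++
		}
	}
	fmt.Printf("buffered=%d dropped=%d\n", len(eventChan), dropped)
}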
Encapsulate health checking state into dedicated struct to improve modularity and testability. This struct groups related data (device maps, XID filtering, stats) and will enable focused methods for device registration and event monitoring. No behavior changes - struct is defined but not yet used. Inspired by elezar/refactor-health approach. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Separate device registration logic into a focused method on nvmlHealthProvider. This improves testability by allowing device registration to be tested independently from the event monitoring loop. The method handles:
- Getting device handles by UUID
- Checking supported event types
- Registering events with the event set
- Marking devices unhealthy on registration failures
Inspired by elezar/refactor-health (a6a9f18). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
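A sketch of those four steps against the go-nvml bindings; the method signature, the deviceUUIDs field, and the markUnhealthy callback are assumptions for illustration, not the PR's actual code:

package health

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// nvmlHealthProvider is trimmed to the one field this sketch needs; the
// real struct in the PR also carries device maps, XID filtering, and stats.
type nvmlHealthProvider struct {
	deviceUUIDs []string
}

// registerDeviceEvents walks the four steps listed in the commit message.
func (h *nvmlHealthProvider) registerDeviceEvents(eventSet nvml.EventSet, markUnhealthy func(uuid string)) {
	for _, uuid := range h.deviceUUIDs {
		// 1. Get the device handle by UUID.
		device, ret := nvml.DeviceGetHandleByUUID(uuid)
		if ret != nvml.SUCCESS {
			markUnhealthy(uuid) // 4. failure here means the device cannot be monitored
			continue
		}
		// 2. Check which event types the device supports.
		supported, ret := device.GetSupportedEventTypes()
		if ret != nvml.SUCCESS {
			markUnhealthy(uuid)
			continue
		}
		// 3. Register the supported critical-XID events with the event set.
		if ret := device.RegisterEvents(supported&nvml.EventTypeXidCriticalError, eventSet); ret != nvml.SUCCESS {
			markUnhealthy(uuid)
		}
	}
}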
Separate the event monitoring loop into a focused method on nvmlHealthProvider. This preserves all robustness features:
- Context-based shutdown coordination
- Buffered event channel with goroutine receiver
- Granular error handling via callback
- Stats tracking for observability
- XID filtering
- MIG device support
The method is now testable independently from NVML initialization and device registration. Error handling is injected as a callback to maintain flexibility. Inspired by elezar/refactor-health (a6a9f18). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
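The loop this commit describes might look roughly like the following, assuming the go-nvml bindings; the 5-second wait interval, channel size, and callback shape are assumptions, though the eventResult send and the ERROR_TIMEOUT check match snippets quoted later in this review:

package health

import (
	"context"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// eventResult pairs an NVML event with its return code; the name matches
// the diff quoted below, the rest of this file is an assumption.
type eventResult struct {
	event nvml.EventData
	ret   nvml.Return
}

// monitorEvents runs a producer goroutine that blocks in eventSet.Wait and
// a consumer that applies an injected error callback, both of which honour
// context cancellation for shutdown coordination.
func monitorEvents(ctx context.Context, eventSet nvml.EventSet, onError func(eventResult)) {
	eventChan := make(chan eventResult, 64)

	go func() {
		for {
			e, ret := eventSet.Wait(5000) // bounded wait keeps ctx checks timely
			select {
			case <-ctx.Done():
				return
			case eventChan <- eventResult{event: e, ret: ret}:
			}
		}
	}()

	for {
		select {
		case <-ctx.Done():
			return
		case result := <-eventChan:
			if result.ret == nvml.ERROR_TIMEOUT {
				continue // no event arrived within the wait window
			}
			onError(result)
		}
	}
}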
Improve documentation of checkHealth to clarify its role as the main orchestrator that coordinates:
- NVML initialization and resource management
- Device placement mapping (MIG support)
- Health provider creation and configuration
- Event registration and monitoring
- Shutdown coordination and stats reporting
The function is now much more readable with clear delegation to focused methods. All functionality preserved. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
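Condensed into a sketch, that orchestration could read as follows; newNVMLHealthProvider, reportStats, the Device type, and the callback plumbing are illustrative assumptions, and the focused methods are stubbed here since they are sketched elsewhere in this thread:

package health

import (
	"context"
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// Device is an illustrative stand-in for the plugin's device type.
type Device struct{ ID string }

type nvmlHealthProvider struct{ devices []*Device }

func newNVMLHealthProvider(devices []*Device) *nvmlHealthProvider {
	// Device placement mapping (MIG), XID filter, and stats setup live here.
	return &nvmlHealthProvider{devices: devices}
}

func (h *nvmlHealthProvider) registerDeviceEvents(es nvml.EventSet, markUnhealthy func(*Device)) {
	// See the registration sketch earlier in this thread.
}

func (h *nvmlHealthProvider) monitorEvents(ctx context.Context, es nvml.EventSet, markUnhealthy func(*Device)) {
	// See the monitoring-loop sketch earlier in this thread.
}

func (h *nvmlHealthProvider) reportStats() {
	// Log counters: events received, dropped, devices marked unhealthy, etc.
}

func checkHealth(ctx context.Context, devices []*Device, unhealthy chan<- *Device) error {
	// NVML initialization and resource management.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	defer nvml.Shutdown()

	eventSet, ret := nvml.EventSetCreate()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("failed to create event set: %v", ret)
	}
	defer eventSet.Free()

	// Provider creation, then delegation to the focused methods; the
	// monitoring loop returns once ctx is cancelled (shutdown coordination).
	provider := newNVMLHealthProvider(devices)
	markUnhealthy := func(d *Device) { unhealthy <- d }
	provider.registerDeviceEvents(eventSet, markUnhealthy)
	provider.monitorEvents(ctx, eventSet, markUnhealthy)

	// Stats reporting on the way out.
	provider.reportStats()
	return nil
}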
Add comprehensive unit test coverage for XID filtering.
Test Coverage:
- XID parsing logic (newHealthCheckXIDs) - 10 test cases
- XID filtering with environment variables - 5 test cases
- Default ignored XIDs validation
- Environment variable override behavior
Key Features:
- Tests XID filtering (13, 31, 43, 45, 68, 109 filtered by default)
- Validates 'all' and 'xids' keywords
- Verifies enabled overrides disabled
- All tests pass with -race flag
Inspired by elezar/refactor-health (dab53b9). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
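A hedged sketch of the filtering behaviour those tests describe, using the default XIDs and environment variables named in this thread; the function name and parsing details are assumptions (the PR's actual parser is newHealthCheckXIDs, whose signature is not quoted here):

package health

import (
	"os"
	"strconv"
	"strings"
)

// defaultIgnoredXids mirrors the defaults listed in the commit message.
var defaultIgnoredXids = []uint64{13, 31, 43, 45, 68, 109}

// ignoredXids returns the set of XIDs to skip. A nil result means XID
// health-checking is disabled outright (the "all"/"xids" keywords).
func ignoredXids() map[uint64]bool {
	ignored := make(map[uint64]bool)
	for _, x := range defaultIgnoredXids {
		ignored[x] = true
	}
	for _, f := range strings.Split(os.Getenv("DP_DISABLE_HEALTHCHECKS"), ",") {
		switch f = strings.TrimSpace(f); f {
		case "all", "xids":
			return nil // disable XID-based health checks entirely
		default:
			if x, err := strconv.ParseUint(f, 10, 64); err == nil {
				ignored[x] = true // additionally ignore this XID
			}
		}
	}
	// "Enabled overrides disabled": XIDs named here are checked even if
	// they appear in the defaults or the disable list.
	for _, f := range strings.Split(os.Getenv("DP_ENABLE_HEALTHCHECKS"), ",") {
		if x, err := strconv.ParseUint(strings.TrimSpace(f), 10, 64); err == nil {
			delete(ignored, x)
		}
	}
	return ignored
}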
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
force-pushed from 4749eae to 79e665e
// CheckDeviceHealth performs a simple health check on a single device by
// verifying it can be accessed via NVML and responds to basic queries.
// This is used for recovery detection - if a previously unhealthy device
// passes this check, it's considered recovered. We intentionally keep this
// simple and don't try to classify XIDs as recoverable vs permanent - that's
// controlled via DP_DISABLE_HEALTHCHECKS / DP_ENABLE_HEALTHCHECKS env vars.
func (r *nvmlResourceManager) CheckDeviceHealth(d *Device) (bool, error) {
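The body is elided in the quote above; a minimal sketch of what the doc comment implies, assuming the go-nvml bindings and using the device-name query that the review comment below refers to:

package health

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// Device and nvmlResourceManager are trimmed stand-ins for the PR's types.
type Device struct{ ID string }

type nvmlResourceManager struct{}

func (r *nvmlResourceManager) CheckDeviceHealth(d *Device) (bool, error) {
	device, ret := nvml.DeviceGetHandleByUUID(d.ID)
	if ret != nvml.SUCCESS {
		return false, fmt.Errorf("failed to get device handle for %v: %v", d.ID, ret)
	}
	// If a basic query succeeds, the device is considered recovered.
	if _, ret := device.GetName(); ret != nvml.SUCCESS {
		return false, fmt.Errorf("device %v failed basic query: %v", d.ID, ret)
	}
	return true, nil
}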
I don't agree with this mechanism for transitioning the device back to healthy. This is an oversimplification and will lead to unhealthy devices being considered healthy.
For example, if a device becomes unhealthy due to repeated ECC memory errors, it is LIKELY that query functions such as getting the device name will continue to succeed, resulting in the device being marked as healthy when it actually needs a RESET.
Before we add this logic to the device plugin, let us properly define and agree upon how we are detecting health.
Furthermore, although XID-based health checking is a means to an end, our ideal state is that some other component decides whether a device is healthy and the device plugin responds to these signals. Defining the unhealthy -> healthy transition here goes against this premise.
	return &x
}

func TestTriggerDeviceListUpdate_Phase2(t *testing.T) {
As a matter of interest, what is Phase2? (Were these tests generated?)
// nvmlHealthProvider encapsulates the state and logic for NVML-based GPU
// health monitoring. This struct groups related data and provides focused
// methods for device registration and event monitoring.
type nvmlHealthProvider struct {
Question: Why is the refactoring done AFTER the functional changes in this PR?
	stats *healthCheckStats
}

// registerDeviceEvents registers NVML event handlers for all devices in the
How is this actually different from the changes proposed in a6a9f18?
	if result.ret == nvml.ERROR_TIMEOUT {
		continue
	}
Why do we even send the event in the case of a timeout?
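One answer the question implies, sketched as a fragment of the producer loop quoted in the next snippet: filter timeouts before the send so they never enter the channel at all (eventChan, eventResult, and the wait interval are assumptions):

	// Producer-side filtering: ERROR_TIMEOUT just means no event arrived
	// within the wait window, so there is nothing to forward.
	e, ret := eventSet.Wait(5000)
	if ret == nvml.ERROR_TIMEOUT {
		continue
	}
	select {
	case <-ctx.Done():
		return
	case eventChan <- eventResult{event: e, ret: ret}:
	}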
	// Try to send event result, but respect context cancellation
	select {
	case <-ctx.Done():
		return
	case eventChan <- eventResult{event: e, ret: ret}:
	}
This seems like the wrong way to try and ensure that the context has not been closed before sending to the event channel. What are we concerned about here? Is there a better way to ensure that this goroutine terminates when the context is cancelled and doesn't block permanently on the send?
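For reference only, one common idiom that guarantees the producer goroutine exits and the consumer never blocks after cancellation is to let the producer own and close the channel; this is a sketch of that idiom under the same assumed names as above, not a claim about what the PR should adopt:

	eventChan := make(chan eventResult, 64)

	go func() {
		defer close(eventChan) // producer owns the channel and closes it on exit
		for ctx.Err() == nil {
			e, ret := eventSet.Wait(5000) // bounded wait keeps cancellation checks timely
			select {
			case <-ctx.Done():
				return
			case eventChan <- eventResult{event: e, ret: ret}:
			}
		}
	}()

	// The consumer drains until the channel is closed, so neither side can
	// block permanently once the context is cancelled.
	for result := range eventChan {
		handleEvent(result) // handleEvent is a placeholder
	}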
The commit message mentions adding tests, but I only see code being removed here.