[improvement] : remove driver upgrade label from nodes#2123
[improvement] : remove driver upgrade label from nodes#2123rahulait wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds functionality to clean up stale upgrade labels from nodes that no longer have driver pods scheduled on them. This addresses the scenario where nodeSelector changes cause driver pods to be removed from certain nodes, but the upgrade state labels remain on those nodes.
Changes:
- Added
clearUpgradeLabelsWhereDriverNotRunningfunction to remove upgrade labels from nodes without driver pods - Integrated the cleanup function into the main Reconcile loop as a best-effort operation
- The cleanup skips nodes actively being managed by the upgrade process to avoid interference
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
530ced1 to
f5b193c
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f5b193c to
604eaa6
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
0d17a35 to
b2f136d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
b2f136d to
f25f905
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f25f905 to
aefe980
Compare
this commit removes driver upgrade label from nodes which don't have any driver pod running on them Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
aefe980 to
a288d8c
Compare
|
/ok to test a288d8c |
|
|
||
| baseClient := fake.NewClientBuilder(). | ||
| WithScheme(scheme). | ||
| WithObjects(objs...). |
There was a problem hiding this comment.
Consider adding an interceptor here to track patched nodes instead of a new patchTrackingClient.
patchedNodes := map[string]*corev1.Node{}
baseClient := fake.NewClientBuilder().
WithScheme(scheme).
WithObjects(objs...).
WithInterceptorFuncs(interceptor.Funcs{
Patch: func(ctx context.Context, client client.WithWatch, obj client.Object, patch client.Patch, opts ...client.PatchOption) error {
if node, ok := obj.(*corev1.Node); ok {
patchedNodes[node.Name] = node.DeepCopy()
}
return client.Patch(ctx, obj, patch, opts...)
}}).
Build()| _ = corev1.AddToScheme(scheme) | ||
|
|
||
| // Create a client that will fail on List operations | ||
| fakeClient := &errorClient{ |
There was a problem hiding this comment.
Same here. Build a regular fake client with an interceptor that returns an error on the operations you want. But my preference is to remove this test case. This is just testing the function returns an error when the client can't list objects; doesn't seem valuable to me.
this commit removes driver upgrade label from nodes which don't have any driver pod running on them
Description
This PR adds automatic cleanup of stale upgrade labels from nodes where GPU driver pods are no longer scheduled, addressing the edge case where
nodeSelectorchanges cause pods to terminate but leavenvidia.com/gpu-driver-upgrade-statelabels behind.Problem
When a
NVIDIADriverCR'snodeSelectoris updated:nvidia.com/gpu-driver-upgrade-state) remain on those nodes indefinitelyExample scenario:
Nodes without
nvidia.com/gpu-type: "A100"lose their driver pods but keep upgrade labels.Solution
Added
clearUpgradeLabelsWhereDriverNotRunning()function to the upgrade controller that:Checklist
make lint)make validate-generated-assets)make validate-modules)Testing