chore(e2e): trying new listWatch logic for WaitNodeReady #7826
Conversation
Pull request overview
This PR updates the e2e kube client’s WaitUntilNodeReady logic to be more resilient to apiserver/watch disconnects by switching from a single long-running watch to a List+Watch retry loop.
Changes:
- Added a small helper (isNodeReady) to centralize node match/ready evaluation; a sketch follows below.
- Reworked WaitUntilNodeReady to repeatedly List nodes and then Watch from the list's resourceVersion, restarting on watch closure.
- Removed testify/require usage from this file.
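A minimal sketch of what the isNodeReady helper might look like, assuming it matches on a substring of the node name and checks the NodeReady condition; the signature and matching rule are guesses from the description, not the actual code:

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isNodeReady reports whether the node matches the target name and has
// the NodeReady condition set to True.
func isNodeReady(node *corev1.Node, targetName string) bool {
	if !strings.Contains(node.Name, targetName) {
		return false
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}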
| "github.com/stretchr/testify/require" | ||
| appsv1 "k8s.io/api/apps/v1" | ||
| corev1 "k8s.io/api/core/v1" | ||
| v1 "k8s.io/api/core/v1" |
The import block includes both corev1 and v1 aliases for the same package path (k8s.io/api/core/v1). Go does accept importing one path under two names, but the duplicate is redundant and linters will flag it. Drop one alias (the suggested change removes the v1 "k8s.io/api/core/v1" line) and update the few call sites accordingly (e.g., use corev1.Secret / corev1.SecretTypeDockerConfigJson / corev1.DockerConfigJsonKey).
case event, ok := <-watcher.ResultChan():
	if !ok {
		// Watch channel closed, restart the ListWatch
		t.Logf("watch closed, restarting ListWatch")
		break watchLoop
	}
When watcher.ResultChan() is closed (ok == false), the code breaks out to restart the ListWatch but never calls watcher.Stop(). Even if the channel is closed, calling Stop() is the safe/idiomatic way to release the watch and avoid lingering goroutines/resources before retrying.
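One way to guarantee Stop() runs on every exit path is to wrap each List+Watch attempt in a closure with a deferred Stop(). A sketch, assuming the names k, nodeList, targetName, and isNodeReady from the surrounding code, with the outer loop re-Listing whenever the closure returns false with no error:

ready, err := func() (bool, error) {
	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{
		ResourceVersion: nodeList.ResourceVersion,
	})
	if err != nil {
		return false, err
	}
	defer watcher.Stop() // runs even when ResultChan is closed by the server

	for {
		select {
		case <-ctx.Done():
			return false, ctx.Err()
		case event, ok := <-watcher.ResultChan():
			if !ok {
				return false, nil // watch closed: caller re-Lists and re-Watches
			}
			if node, isNode := event.Object.(*corev1.Node); isNode && isNodeReady(node, targetName) {
				return true, nil
			}
		}
	}
}()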
// Watch channel closed, restart the ListWatch
t.Logf("watch closed, restarting ListWatch")
If the watch is closed immediately/repeatedly (e.g., apiserver timeouts, transient network issues), the outer loop will re-List and re-Watch in a tight loop with no delay on the ok == false path, which can hammer the apiserver. Add a small sleep/backoff when restarting after a closed watch (and consider exponential backoff / RetryWatcher).
Suggested change:
- // Watch channel closed, restart the ListWatch
- t.Logf("watch closed, restarting ListWatch")
+ // Watch channel closed, restart the ListWatch after a short delay
+ t.Logf("watch closed, restarting ListWatch after short delay")
+ watcher.Stop()
+ time.Sleep(500 * time.Millisecond)
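Alternatively, client-go's RetryWatcher already encapsulates restarting the watch from the last observed resourceVersion when the connection drops. A sketch, assuming nodeList comes from the prior List call:

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

lw := &cache.ListWatch{
	WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
		return k.Typed.CoreV1().Nodes().Watch(ctx, opts)
	},
}
rw, err := watchtools.NewRetryWatcher(nodeList.ResourceVersion, lw)
if err != nil {
	return err
}
defer rw.Stop()

for event := range rw.ResultChan() {
	// RetryWatcher reconnects on dropped watches, resuming from the
	// last delivered resourceVersion.
	_ = event // ... evaluate with isNodeReady ...
}

Note that RetryWatcher still surfaces a 410 Gone (expired resourceVersion) as an error event and terminates, so a full re-List fallback is needed for that case.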
Devinwong left a comment
Approved. It's always good to make the apiserver connection less flaky.
default:
	t.Logf("skipping object type %T", event.Object)

// Step 1: List existing nodes to get current state and resourceVersion
nodeList, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
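Filling in the surrounding steps, the List-then-Watch flow described above could look roughly like this; variable names such as targetName are placeholders:

nodeList, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
	return fmt.Errorf("listing nodes: %w", err)
}
// A node may already be Ready in the listed snapshot.
for i := range nodeList.Items {
	if isNodeReady(&nodeList.Items[i], targetName) {
		return nil
	}
}
// Step 2: Watch for changes newer than the snapshot.
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{
	ResourceVersion: nodeList.ResourceVersion,
})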
I think the LIST+WATCH pattern was important for older versions, where WATCH didn't produce initial events.
There is SendInitialEvents, which seems to be set to true by default.
I don't think you need the LIST.
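For reference, a watch that streams the initial state without a separate List might look like this; SendInitialEvents relies on the apiserver's WatchList feature, and whether it is available or on by default depends on the cluster version and feature gate, so treat this as a sketch:

sendInitial := true
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{
	// Ask the apiserver to stream synthetic ADDED events for the
	// current state before switching to live updates.
	SendInitialEvents:    &sendInitial,
	ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
	AllowWatchBookmarks:  true,
})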
What this PR does / why we need it:
Augment the waitNodeReady logic so it reconnects when the connection to the apiserver breaks.
Which issue(s) this PR fixes:
Fixes #