Skip to content

chore(e2e): trying new listWatch logic for WaitNodeReady#7826

Open
djsly wants to merge 1 commit intomainfrom
djsly/reconnect-watch-break
Open

chore(e2e): trying new listWatch logic for WaitNodeReady#7826
djsly wants to merge 1 commit intomainfrom
djsly/reconnect-watch-break

Conversation

@djsly
Copy link
Collaborator

@djsly djsly commented Feb 7, 2026

What this PR does / why we need it:
Try to augment the waitNodeReady logic to reconnect when the connection to the apiserer breaks
Which issue(s) this PR fixes:

Fixes #

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the e2e kube client’s WaitUntilNodeReady logic to be more resilient to apiserver/watch disconnects by switching from a single long-running watch to a List+Watch retry loop.

Changes:

  • Added a small helper (isNodeReady) to centralize node match/ready evaluation.
  • Reworked WaitUntilNodeReady to repeatedly List nodes and then Watch from the list’s resourceVersion, restarting on watch closure.
  • Removed testify/require usage from this file.

"github.com/stretchr/testify/require"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
v1 "k8s.io/api/core/v1"
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import block includes both corev1 and v1 aliases for the same package path (k8s.io/api/core/v1), which will not compile in Go (duplicate import). Drop one alias and update the few call sites accordingly (e.g., use corev1.Secret / corev1.SecretTypeDockerConfigJson / corev1.DockerConfigJsonKey).

Suggested change
v1 "k8s.io/api/core/v1"

Copilot uses AI. Check for mistakes.
Comment on lines +219 to +224
case event, ok := <-watcher.ResultChan():
if !ok {
// Watch channel closed, restart the ListWatch
t.Logf("watch closed, restarting ListWatch")
break watchLoop
}
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When watcher.ResultChan() is closed (ok == false), the code breaks out to restart the ListWatch but never calls watcher.Stop(). Even if the channel is closed, calling Stop() is the safe/idiomatic way to release the watch and avoid lingering goroutines/resources before retrying.

Copilot uses AI. Check for mistakes.
Comment on lines +221 to +222
// Watch channel closed, restart the ListWatch
t.Logf("watch closed, restarting ListWatch")
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the watch is closed immediately/repeatedly (e.g., apiserver timeouts, transient network issues), the outer loop will re-List and re-Watch in a tight loop with no delay on the ok == false path, which can hammer the apiserver. Add a small sleep/backoff when restarting after a closed watch (and consider exponential backoff / RetryWatcher).

Suggested change
// Watch channel closed, restart the ListWatch
t.Logf("watch closed, restarting ListWatch")
// Watch channel closed, restart the ListWatch after a short delay
t.Logf("watch closed, restarting ListWatch after short delay")
watcher.Stop()
time.Sleep(500 * time.Millisecond)

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@Devinwong Devinwong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. It's always good when it's trying to make api server connection less flaky

default:
t.Logf("skipping object type %T", event.Object)
// Step 1: List existing nodes to get current state and resourceVersion
nodeList, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think LIST+WATCH pattern been important for older versions. Where WATCH didn't produce initial events.

There is SendInitialEvents that seems to be set to True by default.
I don't think you need LIST.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants