chore(e2e): trying new listWatch logic for WaitNodeReady by djsly · Pull Request #7826 · Azure/AgentBaker

djsly · 2026-02-07T01:12:15Z

What this PR does / why we need it:
Try to augment the waitNodeReady logic to reconnect when the connection to the apiserver breaks.
Which issue(s) this PR fixes:

Fixes #

Copilot

Pull request overview

This PR updates the e2e kube client’s WaitUntilNodeReady logic to be more resilient to apiserver/watch disconnects by switching from a single long-running watch to a List+Watch retry loop.

Changes:

Added a small helper (isNodeReady) to centralize node match/ready evaluation.
Reworked WaitUntilNodeReady to repeatedly List nodes and then Watch from the list’s resourceVersion, restarting on watch closure.
Removed testify/require usage from this file.
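
As a rough illustration of the first change, a helper like isNodeReady could look something like the sketch below. This is not the PR's code: Node and Condition are simplified stand-ins for corev1.Node and corev1.NodeCondition, and matching by name prefix against a VMSS name is an assumption about what "node match" means here.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition and Node are simplified stand-ins for corev1.NodeCondition and
// corev1.Node. They are assumptions for illustration, not the PR's types.
type Condition struct {
	Type   string
	Status string
}

type Node struct {
	Name       string
	Conditions []Condition
}

// isNodeReady reports whether node matches the expected name prefix
// (an assumed matching rule) and has a Ready condition with status "True".
func isNodeReady(node Node, vmssName string) bool {
	if !strings.HasPrefix(node.Name, vmssName) {
		return false
	}
	for _, c := range node.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	// No Ready condition reported yet.
	return false
}

func main() {
	nodes := []Node{
		{Name: "other-000000", Conditions: []Condition{{Type: "Ready", Status: "True"}}},
		{Name: "vmss1-000000", Conditions: []Condition{{Type: "Ready", Status: "False"}}},
		{Name: "vmss1-000001", Conditions: []Condition{{Type: "Ready", Status: "True"}}},
	}
	for _, n := range nodes {
		fmt.Println(n.Name, isNodeReady(n, "vmss1"))
	}
}
```

Keeping the match and readiness check in one function means the List pass and the Watch pass can apply exactly the same rule to each node they see.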

Copilot · 2026-02-07T01:16:26Z

e2e/kube.go

-	"github.com/stretchr/testify/require"
 	appsv1 "k8s.io/api/apps/v1"
 	corev1 "k8s.io/api/core/v1"
 	v1 "k8s.io/api/core/v1"


The import block includes both corev1 and v1 aliases for the same package path (k8s.io/api/core/v1). Go accepts the same path under two different names, so this compiles, but the duplicate is redundant and linters commonly flag it. Drop one alias and update the few call sites accordingly (e.g., use corev1.Secret / corev1.SecretTypeDockerConfigJson / corev1.DockerConfigJsonKey).

Suggested change

-	v1 "k8s.io/api/core/v1"

Copilot · 2026-02-07T01:16:27Z

e2e/kube.go

+			case event, ok := <-watcher.ResultChan():
+				if !ok {
+					// Watch channel closed, restart the ListWatch
+					t.Logf("watch closed, restarting ListWatch")
+					break watchLoop
+				}


When watcher.ResultChan() is closed (ok == false), the code breaks out to restart the ListWatch but never calls watcher.Stop(). Even if the channel is closed, calling Stop() is the safe/idiomatic way to release the watch and avoid lingering goroutines/resources before retrying.
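
One way to guarantee that Stop() runs on every exit path is to move a single watch attempt into its own function and defer the call. The sketch below is only an illustration under stated assumptions: event, watcher, and fakeWatcher are hypothetical stand-ins for watch.Event and watch.Interface, and the loop uses range instead of the PR's select so that it stays short.

```go
package main

import "fmt"

// event and watcher are simplified stand-ins for watch.Event and
// watch.Interface. They are assumptions for illustration only.
type event struct{ ready bool }

type watcher interface {
	ResultChan() <-chan event
	Stop()
}

// fakeWatcher replays a fixed set of events and then closes its channel,
// imitating a watch that the apiserver terminates.
type fakeWatcher struct {
	ch      chan event
	stopped *int
}

func newFakeWatcher(events []event, stopped *int) *fakeWatcher {
	ch := make(chan event, len(events))
	for _, e := range events {
		ch <- e
	}
	close(ch)
	return &fakeWatcher{ch: ch, stopped: stopped}
}

func (f *fakeWatcher) ResultChan() <-chan event { return f.ch }
func (f *fakeWatcher) Stop()                    { *f.stopped++ }

// watchOnce consumes one watch until a ready event arrives or the channel
// closes. The deferred Stop runs on both return paths.
func watchOnce(w watcher) bool {
	defer w.Stop()
	for ev := range w.ResultChan() {
		if ev.ready {
			return true
		}
	}
	return false // channel closed: the caller should restart the ListWatch
}

func main() {
	stops := 0
	attempts := [][]event{
		{{ready: false}},                // watch closes before the node is ready
		{{ready: false}, {ready: true}}, // second watch sees the node become ready
	}
	for i, evs := range attempts {
		if watchOnce(newFakeWatcher(evs, &stops)) {
			fmt.Printf("ready after %d watch(es), Stop called %d time(s)\n", i+1, stops)
			return
		}
	}
}
```

With this shape the outer retry loop never has to remember to stop the previous watcher before creating the next one.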

Copilot · 2026-02-07T01:16:27Z

e2e/kube.go

+					// Watch channel closed, restart the ListWatch
+					t.Logf("watch closed, restarting ListWatch")


If the watch is closed immediately/repeatedly (e.g., apiserver timeouts, transient network issues), the outer loop will re-List and re-Watch in a tight loop with no delay on the ok == false path, which can hammer the apiserver. Add a small sleep/backoff when restarting after a closed watch (and consider exponential backoff / RetryWatcher).

Suggested change

-					// Watch channel closed, restart the ListWatch
-					t.Logf("watch closed, restarting ListWatch")
+					// Watch channel closed, restart the ListWatch after a short delay
+					t.Logf("watch closed, restarting ListWatch after short delay")
+					watcher.Stop()
+					time.Sleep(500 * time.Millisecond)

Devinwong

Approved. It's always good to make the API server connection less flaky.

r2k1 · 2026-02-07T19:48:25Z

e2e/kube.go

-		default:
-			t.Logf("skipping object type %T", event.Object)
+		// Step 1: List existing nodes to get current state and resourceVersion
+		nodeList, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})


I think the LIST+WATCH pattern was important for older versions, where WATCH didn't produce initial events.

There is a SendInitialEvents option that seems to be set to true by default.
I don't think you need LIST.

trying new listWatch logic for WaitNodeReady

ca12c8c

Copilot AI review requested due to automatic review settings February 7, 2026 01:12

djsly requested review from AbelHu, Devinwong, awesomenix, cameronmeissner, ganeshkumarashok, juan-lee, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, timmy-wright and zachary-bailey as code owners February 7, 2026 01:12

djsly temporarily deployed to test February 7, 2026 01:12 — with GitHub Actions Inactive

Copilot started reviewing on behalf of djsly February 7, 2026 01:12

Copilot AI reviewed Feb 7, 2026


Devinwong approved these changes Feb 7, 2026


r2k1 reviewed Feb 7, 2026

djsly wants to merge 1 commit into main from djsly/reconnect-watch-break


3 participants
