Two non-blocking follow-ups from the final cross-review of #216 (replace-pod task). Both are correctness/observability hardening on the same pair of tasks.
1. envtest harness for cache-lag regression coverage
Background. #216 fixed a same-reconcile read-after-write race where replace-pod and observe-image read the StatefulSet through the cached client (mgr.GetClient()) immediately after apply-statefulset SSA-wrote it. The cache-lag could return pre-apply state and cause replace-pod to no-op (deadlocking the chain-upgrade rollout). The fix plumbed mgr.GetAPIReader() through ExecutionConfig and switched both task reads to APIReader.Get.
Gap. Unit tests use sigs.k8s.io/controller-runtime/pkg/client/fake, which is synchronous and has no informer cache. So the unit tests cannot distinguish KubeClient.Get from APIReader.Get — a future regression that flips one back to KubeClient would pass the suite.
Ask. Add an envtest harness that:
- Spins up a real apiserver + informer cache (controller-runtime's
envtest package).
- Drives
apply-statefulset followed immediately by replace-pod / observe-image in the same reconcile pass, before the watch event has had a chance to update the cache.
- Asserts the second task observes the post-apply spec (i.e., cache-lag does not produce false-completion).
This is the only way to lock in the fix at the test layer.
2. Escalate STS NotFound to Terminal in replace-pod and observe-image
Current behavior. Both tasks treat apierrors.IsNotFound on the StatefulSet Get as a transient wait (return nil; caller re-runs). Rationale was that apply-statefulset precedes them in the plan, so a NotFound is just informer-cache lag and the next reconcile will see it.
Concern. With APIReader plumbed in (#216), the read is strongly consistent. A NotFound now genuinely means the StatefulSet is missing — not a cache artifact. The most plausible explanation is external deletion mid-rollout (operator action, GC, namespace teardown). In that case the plan will spin forever returning ExecutionRunning, with no signal to the user.
Ask. Switch both NotFound paths to Terminal(fmt.Errorf(\"statefulset %q missing during rollout (apply-statefulset preceded us); external deletion?\", name)). The planner already surfaces Terminal errors as plan failure with a Condition; this turns a silent stall into an actionable surface.
Out of scope. Init plans where the STS is created later in the same plan — this only affects NodeUpdate plans where apply-statefulset runs upstream of these two tasks.
References
Two non-blocking follow-ups from the final cross-review of #216 (replace-pod task). Both are correctness/observability hardening on the same pair of tasks.
1. envtest harness for cache-lag regression coverage
Background. #216 fixed a same-reconcile read-after-write race where
replace-podandobserve-imageread the StatefulSet through the cached client (mgr.GetClient()) immediately afterapply-statefulsetSSA-wrote it. The cache-lag could return pre-apply state and causereplace-podto no-op (deadlocking the chain-upgrade rollout). The fix plumbedmgr.GetAPIReader()throughExecutionConfigand switched both task reads toAPIReader.Get.Gap. Unit tests use
sigs.k8s.io/controller-runtime/pkg/client/fake, which is synchronous and has no informer cache. So the unit tests cannot distinguishKubeClient.GetfromAPIReader.Get— a future regression that flips one back toKubeClientwould pass the suite.Ask. Add an envtest harness that:
envtestpackage).apply-statefulsetfollowed immediately byreplace-pod/observe-imagein the same reconcile pass, before the watch event has had a chance to update the cache.This is the only way to lock in the fix at the test layer.
2. Escalate STS NotFound to Terminal in
replace-podandobserve-imageCurrent behavior. Both tasks treat
apierrors.IsNotFoundon the StatefulSetGetas a transient wait (return nil; caller re-runs). Rationale was thatapply-statefulsetprecedes them in the plan, so a NotFound is just informer-cache lag and the next reconcile will see it.Concern. With
APIReaderplumbed in (#216), the read is strongly consistent. A NotFound now genuinely means the StatefulSet is missing — not a cache artifact. The most plausible explanation is external deletion mid-rollout (operator action, GC, namespace teardown). In that case the plan will spin forever returningExecutionRunning, with no signal to the user.Ask. Switch both NotFound paths to
Terminal(fmt.Errorf(\"statefulset %q missing during rollout (apply-statefulset preceded us); external deletion?\", name)). The planner already surfaces Terminal errors as plan failure with a Condition; this turns a silent stall into an actionable surface.Out of scope. Init plans where the STS is created later in the same plan — this only affects NodeUpdate plans where
apply-statefulsetruns upstream of these two tasks.References