Skip to content

replace-pod / observe-image: envtest cache-lag coverage + NotFound→Terminal escalation #218

@bdchatham

Description

@bdchatham

Two non-blocking follow-ups from the final cross-review of #216 (replace-pod task). Both are correctness/observability hardening on the same pair of tasks.

1. envtest harness for cache-lag regression coverage

Background. #216 fixed a same-reconcile read-after-write race where replace-pod and observe-image read the StatefulSet through the cached client (mgr.GetClient()) immediately after apply-statefulset SSA-wrote it. The cache-lag could return pre-apply state and cause replace-pod to no-op (deadlocking the chain-upgrade rollout). The fix plumbed mgr.GetAPIReader() through ExecutionConfig and switched both task reads to APIReader.Get.

Gap. Unit tests use sigs.k8s.io/controller-runtime/pkg/client/fake, which is synchronous and has no informer cache. So the unit tests cannot distinguish KubeClient.Get from APIReader.Get — a future regression that flips one back to KubeClient would pass the suite.

Ask. Add an envtest harness that:

  • Spins up a real apiserver + informer cache (controller-runtime's envtest package).
  • Drives apply-statefulset followed immediately by replace-pod / observe-image in the same reconcile pass, before the watch event has had a chance to update the cache.
  • Asserts the second task observes the post-apply spec (i.e., cache-lag does not produce false-completion).

This is the only way to lock in the fix at the test layer.

2. Escalate STS NotFound to Terminal in replace-pod and observe-image

Current behavior. Both tasks treat apierrors.IsNotFound on the StatefulSet Get as a transient wait (return nil; caller re-runs). Rationale was that apply-statefulset precedes them in the plan, so a NotFound is just informer-cache lag and the next reconcile will see it.

Concern. With APIReader plumbed in (#216), the read is strongly consistent. A NotFound now genuinely means the StatefulSet is missing — not a cache artifact. The most plausible explanation is external deletion mid-rollout (operator action, GC, namespace teardown). In that case the plan will spin forever returning ExecutionRunning, with no signal to the user.

Ask. Switch both NotFound paths to Terminal(fmt.Errorf(\"statefulset %q missing during rollout (apply-statefulset preceded us); external deletion?\", name)). The planner already surfaces Terminal errors as plan failure with a Condition; this turns a silent stall into an actionable surface.

Out of scope. Init plans where the STS is created later in the same plan — this only affects NodeUpdate plans where apply-statefulset runs upstream of these two tasks.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions