replace-pod / observe-image: envtest cache-lag coverage + NotFound→Terminal escalation

Two non-blocking follow-ups from the final cross-review of #216 (replace-pod task). Both are correctness/observability hardening on the same pair of tasks.

## 1. envtest harness for cache-lag regression coverage

**Background.** #216 fixed a same-reconcile read-after-write race where `replace-pod` and `observe-image` read the StatefulSet through the cached client (`mgr.GetClient()`) immediately after `apply-statefulset` SSA-wrote it. The cache-lag could return pre-apply state and cause `replace-pod` to no-op (deadlocking the chain-upgrade rollout). The fix plumbed `mgr.GetAPIReader()` through `ExecutionConfig` and switched both task reads to `APIReader.Get`.

**Gap.** Unit tests use `sigs.k8s.io/controller-runtime/pkg/client/fake`, which is synchronous and has no informer cache. So the unit tests cannot distinguish `KubeClient.Get` from `APIReader.Get` — a future regression that flips one back to `KubeClient` would pass the suite.

**Ask.** Add an envtest harness that:
- Spins up a real apiserver + informer cache (controller-runtime's `envtest` package).
- Drives `apply-statefulset` followed immediately by `replace-pod` / `observe-image` in the same reconcile pass, before the watch event has had a chance to update the cache.
- Asserts the second task observes the post-apply spec (i.e., cache-lag does not produce false-completion).

This is the only way to lock in the fix at the test layer.

## 2. Escalate STS NotFound to Terminal in `replace-pod` and `observe-image`

**Current behavior.** Both tasks treat `apierrors.IsNotFound` on the StatefulSet `Get` as a transient wait (return nil; caller re-runs). Rationale was that `apply-statefulset` precedes them in the plan, so a NotFound is just informer-cache lag and the next reconcile will see it.

**Concern.** With `APIReader` plumbed in (#216), the read is strongly consistent. A NotFound now genuinely means the StatefulSet is missing — not a cache artifact. The most plausible explanation is external deletion mid-rollout (operator action, GC, namespace teardown). In that case the plan will spin forever returning `ExecutionRunning`, with no signal to the user.

**Ask.** Switch both NotFound paths to `Terminal(fmt.Errorf(\"statefulset %q missing during rollout (apply-statefulset preceded us); external deletion?\", name))`. The planner already surfaces Terminal errors as plan failure with a Condition; this turns a silent stall into an actionable surface.

**Out of scope.** Init plans where the STS is created later in the same plan — this only affects NodeUpdate plans where `apply-statefulset` runs upstream of these two tasks.

## References

- PR: #216
- Bugbot finding (cache-lag): https://github.com/sei-protocol/sei-k8s-controller/pull/216 (review thread)
- Final cross-review: GO with these two as non-blocking follow-ups

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

replace-pod / observe-image: envtest cache-lag coverage + NotFound→Terminal escalation #218

1. envtest harness for cache-lag regression coverage

2. Escalate STS NotFound to Terminal in `replace-pod` and `observe-image`

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

replace-pod / observe-image: envtest cache-lag coverage + NotFound→Terminal escalation #218

Description

1. envtest harness for cache-lag regression coverage

2. Escalate STS NotFound to Terminal in replace-pod and observe-image

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

2. Escalate STS NotFound to Terminal in `replace-pod` and `observe-image`