
feat(planner): add replace-pod task to NodeUpdate plan (closes #211)#216

Merged
bdchatham merged 8 commits into main from feat/replace-pod-task
May 9, 2026

Conversation


@bdchatham bdchatham commented May 8, 2026

Closes #211.

Summary

K8s native StatefulSet RollingUpdate refuses to delete pods that are not Ready, which deadlocks the rollout when seid is intentionally unready — the canonical case is halted at a chain upgrade height awaiting a binary swap. The NodeUpdate plan would loop forever in observe-image until a human ran kubectl delete pod. We hit this on 2026-05-08 during pacific-1's archive-0 v6.3.1 → v6.4.3 swap; ~9000 reconcile cycles before manual unstick.

This PR inserts a new replace-pod task between apply-service and observe-image in the NodeUpdate plan:

apply-statefulset → apply-service → replace-pod → observe-image → mark-ready

replace-pod:

  • Reads the StatefulSet, identifies its Status.UpdateRevision
  • Lists pods owned by the StatefulSet (via Selector.MatchLabels)
  • Deletes any pod whose controller-revision-hash label != UpdateRevision
  • Skips pods already terminating (deletionTimestamp set)
  • StatefulSet then recreates the pods at the new revision via its standard creation path — that path doesn't gate on readiness, so the deadlock is broken

Idempotent: re-running on a converged StatefulSet (CurrentRevision == UpdateRevision) is a no-op completion. Missing StatefulSet or empty UpdateRevision returns transient wait so the executor retries on next reconcile.

Files

  • internal/task/replace_pod.go (new): task implementation
  • internal/task/replace_pod_test.go (new): 5 tests (stale pod / already-converged / terminating / missing STS / empty UpdateRevision)
  • internal/task/task.go: register TaskTypeReplacePod in the deserializer registry
  • internal/planner/planner.go: insert the task in the NodeUpdate plan and add its params case
  • internal/planner/node_update_test.go: update the expected task list (4 → 5 tasks)
  • CLAUDE.md: document the new task in the NodeUpdate description

Test plan

  • All new replace-pod tests pass (5/5)
  • Existing planner / task / controller tests pass after the plan-shape change
  • CI green
  • Empirical verification: next pacific-1 chain upgrade (or manual archive-0 image bump) triggers a NodeUpdate plan that completes without operator intervention. The pod recreates within ~30s of the apply-statefulset task completing, regardless of seid's readiness state at the moment of the upgrade halt.

References

🤖 Generated with Claude Code


Note

Medium Risk
Changes rollout semantics for all SeiNode StatefulSets by switching to OnDelete updates and introducing controller-driven pod deletions; mistakes could cause stuck updates or unintended pod restarts. Also expands RBAC to allow pod deletions and adds cache-bypassing reads that could mask eventual consistency issues if misused.

Overview
Fixes SeiNode image rollouts that can deadlock when pods are intentionally NotReady by inserting a new replace-pod step into the NodeUpdate plan (apply-statefulset → apply-service → replace-pod → observe-image → mark-ready).

To support controller-driven rollouts, generated StatefulSets now use OnDelete update strategy, the new replace-pod task deletes old-revision pods (single-replica only) to force recreation at the new revision, and observe-image (plus task execution wiring) now uses an uncached APIReader to read freshly-applied resources. RBAC is updated to grant pod delete, and tests/docs are updated accordingly.

Reviewed by Cursor Bugbot for commit aac07fa.

Note on rollback

If this PR is reverted in the controller, in-cluster StatefulSets will keep `updateStrategy: OnDelete` (server-side apply does not reset unset fields to defaults). The pre-PR controller has no replace-pod task, so image bumps would stall on halted/unready pods exactly as they did before this PR — recoverable, not wedging, but rollback does not auto-restore RollingUpdate. To fully revert, edit the StatefulSets manually:

```
kubectl patch sts <name> -n <ns> --type=json \
  -p '[{"op":"replace","path":"/spec/updateStrategy/type","value":"RollingUpdate"}]'
```

K8s native StatefulSet RollingUpdate refuses to delete pods that are
not Ready, which deadlocks the rollout when seid is intentionally
unready — the canonical case is halted at a chain upgrade height
awaiting a binary swap. The NodeUpdate plan would loop forever in
observe-image until a human ran `kubectl delete pod`.

Adds replace-pod task between apply-service and observe-image. It
lists pods owned by the StatefulSet, identifies any whose
controller-revision-hash != sts.Status.UpdateRevision, and deletes
them. StatefulSet then recreates the pods at the new revision via
its standard creation path, which doesn't gate on readiness.

Idempotent: pods at update revision are skipped, terminating pods
are skipped, missing StatefulSet returns transient retry.

Tested:
- Stale-revision pod present → deleted, task complete
- Already at update revision → no-op, task complete
- Terminating pod → skipped (not double-deleted)
- StatefulSet missing → transient wait
- UpdateRevision empty (status not yet populated) → transient wait

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham and others added 5 commits May 8, 2026 16:12
- RBAC: add `delete` verb on pods (was get;list;watch only — Delete call
  would fail with forbidden in production); regenerate role.yaml.
- ObservedGeneration gate: wait until STS controller has observed our
  spec before consulting CurrentRevision/UpdateRevision; without this,
  stale revisions could cause spurious deletes.
- Multi-replica safeguard: fail Terminal if replicas>1 (we'd violate
  reverse-ordinal RollingUpdate semantics; today every SeiNode is
  replicas=1).
- Owner-reference check: skip pods that match selector by label but
  aren't actually owned by the StatefulSet.
- Skip pods missing the controller-revision-hash label entirely
  (would otherwise be classified "stale" and deleted).
- Tests: add coverage for stale ObservedGeneration, missing hash label,
  non-owned pod, and multi-replica cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Trim Execute docstring + remove what-comments inside the function
  body (per repo guidance: comments only when WHY is non-obvious).
- Replace test literals "default" / "node-1" with package-local
  consts; drop unused UID/ChainID/Image to remove goconst hits the
  PR introduced.
- Drop unused string param from podForReplace; replace ptrBool
  helper with inline bool var (modernize).
- gofmt: spaces around concat operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Kubernetes-specialist review: with the previous default RollingUpdate
strategy, replace-pod was working *around* the STS controller's
"don't disturb unready pods" guardrail. The idiomatic K8s pattern for
"controller manages pod lifecycle, StatefulSet manages identity" is
OnDelete.

Sets `Spec.UpdateStrategy.Type = OnDeleteStatefulSetStrategyType` in
the generated StatefulSet. With OnDelete:

- Pod replacement is *always* explicit, via replace-pod
- No race with STS rolling logic — STS won't touch pods on its own
- replace-pod becomes the natural counterpart to the spec change,
  not an escape hatch around a guardrail

Behavior is unchanged on first apply (no template change → no rollout).
Existing pods continue running until next image bump triggers
replace-pod, exactly as today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham force-pushed the feat/replace-pod-task branch from 62ba709 to 26b3ee8 Compare May 8, 2026 23:47
…ses Bugbot race)

Tasks running in the same reconcile after apply-statefulset (which does
SSA on the StatefulSet) were reading from the controller-runtime cached
client, which lags etcd by milliseconds-to-unbounded. The freshness gate
on (ObservedGeneration < Generation) doesn't fire when the entire cached
object reflects pre-apply state — both fields equal at the old value.

Result: replace-pod sees CurrentRevision == UpdateRevision (both old),
marks Complete with no deletion. observe-image sees UpdatedReplicas ==
Replicas (both pre-update) and stamps currentImage on the SeiNode. Plan
completes green, pod stays on old image. Worse than pre-PR.

Adds APIReader (uncached, direct-to-API-server) to ExecutionConfig.
replace-pod and observe-image use APIReader for their StatefulSet Get;
all other reads stay on the cached client. APIReader wired in
cmd/main.go from mgr.GetAPIReader(). Test fixtures pass the same fake
client for both fields (it satisfies both interfaces).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 6ed60e5.

@bdchatham bdchatham merged commit e99c04c into main May 9, 2026
3 checks passed
bdchatham added a commit that referenced this pull request May 9, 2026
…e rollouts) (#217)

Pulls in #216 — NodeUpdate plans now include a `replace-pod` task that
proactively deletes pods at the old StatefulSet revision. StatefulSets
generated by this controller now use OnDelete update strategy, so pod
lifecycle is the SeiNode controller's responsibility (replace-pod). This
unblocks the chain-upgrade rollout pattern where seid is intentionally
unready (halted at upgrade height) and native RollingUpdate refuses to
disturb it.

Closes #211.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Successfully merging this pull request may close these issues.

NodeUpdate plan stalls forever when existing pod is unready (chain-upgrade halt)