
feat(planner): add replace-pod task to NodeUpdate plan (closes #211)#216

Merged
bdchatham merged 8 commits into main from feat/replace-pod-task
May 9, 2026

Conversation


@bdchatham bdchatham commented May 8, 2026

Closes #211.

Summary

K8s native StatefulSet RollingUpdate refuses to delete pods that are not Ready, which deadlocks the rollout when seid is intentionally unready — the canonical case is halted at a chain upgrade height awaiting a binary swap. The NodeUpdate plan would loop forever in observe-image until a human ran kubectl delete pod. We hit this on 2026-05-08 during pacific-1's archive-0 v6.3.1 → v6.4.3 swap; ~9000 reconcile cycles before manual unstick.

This PR inserts a new replace-pod task between apply-service and observe-image in the NodeUpdate plan:

apply-statefulset → apply-service → replace-pod → observe-image → mark-ready

replace-pod:

  • Reads the StatefulSet, identifies its Status.UpdateRevision
  • Lists pods owned by the StatefulSet (via Selector.MatchLabels)
  • Deletes any pod whose controller-revision-hash label != UpdateRevision
  • Skips pods already terminating (deletionTimestamp set)
  • StatefulSet then recreates the pods at the new revision via its standard creation path — that path doesn't gate on readiness, so the deadlock is broken

Idempotent: re-running on a converged StatefulSet (CurrentRevision == UpdateRevision) is a no-op completion. Missing StatefulSet or empty UpdateRevision returns transient wait so the executor retries on next reconcile.

Files

  • internal/task/replace_pod.go (new): task implementation
  • internal/task/replace_pod_test.go (new): 5 tests (stale pod / already-converged / terminating / missing STS / empty UpdateRevision)
  • internal/task/task.go: register TaskTypeReplacePod in the deserializer registry
  • internal/planner/planner.go: insert the task in the NodeUpdate plan and add its params case
  • internal/planner/node_update_test.go: update the expected task list (4 → 5 tasks)
  • CLAUDE.md: document the new task in the NodeUpdate description

Test plan

  • All new replace-pod tests pass (5/5)
  • Existing planner / task / controller tests pass after the plan-shape change
  • CI green
  • Empirical verification: next pacific-1 chain upgrade (or manual archive-0 image bump) triggers a NodeUpdate plan that completes without operator intervention. The pod recreates within ~30s of the apply-statefulset task completing, regardless of seid's readiness state at the moment of the upgrade halt.

References

🤖 Generated with Claude Code


Note

Medium Risk
Changes rollout semantics for all SeiNode StatefulSets by switching to OnDelete updates and introducing controller-driven pod deletions; mistakes could cause stuck updates or unintended pod restarts. Also expands RBAC to allow pod deletions and adds cache-bypassing reads that could mask eventual consistency issues if misused.

Overview
Fixes SeiNode image rollouts that can deadlock when pods are intentionally NotReady by inserting a new replace-pod step into the NodeUpdate plan (apply-statefulset → apply-service → replace-pod → observe-image → mark-ready).

To support controller-driven rollouts, generated StatefulSets now use OnDelete update strategy, the new replace-pod task deletes old-revision pods (single-replica only) to force recreation at the new revision, and observe-image (plus task execution wiring) now uses an uncached APIReader to read freshly-applied resources. RBAC is updated to grant pod delete, and tests/docs are updated accordingly.

Reviewed by Cursor Bugbot for commit aac07fa.

Note on rollback

If this PR is reverted in the controller, in-cluster StatefulSets will keep `updateStrategy: OnDelete` (server-side apply does not reset unset fields to defaults). The pre-PR controller has no replace-pod task, so image bumps would stall on halted/unready pods exactly as they did before this PR — recoverable, not wedging, but rollback does not auto-restore RollingUpdate. To fully revert, edit the StatefulSets manually:

```
kubectl patch sts <name> -n <ns> --type=json \
  -p '[{"op":"replace","path":"/spec/updateStrategy/type","value":"RollingUpdate"}]'
```

K8s native StatefulSet RollingUpdate refuses to delete pods that are
not Ready, which deadlocks the rollout when seid is intentionally
unready — the canonical case is halted at a chain upgrade height
awaiting a binary swap. The NodeUpdate plan would loop forever in
observe-image until a human ran `kubectl delete pod`.

Adds replace-pod task between apply-service and observe-image. It
lists pods owned by the StatefulSet, identifies any whose
controller-revision-hash != sts.Status.UpdateRevision, and deletes
them. StatefulSet then recreates the pods at the new revision via
its standard creation path, which doesn't gate on readiness.

Idempotent: pods at update revision are skipped, terminating pods
are skipped, missing StatefulSet returns transient retry.

Tested:
- Stale-revision pod present → deleted, task complete
- Already at update revision → no-op, task complete
- Terminating pod → skipped (not double-deleted)
- StatefulSet missing → transient wait
- UpdateRevision empty (status not yet populated) → transient wait

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham and others added 5 commits May 8, 2026 16:12
- RBAC: add `delete` verb on pods (was get;list;watch only — Delete call
  would fail with forbidden in production); regenerate role.yaml.
- ObservedGeneration gate: wait until STS controller has observed our
  spec before consulting CurrentRevision/UpdateRevision; without this,
  stale revisions could cause spurious deletes.
- Multi-replica safeguard: fail Terminal if replicas>1 (we'd violate
  reverse-ordinal RollingUpdate semantics; today every SeiNode is
  replicas=1).
- Owner-reference check: skip pods that match selector by label but
  aren't actually owned by the StatefulSet.
- Skip pods missing the controller-revision-hash label entirely
  (would otherwise be classified "stale" and deleted).
- Tests: add coverage for stale ObservedGeneration, missing hash label,
  non-owned pod, and multi-replica cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Trim Execute docstring + remove what-comments inside the function
  body (per repo guidance: comments only when WHY is non-obvious).
- Replace test literals "default" / "node-1" with package-local
  consts; drop unused UID/ChainID/Image to remove goconst hits the
  PR introduced.
- Drop unused string param from podForReplace; replace ptrBool
  helper with inline bool var (modernize).
- gofmt: spaces around concat operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Kubernetes-specialist review: with the previous default RollingUpdate
strategy, replace-pod was working *around* the STS controller's
"don't disturb unready pods" guardrail. The idiomatic K8s pattern for
"controller manages pod lifecycle, StatefulSet manages identity" is
OnDelete.

Sets `Spec.UpdateStrategy.Type = OnDeleteStatefulSetStrategyType` in
the generated StatefulSet. With OnDelete:

- Pod replacement is *always* explicit, via replace-pod
- No race with STS rolling logic — STS won't touch pods on its own
- replace-pod becomes the natural counterpart to the spec change,
  not an escape hatch around a guardrail

Behavior is unchanged on first apply (no template change → no rollout).
Existing pods continue running until next image bump triggers
replace-pod, exactly as today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham force-pushed the feat/replace-pod-task branch from 62ba709 to 26b3ee8 Compare May 8, 2026 23:47
…ses Bugbot race)

Tasks running in the same reconcile after apply-statefulset (which does
SSA on the StatefulSet) were reading from the controller-runtime cached
client, which lags etcd by milliseconds-to-unbounded. The freshness gate
on (ObservedGeneration < Generation) doesn't fire when the entire cached
object reflects pre-apply state — both fields equal at the old value.

Result: replace-pod sees CurrentRevision == UpdateRevision (both old),
marks Complete with no deletion. observe-image sees UpdatedReplicas ==
Replicas (both pre-update) and stamps currentImage on the SeiNode. Plan
completes green, pod stays on old image. Worse than pre-PR.

Adds APIReader (uncached, direct-to-API-server) to ExecutionConfig.
replace-pod and observe-image use APIReader for their StatefulSet Get;
all other reads stay on the cached client. APIReader wired in
cmd/main.go from mgr.GetAPIReader(). Test fixtures pass the same fake
client for both fields (it satisfies both interfaces).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 6ed60e5.

@bdchatham bdchatham merged commit e99c04c into main May 9, 2026
3 checks passed
bdchatham added a commit that referenced this pull request May 9, 2026
…e rollouts) (#217)

Pulls in #216 — NodeUpdate plans now include a `replace-pod` task that
proactively deletes pods at the old StatefulSet revision. StatefulSets
generated by this controller now use OnDelete update strategy, so pod
lifecycle is the SeiNode controller's responsibility (replace-pod). This
unblocks the chain-upgrade rollout pattern where seid is intentionally
unready (halted at upgrade height) and native RollingUpdate refuses to
disturb it.

Closes #211.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Successfully merging this pull request may close these issues.

NodeUpdate plan stalls forever when existing pod is unready (chain-upgrade halt)