Detect spec drift on Running nodes for mid-life SigningKey patch (and future validator mode switch) #137

@bdchatham

Description

Background

PR #136 shipped validator.signingKey.secret.secretName for the single-shot validator deployment use case: deploy a SeiNode with SigningKey set at creation, and the controller mounts the Secret on the production StatefulSet pod.

What it explicitly does not support: patching SigningKey onto an already-Running validator. buildRunningPlan (internal/planner/planner.go:621-628) only detects image drift today, so a kubectl patch seinode --patch '{"spec":{"validator":{"signingKey":...}}}' against a Running node is a silent no-op — the pod doesn't restart, the Secret never mounts, and the validator never starts signing. Documented in LLD §11.

Why this matters

Primary use case — zero-downtime migration cutover

The single-shot deployment in PR #136 trades cutover downtime (bootstrap-Job sync + StatefulSet catch-up) for implementation simplicity. For arctic-1 testnet this is fine; for pacific-1 the downtime envelope may not be acceptable. The original LLD §8 design was a two-phase cutover:

  1. Deploy SeiNode without signingKey → bootstrap + sync as a non-signing observer (zero risk, no downtime on the old EC2 validator)
  2. Stop EC2 → wait M blocks → patch signingKey in → pod re-rolls with Secret mounted → seid signs

Phase 2 requires drift detection. Without it, the operator's choices are (a) accept the single-shot downtime, or (b) delete-and-recreate the SeiNode (which deletes the data PVC unless dataVolume.import is wired, costing the bootstrap+sync work).
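Once drift detection ships, phase 2 reduces to a single patch against the Running observer. A hypothetical invocation (the resource name and Secret name are illustrative, not from this repo):

```shell
# Phase 2: after stopping the EC2 validator and waiting M blocks,
# patch the signing key onto the Running observer node.
# "my-validator" and "validator-consensus-key" are placeholder names.
kubectl patch seinode my-validator --type merge \
  -p '{"spec":{"validator":{"signingKey":{"secret":{"secretName":"validator-consensus-key"}}}}}'
```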

Future use case — validator mode switch

A separate, larger ask: convert a Running fullNode SeiNode into a validator SeiNode by patching the spec (e.g., spec.fullNode → spec.validator with a signingKey). Same drift-detection mechanic, larger pod-spec surface (mode string in seid config, port configuration, probes). Out of scope for v1 of this issue, but the underlying machinery is identical — listing here so the design accommodates it additively.

Proposed mechanic

Reuse the existing NodeUpdate plan shape (buildNodeUpdatePlan at internal/planner/planner.go:656):

  1. Track applied state in SeiNodeStatus. New field — likely Status.SigningKeyMountedSecret string — stamped by observe-image (or a sibling task) when the StatefulSet rollout completes. For mode-switch, an analogous Status.ObservedMode would track the currently-running mode.
  2. Extend buildRunningPlan to detect drift beyond image:
    if signingKeyDrift(node) {
        return buildSigningKeyUpdatePlan(node)
    }
  3. Build a re-apply plan that runs validate-signing-key (already present, gates on Secret correctness) → apply-statefulset (server-side apply triggers the rolling update) → observe-image-equivalent rollout watcher → mark-ready. Stamp the new applied state on completion.

The pod restart is the standard StatefulSet rolling update mechanic — no special "kill seid" step is needed. seid restarts naturally when the pod is replaced and reads the newly-mounted key file at startup.

Acceptance criteria (v1 — SigningKey drift only)

  • Status.SigningKeyMountedSecret field added to SeiNodeStatus and stamped by the rollout-watch task on success
  • buildRunningPlan detects spec.validator.signingKey.secret.secretName != status.SigningKeyMountedSecret and triggers a re-apply plan
  • Re-apply plan includes validate-signing-key so Secret defects are caught before the pod rolls
  • Integration test: deploy SeiNode without SigningKey → reaches Running as observer → patch SigningKey in → pod restarts with Secret mounted → SigningKeyReady=True → StatefulSet has the volume + subPath mount
  • LLD §8 updated to re-document the zero-downtime cutover flow as supported
  • .tide/validator-migration.md updated with both single-shot (existing, downtime) and zero-downtime cutover variants

Out of scope (for v1 of this issue)

  • Mode switch (full-node → validator). Same drift-detection mechanic, larger surface; file as a follow-up once SigningKey drift is shipped and we have a concrete customer ask.
  • Demoting a validator to non-signing (clearing SigningKey on a Running validator). No field-level immutability rule currently blocks unsetting the field, but the workflow is unsupported per LLD §11.
  • Rotating secretName on a Running validator. secretName is immutable by XValidation: self == oldSelf; rotating consensus key requires MsgEditValidator and is a coordinated on-chain operation, not a controller feature.
