## Problem

Two related correctness gaps in the SeiNode controller's peer-discovery flow:
### 1. `DiscoverPeersTask.Sources` is frozen at plan-build time
`internal/planner/planner.go:594-619` (`discoverPeersTask`) snapshots `node.Status.ResolvedPeers` into the planned task's params at the moment the plan is built. The executor's retry path at `internal/planner/executor.go:189-196` resets `RetryCount`/`Status`/`SubmittedAt` but does not refresh `t.Params`. So if `node.Status.ResolvedPeers` was incomplete at plan-build time (e.g., resolved when only 1 of N peer SeiNodes existed), every retry submits the same stale endpoint list.
In the harbor nightly's co-applied validator + RPC fleet, this is masked by the timing-friendly case: the single resolved endpoint (validator-0) eventually becomes reachable and the retry succeeds. But in any race where the snapshotted peer is deleted, scaled away, or stays unreachable (DNS blackhole, network policy), retries are useless and the plan terminal-fails once the retry budget is exhausted.
### 2. `SetupWithManager` doesn't watch peer SeiNodes
`internal/controller/node/controller.go:175-184` registers `For(SeiNode{})` with `GenerationChangedPredicate` and `Owns()` for the owned children (StatefulSet/Service/PVC/Job). There is no `Watches()` on other SeiNodes. Status changes on validator SeiNodes (becoming Ready, being deleted, IP changes) do not trigger reconciliation of the RPC SeiNode that depends on them as peers.
Practical impact: `reconcilePeers` only re-resolves `Status.ResolvedPeers` on the RPC SeiNode's own reconcile cycle. Combined with #1 above, the RPC SeiNode is "blind" to validator state changes for the duration of an in-flight discover-peers task.
## Impact
#209 (`discoverPeersMaxRetries=20`) papers over the harbor symptom by giving the snapshotted endpoint enough time to become reachable. But the underlying behavior is incorrect — the controller is not picking up new or updated peer SeiNodes. Future flakes will surface when:
- A validator pod is replaced (DNS records change → frozen `Sources` outdated).
- The peer-source label selector matches a SeiNode that doesn't exist yet at plan-build time.
- A validator SeiNode is deleted between plan-build and discover-peers retry.

## Relevant experts

- `kubernetes-specialist` — owns controller-runtime watch/event-source patterns, `EnqueueRequestsFromMapFunc`.

## Proposed approach

### Fix #1 — refresh `Sources` on retry

Two options:
**(a) Repopulate task params on each reconcile.** Before submitting (or on retry), the executor or a planner hook reads `node.Status.ResolvedPeers` and updates the in-flight task's `Params` to reflect the current state. Smaller diff, but couples the executor to a task-specific shape.

**(b) Move endpoint resolution into the sidecar.** The task params carry a `PeerSource` (e.g. a label selector) instead of resolved endpoints; the sidecar resolves at task-execution time via the Kubernetes API. Better separation of concerns, but requires the sidecar to have list-peer-SeiNodes RBAC.
Recommend (a) for MVP — smaller blast radius, easier to revert if it surfaces a different problem.
### Fix #2 — cross-watch peer SeiNodes

Add a `Watches()` entry in `SetupWithManager`, where `enqueuePeerDependents` lists all SeiNodes whose `spec.peers[].label.selector` matches the changed SeiNode's labels and enqueues each. A predicate filters out updates that don't change peer-relevant fields (avoiding reconcile storms on every status patch).
## Architectural constraints
- The watch must filter aggressively to avoid reconcile storms. Peer-relevant changes are: SeiNode creation, deletion, label changes (which affect what selectors match), and status transitions to/from `PhaseReady` (peer becoming reachable/unreachable).
- The `Sources` refresh on retry must not race with the executor's status patch. It probably needs to happen in a single transaction by the reconciler before the next task attempt is submitted.
- Backwards compatible — existing SNDs with `EC2Tags` or `Static` peer sources don't go through `Status.ResolvedPeers` and don't change.
## Acceptance criteria
- `DiscoverPeersTask.Sources` reflects the current `Status.ResolvedPeers` on each retry attempt (test: simulate a `ResolvedPeers` change between submission and retry; assert the sidecar sees the updated list).
- Reconciliation of an RPC SeiNode is triggered when a peer SeiNode's labels change, the peer is created/deleted, or the peer transitions to/from `PhaseReady`.
- No reconcile storm under steady state (validator status is patched frequently; the predicate must filter).
- `MaxRetries=20` on discover-peers (from "fix(planner): set MaxRetries on TaskDiscoverPeers" #209) remains as a safety net, not as the primary correctness mechanism.

## Out of scope

- `PeerSource` shape.

## References

- `internal/planner/planner.go:594-619` (`discoverPeersTask`)
- `internal/controller/node/controller.go:175-184` (`SetupWithManager`)
- `internal/controller/node/peers.go:13-66` (`reconcilePeers` + `resolveLabelPeers`)