discover-peers: refresh Sources on retry + cross-watch peer SeiNodes #210

@bdchatham

Description

Problem

Two related correctness gaps in the SeiNode controller's peer-discovery flow:

1. DiscoverPeersTask.Sources is frozen at plan-build time

internal/planner/planner.go:594-619 (discoverPeersTask) snapshots node.Status.ResolvedPeers into the planned task's params at the moment the plan is built. The executor's retry path at internal/planner/executor.go:189-196 resets RetryCount/Status/SubmittedAt but never refreshes t.Params. So if node.Status.ResolvedPeers was incomplete at plan-build time (e.g., resolved when only 1 of N peer SeiNodes existed), every retry resubmits the same stale endpoint list.

In the harbor nightly co-applied validator + RPC fleet, this is masked by the timing-friendly case: the single resolved endpoint (validator-0) eventually becomes reachable and a retry succeeds. But in any race where the snapshotted peer is deleted, scaled away, or remains unreachable (DNS blackhole, network policy), retries are useless and the plan terminal-fails once the budget is exhausted.
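
For concreteness, a minimal sketch of the retry path as described, with simplified stand-ins for the real internal/planner types:

package planner

import "time"

// Simplified stand-ins for the real Task/Executor in internal/planner.
type Task struct {
    RetryCount  int
    Status      string
    SubmittedAt time.Time
    Params      map[string]string // carries the Sources list snapshotted at plan-build time
}

type Executor struct {
    submit func(*Task)
}

// retry mirrors the behavior at executor.go:189-196: bookkeeping fields are
// reset, but Params is never touched, so every attempt resubmits the
// endpoint list captured when the plan was built.
func (e *Executor) retry(t *Task) {
    t.RetryCount++
    t.Status = "pending"
    t.SubmittedAt = time.Now()
    // t.Params is left as-is; this is the frozen-Sources gap.
    e.submit(t)
}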

2. SetupWithManager doesn't watch peer SeiNodes

internal/controller/node/controller.go:175-184 registers For(SeiNode{}) with GenerationChangedPredicate and Owns() for the owned children (StatefulSet/Service/PVC/Job), but there is no Watches() on other SeiNodes. Changes to validator SeiNodes (becoming Ready, being deleted, IP changes) therefore never trigger reconciliation of the RPC SeiNode that depends on them as peers.

Practical impact: reconcilePeers only re-resolves Status.ResolvedPeers on the RPC SeiNode's own reconcile cycle. Combined with #1 above, this means the RPC SeiNode is "blind" to validator state changes for the duration of an in-flight discover-peers task.

Impact

#209 (discoverPeersMaxRetries=20) papers over the harbor symptom by giving the snapshotted endpoint enough time to become reachable. But the underlying behavior is incorrect — the controller is not picking up new/updated peer SeiNodes. Future flakes will surface when:

  • A validator pod is replaced (DNS records change → frozen Sources outdated).
  • The peer-source label selector matches a SeiNode that doesn't exist yet at plan-build time.
  • A validator SeiNode is deleted between plan-build and discover-peers retry.

Relevant experts

  • kubernetes-specialist — owns controller-runtime watch/event-source patterns, EnqueueRequestsFromMapFunc.

Proposed approach

Fix #1 — refresh Sources on retry

Two options:

(a) Repopulate task params on each reconcile. Before submitting (or on retry), the executor or a planner hook reads node.Status.ResolvedPeers and updates the in-flight task's Params to reflect current state. Smaller diff, but couples executor to a task-specific shape.
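
A minimal sketch of what (a) could look like; ParamSources, TaskDiscoverPeers, ResolvedPeer, and RetryableTask are illustrative names, not the real API:

package planner

import "strings"

const (
    TaskDiscoverPeers = "discover-peers"
    ParamSources      = "sources"
)

type ResolvedPeer struct {
    Endpoint string
}

type RetryableTask struct {
    Type   string
    Params map[string]string
}

// refreshDiscoverPeersParams runs immediately before a retried task is
// resubmitted: it overwrites the plan-build-time snapshot with the
// currently resolved peer endpoints.
func refreshDiscoverPeersParams(t *RetryableTask, resolved []ResolvedPeer) {
    if t.Type != TaskDiscoverPeers {
        return // only discover-peers carries a Sources snapshot
    }
    endpoints := make([]string, 0, len(resolved))
    for _, p := range resolved {
        endpoints = append(endpoints, p.Endpoint)
    }
    t.Params[ParamSources] = strings.Join(endpoints, ",")
}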

(b) Move endpoint resolution into the sidecar. The task params carry a PeerSource (e.g. label selector) instead of resolved endpoints; the sidecar resolves at task-execution time via the K8s API. Better separation of concerns. Requires sidecar to have list-peer-SeiNodes RBAC.
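
A sketch of the params shape under (b); field names are illustrative:

package planner

// Under (b) the task carries the peer source rather than resolved
// endpoints; the sidecar resolves the selector against the K8s API at
// task-execution time.
type DiscoverPeersParams struct {
    // LabelSelector identifies peer SeiNodes and is resolved by the
    // sidecar at execution time (needs list RBAC on SeiNodes).
    LabelSelector string `json:"labelSelector,omitempty"`
    // StaticEndpoints bypass resolution entirely (EC2Tags/Static sources
    // keep their current behavior).
    StaticEndpoints []string `json:"staticEndpoints,omitempty"`
}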

Recommend (a) for MVP — smaller blast radius, easier to revert if it surfaces a different problem.

Fix #2 — cross-watch peer SeiNodes

Add to SetupWithManager:

.Watches(&seiv1alpha1.SeiNode{},
    handler.EnqueueRequestsFromMapFunc(r.enqueuePeerDependents),
    builder.WithPredicates(predicate.Or(
        predicate.GenerationChangedPredicate{},
        // custom predicate: label changes and PhaseReady transitions
        statusResolvedPeersChangedPredicate{},
    )),
)

Here enqueuePeerDependents lists all SeiNodes whose spec.peers[].label.selector matches the changed SeiNode's labels and enqueues each of them; a sketch follows below. The predicate filters out updates that don't touch peer-relevant fields, so that routine status patches don't cause reconcile storms.
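
A sketch of that map function, assuming spec.peers carries a metav1.LabelSelector (the exact SeiNodeSpec shape may differ, and the API import path is elided):

package node

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    seiv1alpha1 "…/api/v1alpha1" // project API package, path elided
)

func (r *SeiNodeReconciler) enqueuePeerDependents(ctx context.Context, obj client.Object) []reconcile.Request {
    changed, ok := obj.(*seiv1alpha1.SeiNode)
    if !ok {
        return nil
    }
    // The label-resolver is intra-namespace, so scope the list the same way.
    var nodes seiv1alpha1.SeiNodeList
    if err := r.List(ctx, &nodes, client.InNamespace(changed.Namespace)); err != nil {
        return nil
    }
    var reqs []reconcile.Request
    for i := range nodes.Items {
        n := &nodes.Items[i]
        if n.Name == changed.Name {
            continue // a node is not its own peer dependent
        }
        for _, p := range n.Spec.Peers {
            if p.LabelSelector == nil {
                continue // EC2Tags/Static sources are not label-resolved
            }
            sel, err := metav1.LabelSelectorAsSelector(p.LabelSelector)
            if err != nil {
                continue
            }
            if sel.Matches(labels.Set(changed.Labels)) {
                reqs = append(reqs, reconcile.Request{NamespacedName: types.NamespacedName{
                    Namespace: n.Namespace,
                    Name:      n.Name,
                }})
                break // one enqueue per dependent is enough
            }
        }
    }
    return reqs
}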

Architectural constraints

  • Watch must filter aggressively to avoid reconcile storms. Peer-relevant changes are: SeiNode creation, deletion, label changes (which affect what selectors match), and status transitions to/from PhaseReady (peer becoming reachable/unreachable); see the predicate sketch after this list.
  • Sources-refresh on retry must not race with the executor's status patch. The refresh probably needs to happen in a single update by the reconciler before the next task attempt is submitted.
  • Backwards compatible: existing SeiNodes with EC2Tags or Static peer sources don't go through Status.ResolvedPeers and are unaffected.
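
A sketch of the custom predicate from the snippet above, embedding predicate.Funcs so Create and Delete events pass through by default; Status.Phase and PhaseReady are assumptions about the SeiNode API:

package node

import (
    "maps"

    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"

    seiv1alpha1 "…/api/v1alpha1" // project API package, path elided
)

// statusResolvedPeersChangedPredicate embeds predicate.Funcs, whose nil
// callbacks default to true, so creation and deletion always enqueue.
type statusResolvedPeersChangedPredicate struct {
    predicate.Funcs
}

// Update passes only peer-relevant changes: label changes (which alter what
// selectors match) and transitions to/from PhaseReady. Routine status
// patches are dropped to avoid reconcile storms.
func (statusResolvedPeersChangedPredicate) Update(e event.UpdateEvent) bool {
    oldNode, okOld := e.ObjectOld.(*seiv1alpha1.SeiNode)
    newNode, okNew := e.ObjectNew.(*seiv1alpha1.SeiNode)
    if !okOld || !okNew {
        return false
    }
    if !maps.Equal(oldNode.Labels, newNode.Labels) {
        return true
    }
    wasReady := oldNode.Status.Phase == seiv1alpha1.PhaseReady
    isReady := newNode.Status.Phase == seiv1alpha1.PhaseReady
    return wasReady != isReady
}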

Acceptance criteria

  • DiscoverPeersTask.Sources reflects the current Status.ResolvedPeers on each retry attempt (test: simulate a ResolvedPeers change between submission and retry and assert the sidecar sees the updated list; see the test sketch after this list).
  • Reconciliation of an RPC SeiNode is triggered when a peer SeiNode's labels change, the peer is created/deleted, or the peer transitions to/from PhaseReady.
  • No reconcile storm under steady-state (validator status is patched frequently; predicate must filter).
  • MaxRetries=20 on discover-peers (from #209) remains as a safety net, not as the primary correctness mechanism.
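
A unit-test sketch for the first criterion, reusing the illustrative refresh helper and types from the option-(a) sketch above:

package planner

import "testing"

// Simulates ResolvedPeers changing between submission and retry and asserts
// the refreshed params carry the updated list (endpoints are made-up test data).
func TestRetryRefreshesSources(t *testing.T) {
    task := &RetryableTask{
        Type:   TaskDiscoverPeers,
        Params: map[string]string{ParamSources: "validator-0:26656"},
    }
    // A second validator resolved after the plan was built.
    resolved := []ResolvedPeer{
        {Endpoint: "validator-0:26656"},
        {Endpoint: "validator-1:26656"},
    }
    refreshDiscoverPeersParams(task, resolved)
    want := "validator-0:26656,validator-1:26656"
    if got := task.Params[ParamSources]; got != want {
        t.Fatalf("stale Sources on retry: got %q, want %q", got, want)
    }
}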

Out of scope

  • Replacing the entire peer-discovery model (e.g. with a service-mesh approach). Stay within the current PeerSource shape.
  • Multi-cluster peer discovery (peers in another K8s cluster); the current label-resolver is intra-namespace.

References

  • Frozen-Sources site: internal/planner/planner.go:594-619 (discoverPeersTask).
  • Reconciler watch site: internal/controller/node/controller.go:175-184 (SetupWithManager).
  • Resolver: internal/controller/node/peers.go:13-66 (reconcilePeers + resolveLabelPeers).
  • Stop-gap retry budget: #209 (fix(planner): set MaxRetries on TaskDiscoverPeers); this work depends on it landing first.
  • Operational evidence: harbor nightly release-test 2026-05-08 — discover-peers terminal-failed on first attempt, masked by #209's retry-budget bump.
