Implement Multi-node Quorum-based Health Checking for Kvrocks Nodes #392

@Paragrf

Motivation
In a multi-controller deployment, triggering failover solely on a single controller's local probes is risky. If that controller experiences partial network degradation to a Kvrocks node while its peers still observe a healthy state, it may initiate an unwarranted failover, leading to a split-brain scenario and unnecessary traffic shedding.

Implementation

  • Voting mechanism: Each controller independently probes every Kvrocks node. Before promoting a new master, the leader collects votes from all active peer controllers via POST /internal/vote and proceeds only if they are unanimous. A single NO vote, or a peer that is unreachable while its lease is still live, blocks the failover.

  • Peer discovery: Controllers register themselves in the store and keep a lease alive with a heartbeat. ListActivePeers returns only the peers whose leases are still alive. Peers with expired leases are excluded from the quorum rather than blocking failover.
