doublezerod: add periodic kernel route reconciliation#3672
Conversation
f31a780 to
bd203a8
Compare
Add a reconciliation loop to the liveness manager that periodically scans the kernel routing table for missing BGP routes and reinstalls them, mitigating connectivity loss caused by external processes removing routes. Also promote liveness session down logs from DEBUG to INFO for passive/peer-passive modes so operators can see the full up/down lifecycle.
Increment RouteInstallFailures counter when a reconciliation reinstall fails, matching the observability pattern in onSessionUp. Also pre-allocate the toCheck slice.
- Re-check installed state under lock before RouteAdd to prevent resurrecting routes intentionally withdrawn by onSessionDown - Add SrcIP to kernel route lookup key for tighter matching in multi-interface setups - Reject negative RouteReconcileInterval in Validate() - Use named const for reconcile interval flag default - Log when route reconciliation is enabled at startup
bd203a8 to
99d373a
Compare
Route Reconciliation Performance Analysis
CPU cost per reconciliation cycle
Estimation methodologyLock hold (step 1): Map iteration over Netlink dump (step 2): Map build + diff (step 3): For each kernel route, we call Amortized CPU: Lock contention with HandleRxThe lock is not held during the expensive netlink syscall (step 2). The snapshot in step 1 holds Practical impact on doublezerod CPU usageGiven a ~3% baseline CPU on a modern x86 core, this change adds effectively zero overhead at realistic route counts (low hundreds). The 1M route case is pathological for a doublezerod client and would have other scaling bottlenecks (BGP convergence, session state memory, netlink install throughput) long before reconciliation matters. |
There was a problem hiding this comment.
- The
0 disableskill switch doesn't work —Validate()rewrites 0 to the 30s default, so there's no way to turn this new periodic dataplane writer off during staged rollout. - Excluded routes churn forever — on any host using a route exclude list, the reinstall counter and a Warn log fire every tick, permanently, defeating the metric's purpose.
Details inline.
PR-description corrections (not line-anchored):
- The description claims a unit test for "incrementing the install failure metric on RouteAdd error". No such test exists — the PR adds three (ReinstallsMissing, SkipsPresent, SkipsUninstalled). Either add it (a few lines with the existing mock) or drop the claim.
- It claims to add
doublezero_liveness_route_install_failures_total. That metric already existed; this PR only adds a new increment site (the new metric isdoublezero_liveness_route_reinstalls_total). - "Prevent TOCTOU race" overstates it — see the inline comment; the race is narrowed, not closed.
# Conflicts: # CHANGELOG.md
- Let RouteReconcileInterval=0 disable reconciliation (restore the kill switch); drop the duplicate default constant in the liveness package. - Skip excluded destinations in reconcileRoutes so they no longer churn the reinstall counter and logs every tick. - Hold m.mu across the installed re-check and RouteAdd to close the reconcile/onSessionDown TOCTOU race. - Match kernel routes by full destination prefix (Dst.String()) instead of IP only. - Document the main-table assumption in Netlink.RouteByProtocol. - Add tests for excluded-route skip, install-failure metric on reinstall error, and the 0-disables validation.
|
@ben-dz thanks for the thorough review — all points addressed in 789f442, with per-comment replies inline. Summary: Should-fix (inline):
PR-description corrections:
|
The isis_global_state_latest / isis_overload_bit_latest assertions read the views immediately after inserting, and under CI load the just-inserted rows were not yet visible on the pooled read connection, returning 0 rows. Retry the read until the expected rows appear so the test is deterministic.
A multi-row VALUES list with placeholders is not reliably bound by the clickhouse-go database/sql driver and can silently drop rows, leaving the isis_global_state_latest / isis_overload_bit_latest views empty so the test times out waiting for 2 rows. Insert each row in its own single-row INSERT, which the driver binds reliably.
The read polling added earlier was based on a misdiagnosis: rows appeared missing not because of read-after-write visibility delay but because the multi-row placeholder INSERT silently dropped rows. With single-row inserts the acked data is immediately queryable, so revert selectAll to a direct read.
ben-dz
left a comment
There was a problem hiding this comment.
Post-fix review at af90168. The six fixes from the prior review round all hold up, and production correctness is sound for the IBRL case (verified Table=254/RT_TABLE_MAIN, Protocol=RTPROT_BGP, explicit Src round-trip; RouteAdd is idempotent RouteReplace). No critical/high issues. Three findings worth attention: (M1) the resurrection race is closed for onSessionDown but a symmetric narrow window remains on the passive-mode WithdrawRoute path, which deletes the kernel route before clearing installed — so the PR's "Prevent TOCTOU race" claim is accurate only for the onSessionDown ordering; (M2) the new reconcile tests are partly fictional — they use table 100 (production is 254, which RouteByProtocol filters out) and the SkipsPresent test returns the identical *Route pointer, so kernel-vs-desired key matching is never genuinely exercised; (L1) the RouteAdd syscall is run while holding m.mu, a deviation from the file's no-lock-across-syscall convention (a documented, reviewer-endorsed tradeoff).
Findings not anchored to the current diff:
client/doublezerod/internal/liveness/manager.go:451— medium: Resurrection race still open on the passive-mode WithdrawRoute path. This branch issuesRouteDelete(line 451) before clearinginstalled[rk](line 463) — the opposite ordering fromonSessionDown(clears at :838, deletes at :890), which is the ordering the TOCTOU fix depends on. If reconcile snapshots the route as installed, then passive WithdrawRoute runs its RouteDelete here but is preempted before acquiring the lock at :459, reconcile's kernel query (:945) sees the route missing, takes the lock at :991 withinstalled[rk]still true, and re-adds the route. WithdrawRoute then clears the maps, leaving a permanent stale kernel route the manager believes is withdrawn. Likelihood is low but the consequence is a stale dataplane route to a withdrawn destination. Fix: clearinstalled[rk]/desired[rk]under the lock before RouteDelete here, mirroring onSessionDown. The PR's "Prevent TOCTOU race" claim is accurate only for the onSessionDown ordering.
| return true | ||
| } | ||
|
|
||
| func TestClient_Liveness_Manager_ReconcileRoutes_ReinstallsMissing(t *testing.T) { |
There was a problem hiding this comment.
Reconcile tests don't reproduce production kernel filtering or representation. newTestRoute (main_test.go:71) defaults to Table:100, but liveness runs only in IBRL mode (RT_TABLE_MAIN=254, services/ibrl.go:64), and RouteByProtocol sets only RT_FILTER_PROTOCOL so it returns only main-table routes — a table-100 route would never come back from the real backend. SkipsPresent passes only because the mock returns the identical &r.Route pointer the manager installed, guaranteeing a key match regardless of table and bypassing real representation differences (4-byte vs 16-byte net.IP, prefsrc echo, mask normalization). Recommend using Table:254 and having the mock return a freshly-constructed *routing.Route with the same field values so key construction is genuinely exercised on both sides.
Resolves: #3669
Summary of Changes
--route-liveness-reconcile-interval;0disables it), detects BGP routes that should already be installed but are missing, and reinstalls themdoublezero_liveness_route_reinstalls_totalPrometheus metric to count reinstalls, and increment the existingdoublezero_liveness_route_install_failures_totalwhen a reinstall'sRouteAddfailsRouteAddfor excluded routes, so they are never in the kernel and must not be flagged "missing" every tickonSessionDownTOCTOU race by holdingm.muacross theinstalledre-check andRouteAdd;onSessionDownflips the flag under the lock before issuingRouteDelete, so reconcile either observes the withdrawal and skips or completes its add before the delete landsDst.String()) plus source IP, so routes with the same(table, dst-ip, nexthop)but different masks or source IPs are matched independentlyTesting Verification
RouteAdderrors, and thatValidate()leavesRouteReconcileInterval=0untouched so the kill switch worksNetlinkerto simulate kernel route state; the reconciliation ticker is set totime.Hourin tests soreconcileRoutes()can be driven directly without background interference