doublezerod: add periodic kernel route reconciliation #3672

Open
nikw9944 wants to merge 9 commits into main from nikw9944/doublezero-3669
@nikw9944 nikw9944 commented May 5, 2026

Contributor

Resolves: #3669

Summary of Changes

  • Add a periodic route reconciliation goroutine to the liveness manager that scans the kernel routing table (default every 30s, configurable via --route-liveness-reconcile-interval; 0 disables it), detects BGP routes that should already be installed but are missing, and reinstalls them
  • This mitigates the case where another process or an administrator mistakenly deletes a doublezero route from the kernel routing table
  • Add the doublezero_liveness_route_reinstalls_total Prometheus metric to count reinstalls, and increment the existing doublezero_liveness_route_install_failures_total when a reinstall's RouteAdd fails
  • Skip excluded destinations during reconciliation: the manager's Netlinker no-ops RouteAdd for excluded routes, so they are never in the kernel and must not be flagged "missing" every tick
  • Close a reconcile/onSessionDown TOCTOU race by holding m.mu across the installed re-check and RouteAdd; onSessionDown flips the flag under the lock before issuing RouteDelete, so reconcile either observes the withdrawal and skips or completes its add before the delete lands
  • Match kernel routes by full destination prefix (Dst.String()) plus source IP, so routes with the same (table, dst-ip, nexthop) but different masks or source IPs are matched independently
  • Promote "session down (passive; keeping route)" log messages from Debug to Info; otherwise the logs can show multiple 'liveness: session up' messages in a row with no visible down event between them
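
The prefix-plus-source matching described in the list above can be illustrated with a minimal sketch. The `kernelKey` field names, the simplified `route` type, and the `newRoute` helper are assumptions made for illustration, not the PR's actual definitions.

```go
package main

import (
	"fmt"
	"net"
)

// kernelKey is a hypothetical comparable map key. The field set follows the
// PR description (table, full destination prefix, nexthop, source IP).
type kernelKey struct {
	table   int
	dst     string // full prefix, for example "10.0.0.0/24"
	nextHop string
	src     string
}

// route is a simplified stand-in for the daemon's route type.
type route struct {
	Table   int
	Dst     *net.IPNet
	NextHop net.IP
	Src     net.IP
}

// newRoute is an illustrative helper; it panics on a malformed prefix.
func newRoute(table int, cidr, nextHop, src string) route {
	_, n, err := net.ParseCIDR(cidr)
	if err != nil {
		panic(err)
	}
	return route{Table: table, Dst: n, NextHop: net.ParseIP(nextHop), Src: net.ParseIP(src)}
}

// keyFor builds the key from the full prefix string, so the mask takes part
// in the comparison.
func keyFor(r route) kernelKey {
	return kernelKey{
		table:   r.Table,
		dst:     r.Dst.String(),
		nextHop: r.NextHop.String(),
		src:     r.Src.String(),
	}
}

func main() {
	a := newRoute(254, "10.0.0.0/24", "192.0.2.1", "198.51.100.1")
	b := newRoute(254, "10.0.0.0/25", "192.0.2.1", "198.51.100.1") // different mask
	c := newRoute(254, "10.0.0.0/24", "192.0.2.1", "198.51.100.2") // different source

	// Only route a is present in the simulated kernel set.
	kernel := map[kernelKey]struct{}{keyFor(a): {}}
	for _, r := range []route{a, b, c} {
		_, ok := kernel[keyFor(r)]
		fmt.Printf("%s src=%s present=%v\n", r.Dst, r.Src, ok)
	}
}
```

Because the key is a struct of comparable fields, it can be used directly as a map key, and routes that differ only in mask or source IP land in separate entries.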

Testing Verification

  • Unit tests cover: reinstalling a missing route, skipping a route present in kernel, skipping an uninstalled route (active mode, session never went Up), skipping an excluded route (no reinstall, counter stays at 0), incrementing the install-failure metric when a reinstall's RouteAdd errors, and that Validate() leaves RouteReconcileInterval=0 untouched so the kill switch works
  • Tests use a mock Netlinker to simulate kernel route state; the reconciliation ticker is set to time.Hour in tests so reconcileRoutes() can be driven directly without background interference
  • Full liveness and routing test suites pass
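
The kill-switch behaviour that the `Validate()` test checks can be sketched as follows. This is a minimal illustration under assumptions: the `config` type, the error value, and the `reconcileEnabled` helper are hypothetical, and only the `RouteReconcileInterval` field name comes from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// config is a hypothetical stand-in for the liveness manager configuration.
type config struct {
	RouteReconcileInterval time.Duration
}

var errNegativeInterval = errors.New("route reconcile interval must not be negative")

// Validate rejects negative intervals and leaves 0 untouched, so that 0
// keeps meaning "reconciliation disabled" instead of being rewritten to a
// default.
func (c *config) Validate() error {
	if c.RouteReconcileInterval < 0 {
		return errNegativeInterval
	}
	return nil
}

// reconcileEnabled mirrors the "> 0" guard mentioned in the thread.
func (c *config) reconcileEnabled() bool {
	return c.RouteReconcileInterval > 0
}

func main() {
	for _, d := range []time.Duration{0, 30 * time.Second, -time.Second} {
		c := &config{RouteReconcileInterval: d}
		err := c.Validate()
		fmt.Printf("interval=%v err=%v enabled=%v\n", d, err, err == nil && c.reconcileEnabled())
	}
}
```

In this sketch the 30s default is assumed to come from the command-line flag rather than from `Validate()`, which is what lets an explicit 0 survive validation.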

@nikw9944 nikw9944 marked this pull request as ready for review May 5, 2026 21:07
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3669 branch from f31a780 to bd203a8 Compare May 5, 2026 21:37
nikw9944 added 4 commits May 5, 2026 21:49
Add a reconciliation loop to the liveness manager that periodically
scans the kernel routing table for missing BGP routes and reinstalls
them, mitigating connectivity loss caused by external processes
removing routes.

Also promote liveness session down logs from DEBUG to INFO for
passive/peer-passive modes so operators can see the full up/down
lifecycle.
Increment RouteInstallFailures counter when a reconciliation reinstall
fails, matching the observability pattern in onSessionUp. Also
pre-allocate the toCheck slice.
- Re-check installed state under lock before RouteAdd to prevent
  resurrecting routes intentionally withdrawn by onSessionDown
- Add SrcIP to kernel route lookup key for tighter matching in
  multi-interface setups
- Reject negative RouteReconcileInterval in Validate()
- Use named const for reconcile interval flag default
- Log when route reconciliation is enabled at startup
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3669 branch from bd203a8 to 99d373a Compare May 5, 2026 21:50
nikw9944 commented May 6, 2026

Contributor Author

Route Reconciliation Performance Analysis

reconcileRoutes() runs every 30s with 4 phases:

  1. Lock → snapshot → unlock: iterates installed ∩ desired into a local slice
  2. Netlink dump (no lock held): RouteListFiltered(FAMILY_V4, RTPROT_BGP) dumps all kernel BGP routes
  3. Build hash set + diff: builds map[kernelKey]struct{} from kernel routes, checks each installed route against it
  4. Reinstall missing: brief re-lock per missing route, then RouteAdd
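
The four phases above can be sketched roughly as below. This is a simplified model, not the PR's code: routes are plain strings, the `kernel` type is a hypothetical in-memory stand-in for the Netlinker, and phase 4 holds the lock across the add, as the later revision in this thread does.

```go
package main

import (
	"fmt"
	"sync"
)

// kernel is a hypothetical stand-in for the Netlinker: List models the BGP
// route dump and Add models RouteAdd.
type kernel struct {
	mu     sync.Mutex
	routes map[string]struct{}
}

func (k *kernel) List() []string {
	k.mu.Lock()
	defer k.mu.Unlock()
	out := make([]string, 0, len(k.routes))
	for r := range k.routes {
		out = append(out, r)
	}
	return out
}

func (k *kernel) Add(r string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.routes[r] = struct{}{}
}

type manager struct {
	mu        sync.Mutex
	installed map[string]bool
	desired   map[string]bool
	k         *kernel
}

// reconcile returns the number of routes it reinstalled.
func (m *manager) reconcile() int {
	// Phase 1: snapshot installed ∩ desired under the lock.
	m.mu.Lock()
	toCheck := make([]string, 0, len(m.installed))
	for r, ok := range m.installed {
		if ok && m.desired[r] {
			toCheck = append(toCheck, r)
		}
	}
	m.mu.Unlock()

	// Phase 2: dump kernel routes without holding m.mu.
	// Phase 3: build a set for O(1) membership checks.
	present := make(map[string]struct{})
	for _, r := range m.k.List() {
		present[r] = struct{}{}
	}

	// Phase 4: for each missing route, re-check under the lock and add
	// while still holding it, so a concurrent withdrawal is observed.
	reinstalled := 0
	for _, r := range toCheck {
		if _, ok := present[r]; ok {
			continue
		}
		m.mu.Lock()
		if m.installed[r] {
			m.k.Add(r)
			reinstalled++
		}
		m.mu.Unlock()
	}
	return reinstalled
}

func main() {
	k := &kernel{routes: map[string]struct{}{"10.0.0.0/24": {}}}
	m := &manager{
		installed: map[string]bool{"10.0.0.0/24": true, "10.0.1.0/24": true},
		desired:   map[string]bool{"10.0.0.0/24": true, "10.0.1.0/24": true},
		k:         k,
	}
	fmt.Println("first pass reinstalled:", m.reconcile())
	fmt.Println("second pass reinstalled:", m.reconcile())
}
```

The snapshot keeps the lock hold proportional to map iteration only, while the dump and set build, which dominate the cost estimates in this comment, run unlocked.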

CPU cost per reconciliation cycle

| Routes | Lock hold (step 1) | Netlink dump (step 2) | Map build + diff (step 3) | Total per cycle | Amortized over 30s |
|---|---|---|---|---|---|
| 100 | ~1 μs | ~0.5 ms | ~50 μs | ~0.6 ms | 0.002% |
| 1,000 | ~5 μs | ~3 ms | ~0.5 ms | ~4 ms | 0.013% |
| 1,000,000 | ~5 ms | ~1-2 s | ~300-500 ms | ~2-3 s | ~7-10% |

Estimation methodology

Lock hold (step 1): Map iteration over installed checking membership in desired, appending matches to a slice. Go map iteration is ~10ns/entry. 100 entries × 10ns = 1μs; 1000 × 10ns = 10μs (rounded to ~5μs accounting for the append being fast). At 1M entries, iteration + slice growth + memory allocation dominates: ~5ms.

Netlink dump (step 2): RouteListFiltered issues a single RTM_GETROUTE netlink dump; the kernel filters by protocol server-side and streams matching routes back. Each route message is ~100-200 bytes of netlink payload. At 100 routes that's ~10-20KB parsed by vishvananda/netlink — well under 1ms. At 1,000 routes, ~100-200KB, ~3ms. At 1M routes, ~100-200MB of netlink data to receive and deserialize — estimated 1-2s based on netlink socket throughput (~100-200MB/s for dump operations).

Map build + diff (step 3): For each kernel route, we call IP.To4().String() on 3 fields (dst, nexthop, src) creating heap-allocated strings, then insert into map[kernelKey]struct{}. String conversion + map insert is ~200-500ns/route. At 100 routes: ~50μs. At 1,000: ~500μs. At 1M: ~300-500ms (dominated by allocations and GC pressure from ~3M short-lived strings). The diff itself is O(installed) map lookups at O(1) each — negligible relative to the build.

Amortized CPU: total_per_cycle / 30s. E.g., at 1,000 routes: 4ms / 30,000ms = 0.013%.
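
The amortization formula above can be checked with a few lines of Go. The `amortizedPercent` helper is a hypothetical name, and the inputs are the per-cycle totals from the table, not new measurements.

```go
package main

import (
	"fmt"
	"time"
)

// amortizedPercent returns the share of one core, in percent, consumed by a
// cycle costing perCycle that runs once per interval.
func amortizedPercent(perCycle, interval time.Duration) float64 {
	return 100 * float64(perCycle) / float64(interval)
}

func main() {
	interval := 30 * time.Second
	// Per-cycle totals taken from the table: 100 routes, 1,000 routes, and
	// the lower and upper bounds for 1,000,000 routes.
	costs := []time.Duration{
		600 * time.Microsecond,
		4 * time.Millisecond,
		2 * time.Second,
		3 * time.Second,
	}
	for _, c := range costs {
		fmt.Printf("%v per cycle -> %.3f%% of one core\n", c, amortizedPercent(c, interval))
	}
}
```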

Lock contention with HandleRx

The lock is not held during the expensive netlink syscall (step 2). The snapshot in step 1 holds m.mu only during map iteration. HandleRx (the hot path) also holds m.mu for its duration, so these can contend — but at 100-1000 routes, a 1-5μs delay is negligible compared to the 50ms-3s liveness TX intervals. At 1M routes the ~5ms lock hold becomes measurable but is still small relative to liveness timers.

Practical impact on doublezerod CPU usage

Given a ~3% baseline CPU on a modern x86 core, this change adds effectively zero overhead at realistic route counts (low hundreds). The 1M route case is pathological for a doublezerod client and would have other scaling bottlenecks (BGP convergence, session state memory, netlink install throughput) long before reconciliation matters.

ben-dz previously requested changes Jun 15, 2026

@ben-dz ben-dz left a comment

Contributor

  1. The 0 disables kill switch doesn't work — Validate() rewrites 0 to the 30s default, so there's no way to turn this new periodic dataplane writer off during staged rollout.
  2. Excluded routes churn forever — on any host using a route exclude list, the reinstall counter and a Warn log fire every tick, permanently, defeating the metric's purpose.

Details inline.

PR-description corrections (not line-anchored):

  • The description claims a unit test for "incrementing the install failure metric on RouteAdd error". No such test exists — the PR adds three (ReinstallsMissing, SkipsPresent, SkipsUninstalled). Either add it (a few lines with the existing mock) or drop the claim.
  • It claims to add doublezero_liveness_route_install_failures_total. That metric already existed; this PR only adds a new increment site (the new metric is doublezero_liveness_route_reinstalls_total).
  • "Prevent TOCTOU race" overstates it — see the inline comment; the race is narrowed, not closed.

Comment thread client/doublezerod/internal/liveness/manager.go Outdated
Comment thread client/doublezerod/internal/liveness/manager.go
Comment thread client/doublezerod/internal/liveness/manager.go Outdated
Comment thread client/doublezerod/internal/liveness/manager.go
Comment thread client/doublezerod/internal/liveness/manager.go Outdated
nikw9944 added 2 commits June 17, 2026 17:30
- Let RouteReconcileInterval=0 disable reconciliation (restore the kill
  switch); drop the duplicate default constant in the liveness package.
- Skip excluded destinations in reconcileRoutes so they no longer churn
  the reinstall counter and logs every tick.
- Hold m.mu across the installed re-check and RouteAdd to close the
  reconcile/onSessionDown TOCTOU race.
- Match kernel routes by full destination prefix (Dst.String()) instead
  of IP only.
- Document the main-table assumption in Netlink.RouteByProtocol.
- Add tests for excluded-route skip, install-failure metric on reinstall
  error, and the 0-disables validation.
@nikw9944

Contributor Author

@ben-dz thanks for the thorough review — all points addressed in 789f442, with per-comment replies inline. Summary:

Should-fix (inline):

  • Kill switch: Validate() no longer rewrites 0; the > 0 guard now genuinely disables reconciliation, and the duplicate default constant is gone.
  • Excluded routes: reconcileRoutes skips m.cr.IsExcluded(...) destinations, so they no longer churn the counter/logs.
  • TOCTOU: m.mu is now held across the re-check and RouteAdd, closing the race (not just narrowing it).
  • Prefix match: kernel-set key uses the full *net.IPNet.String() prefix on both sides.
  • RouteByProtocol main-table assumption: documented with a NOTE and a pointer to add RT_FILTER_TABLE if ever used for tables 100/101.

PR-description corrections:

  • Added TestClient_Liveness_Manager_ReconcileRoutes_IncrementsInstallFailureMetric, so the install-failure-on-RouteAdd-error test claim is now real.
  • Description corrected: it now states the PR increments the existing doublezero_liveness_route_install_failures_total and adds doublezero_liveness_route_reinstalls_total.
  • "Prevent TOCTOU" reworded to "Close … by holding m.mu", which is now accurate given the lock is held across the syscall.

@nikw9944 nikw9944 self-assigned this Jun 17, 2026
nikw9944 added 3 commits June 17, 2026 20:05
The isis_global_state_latest / isis_overload_bit_latest assertions read the
views immediately after inserting, and under CI load the just-inserted rows
were not yet visible on the pooled read connection, returning 0 rows. Retry
the read until the expected rows appear so the test is deterministic.
A multi-row VALUES list with placeholders is not reliably bound by the
clickhouse-go database/sql driver and can silently drop rows, leaving the
isis_global_state_latest / isis_overload_bit_latest views empty so the test
times out waiting for 2 rows. Insert each row in its own single-row INSERT,
which the driver binds reliably.
The read polling added earlier was based on a misdiagnosis: rows appeared
missing not because of read-after-write visibility delay but because the
multi-row placeholder INSERT silently dropped rows. With single-row inserts the
acked data is immediately queryable, so revert selectAll to a direct read.

@ben-dz ben-dz left a comment

Contributor

Post-fix review at af90168. The six fixes from the prior review round all hold up, and production correctness is sound for the IBRL case (verified Table=254/RT_TABLE_MAIN, Protocol=RTPROT_BGP, explicit Src round-trip; RouteAdd is idempotent RouteReplace). No critical/high issues. Three findings worth attention:

  • (M1) The resurrection race is closed for onSessionDown, but a symmetric narrow window remains on the passive-mode WithdrawRoute path, which deletes the kernel route before clearing installed, so the PR's "Prevent TOCTOU race" claim is accurate only for the onSessionDown ordering.
  • (M2) The new reconcile tests are partly fictional: they use table 100, which RouteByProtocol filters out (production is 254), and the SkipsPresent test returns the identical *Route pointer, so kernel-vs-desired key matching is never genuinely exercised.
  • (L1) The RouteAdd syscall runs while holding m.mu, a deviation from the file's no-lock-across-syscall convention (a documented, reviewer-endorsed tradeoff).

Findings not anchored to the current diff:

  • client/doublezerod/internal/liveness/manager.go:451 — medium: Resurrection race still open on the passive-mode WithdrawRoute path. This branch issues RouteDelete (line 451) before clearing installed[rk] (line 463) — the opposite ordering from onSessionDown (clears at :838, deletes at :890), which is the ordering the TOCTOU fix depends on. If reconcile snapshots the route as installed, then passive WithdrawRoute runs its RouteDelete here but is preempted before acquiring the lock at :459, reconcile's kernel query (:945) sees the route missing, takes the lock at :991 with installed[rk] still true, and re-adds the route. WithdrawRoute then clears the maps, leaving a permanent stale kernel route the manager believes is withdrawn. Likelihood is low but the consequence is a stale dataplane route to a withdrawn destination. Fix: clear installed[rk]/desired[rk] under the lock before RouteDelete here, mirroring onSessionDown. The PR's "Prevent TOCTOU race" claim is accurate only for the onSessionDown ordering.
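
The ordering problem in this finding can be reproduced in a deterministic, single-goroutine model. Everything here is a hypothetical simplification: the `manager` fields, the function names, and the `gap` callback (which stands in for reconcile running while the withdrawing goroutine is preempted) are illustrative and do not mirror the real code.

```go
package main

import (
	"fmt"
	"sync"
)

type manager struct {
	mu        sync.Mutex
	installed map[string]bool
	kernel    map[string]bool // stand-in for the kernel routing table
}

func newManager(r string) *manager {
	return &manager{
		installed: map[string]bool{r: true},
		kernel:    map[string]bool{r: true},
	}
}

// reinstallIfInstalled models reconcile's last step: under the lock, re-add
// the route only if the manager still believes it is installed.
func (m *manager) reinstallIfInstalled(r string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.installed[r] {
		m.kernel[r] = true
	}
}

// withdrawDeleteFirst models the flagged ordering: delete the kernel route,
// then clear the flag. gap runs in the window between the two steps.
func (m *manager) withdrawDeleteFirst(r string, gap func()) {
	delete(m.kernel, r)
	gap()
	m.mu.Lock()
	m.installed[r] = false
	m.mu.Unlock()
}

// withdrawClearFirst models the suggested fix: clear the flag under the
// lock first, then delete the kernel route.
func (m *manager) withdrawClearFirst(r string, gap func()) {
	m.mu.Lock()
	m.installed[r] = false
	m.mu.Unlock()
	gap()
	delete(m.kernel, r)
}

func main() {
	const r = "10.0.0.0/24"

	a := newManager(r)
	a.withdrawDeleteFirst(r, func() { a.reinstallIfInstalled(r) })
	fmt.Println("delete-first, stale kernel route left behind:", a.kernel[r])

	b := newManager(r)
	b.withdrawClearFirst(r, func() { b.reinstallIfInstalled(r) })
	fmt.Println("clear-first, stale kernel route left behind:", b.kernel[r])
}
```

With the flag cleared first, a reconcile pass that lands in the gap sees the withdrawal and skips the add, and the subsequent delete removes the route either way.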

return true
}

func TestClient_Liveness_Manager_ReconcileRoutes_ReinstallsMissing(t *testing.T) {

Contributor

Reconcile tests don't reproduce production kernel filtering or representation. newTestRoute (main_test.go:71) defaults to Table:100, but liveness runs only in IBRL mode (RT_TABLE_MAIN=254, services/ibrl.go:64), and RouteByProtocol sets only RT_FILTER_PROTOCOL so it returns only main-table routes — a table-100 route would never come back from the real backend. SkipsPresent passes only because the mock returns the identical &r.Route pointer the manager installed, guaranteeing a key match regardless of table and bypassing real representation differences (4-byte vs 16-byte net.IP, prefsrc echo, mask normalization). Recommend using Table:254 and having the mock return a freshly-constructed *routing.Route with the same field values so key construction is genuinely exercised on both sides.
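
A rough sketch of the recommendation, under assumptions: the `route` type, the `key` function, and both constructors are hypothetical stand-ins for `routing.Route` and the mock, and the 16-byte form is chosen here only to force a representation difference, not because the real backend is known to return it.

```go
package main

import (
	"fmt"
	"net"
)

// route is a simplified stand-in for the daemon's route type.
type route struct {
	Table   int
	Dst     *net.IPNet
	NextHop net.IP
}

// key builds a comparison key from string forms, which normalizes 4-byte
// and 16-byte representations of the same IPv4 address.
func key(r *route) string {
	return fmt.Sprintf("%d|%s|%s", r.Table, r.Dst.String(), r.NextHop.String())
}

// installedRoute models what the manager installed, using 4-byte IPs and
// the main table (254).
func installedRoute() *route {
	return &route{
		Table:   254,
		Dst:     &net.IPNet{IP: net.IPv4(10, 0, 0, 0).To4(), Mask: net.CIDRMask(24, 32)},
		NextHop: net.IPv4(192, 0, 2, 1).To4(),
	}
}

// freshCopy models a mock that returns a newly constructed route with the
// same values but 16-byte IPs, instead of handing back the same pointer.
func freshCopy(r *route) *route {
	return &route{
		Table:   r.Table,
		Dst:     &net.IPNet{IP: r.Dst.IP.To16(), Mask: r.Dst.Mask},
		NextHop: r.NextHop.To16(),
	}
}

func main() {
	want := installedRoute()
	got := freshCopy(want)
	fmt.Println("same pointer:", want == got)
	fmt.Println("nexthop byte lengths:", len(want.NextHop), len(got.NextHop))
	fmt.Println("keys match:", key(want) == key(got))
}
```

A test built this way only passes if key construction actually normalizes both sides, which is the property the identical-pointer mock cannot check.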

@ben-dz ben-dz dismissed their stale review June 18, 2026 17:46

Changes Addressed.

Development

Successfully merging this pull request may close these issues.

Route installed by doublezerod removed by unknown process

2 participants