Control plane: cross-request OrgKelSnapshot cache (tip-SAID keyed), gated on a written fleet-scale target #254

@bordumb

Context

The N+1 KEL replay in the list endpoints is fixed: list_agents / list_fleet now load one shared OrgKelSnapshot per request instead of one per member. The invariant is pinned by a counting test double on RegistryBackend::visit_events (crates/auths-api/tests/cases/kel_replay_count.rs): full org-KEL replays must be independent of member count and within a small constant budget, currently 3 (the list() roster scan plus OrgKelSnapshot::load's roster walk and event walk).

What remains is cross-request cost: every request still replays the org KEL from Git 3 times. Cost per request is O(KEL events), and KEL events grow with member churn (~2-3 events per add/revoke/rotate). Rough shape:

| Org scale | KEL events (approx) | Event reads per list_fleet request |
| --- | --- | --- |
| 100 agents | ~300 | ~900 |
| 1,000 agents | ~3,000 | ~9,000 |
| 10,000 agents + churn | ~30,000 | ~90,000 |
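
The rough shape above is a product of three factors, which can be written down as a small estimator for the launch-scale doc. This is only a sketch: the function name and parameters are illustrative, and the "3 events per agent" and "3 replays per request" inputs are the approximations from this issue, not measurements.

```rust
/// Rough per-request read estimate. All inputs are assumptions:
/// `events_per_agent` approximates churn (~2-3 events per add/revoke/rotate),
/// `replays_per_request` is the current full-replay budget (3 today).
pub fn event_reads_per_request(
    agents: u64,
    events_per_agent: u64,
    replays_per_request: u64,
) -> u64 {
    // KEL length grows with membership and churn.
    let kel_events = agents * events_per_agent;
    // Every full replay walks the whole KEL once.
    kel_events * replays_per_request
}
```

With `replays_per_request = 0` (a warm cache hit) the estimate drops to zero event reads, leaving only the tip read.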

Decision needed first (do not skip)

Write down the launch-scale assumption (max agents per org, max KEL events) in docs/architecture/. This issue's work is gated on a real org approaching that number — building the cache speculatively adds a correctness surface for no measured benefit.

Security constraint (non-negotiable)

Authority is decided by KEL replay, fail-closed. Any cache of KEL-derived state MUST be invalidated by KEL tip SAID (content-addressed), never by TTL. A TTL'd entry is a revocation-latency hole: a revoked agent would stay authorized until expiry. Cache key = (prefix, tip.said); every request does one cheap get_tip(); SAID mismatch → rebuild. This makes the cache correctness-neutral by construction.
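
To make the revocation-latency hole concrete, here is a minimal sketch contrasting the two invalidation policies. `CacheKey`, `ttl_would_serve`, and `tip_would_serve` are hypothetical names for illustration, and the SAIDs are plain strings standing in for the real type.

```rust
/// Hypothetical cache key: (prefix, tip.said). Strings stand in for the real types.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct CacheKey {
    pub prefix: String,
    pub tip_said: String,
}

/// TTL policy: the entry is served while it is young enough,
/// regardless of what the KEL says now.
pub fn ttl_would_serve(age_secs: u64, ttl_secs: u64) -> bool {
    age_secs < ttl_secs
}

/// Tip-SAID policy: the entry is served only if it was built from
/// exactly the current tip of this prefix.
pub fn tip_would_serve(cached: &CacheKey, current: &CacheKey) -> bool {
    cached == current
}
```

Scenario: a snapshot is cached at tip `ETip1`, then a revocation appends an event and the tip becomes `ETip2`. Five seconds later, `ttl_would_serve(5, 60)` is still true (the revoked agent stays authorized), while `tip_would_serve` is false and forces a rebuild.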

Implementation ladder

Step 1 — in-process, no infra (~40 lines). The control plane serves a single org (AppState::ensure_org pins one org_prefix), so the cache is one entry:

// on AppState
snapshot_cache: Arc<RwLock<Option<(Said /* tip */, Arc<OrgKelSnapshot>)>>>

Per request: get_tip() → if SAID matches, reuse the snapshot; else rebuild and swap. Hot-path cost drops from 3 full replays to 1 tip read. Update the budget constant + comment in kel_replay_count.rs deliberately (the test is designed to make this change visible).
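
A minimal sketch of that per-request flow, under stated assumptions: `Said` and `OrgKelSnapshot` below are placeholder types, `KelSource` is a hypothetical two-method slice of the registry (not the real RegistryBackend API), and `TipKeyedSlot` is an illustrative name, not the existing OrgSnapshotCache.

```rust
use std::sync::{Arc, RwLock};

// Placeholder types; the real ones live in the auths crates.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Said(pub String);

#[derive(Debug)]
pub struct OrgKelSnapshot {
    pub event_count: usize,
}

/// Hypothetical view of the registry: one cheap tip read, one full replay.
pub trait KelSource {
    fn get_tip(&self) -> Said;
    fn load_snapshot(&self) -> OrgKelSnapshot;
}

/// Single-entry cache: at most one (tip, snapshot) pair, no TTL.
#[derive(Default)]
pub struct TipKeyedSlot {
    slot: RwLock<Option<(Said, Arc<OrgKelSnapshot>)>>,
}

impl TipKeyedSlot {
    pub fn get_or_rebuild<S: KelSource>(&self, source: &S) -> Arc<OrgKelSnapshot> {
        // Always read the tip first; this is the only unconditional I/O.
        let tip = source.get_tip();
        {
            // Hot path: shared lock, SAID comparison, Arc clone.
            let guard = self.slot.read().expect("snapshot lock poisoned");
            if let Some((cached_tip, snapshot)) = guard.as_ref() {
                if *cached_tip == tip {
                    return Arc::clone(snapshot);
                }
            }
        }
        // Cold path: rebuild and swap. The entry is stored under the tip read
        // *before* the load, so a concurrent KEL append can only cause an
        // extra rebuild on the next request, never a stale hit.
        let fresh = Arc::new(source.load_snapshot());
        *self.slot.write().expect("snapshot lock poisoned") = Some((tip, Arc::clone(&fresh)));
        fresh
    }
}
```

Two concurrent cold requests may both rebuild in this sketch; that costs a duplicate replay but cannot serve stale state, which seems the right trade for ~40 lines.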

Step 2 — multi-process/multi-tenant only. Port the Tier 0 plumbing from the archived _archived/auths-cloud/crates/auths-cache/src/redis_cache.rs (bb8 pool, TierZeroCache trait shape; the current RegistryBackend::write_key_state docs still describe this Redis-Tier-0/Git-Tier-1 write-through). Two required changes when porting:

  • Replace set_ex TTL expiry with tip-SAID keying (see constraint above).
  • Cache OrgKelSnapshot (roster + events), not KeyState — the fleet replays are raw event walks, so KeyState caching removes none of them.

Explicitly do not import the archived stack's "Redis as source of truth" principle — that was the deleted bearer-token model; here the KEL is the source of truth and the cache is strictly subordinate.
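
For Step 2, one possible shape of the store interface after both required changes, sketched with an in-memory map standing in for Redis. `SnapshotStore` and `InMemoryStore` are hypothetical names; this is not the archived TierZeroCache trait, whose actual signatures are not reproduced here.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Placeholder types; the real ones live in the auths crates.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Said(pub String);

#[derive(Debug, PartialEq)]
pub struct OrgKelSnapshot {
    pub roster: Vec<String>,
    pub event_count: usize,
}

/// Hypothetical store interface: no TTL parameter anywhere, and the cached
/// value is the whole snapshot (roster + events), not KeyState.
pub trait SnapshotStore {
    fn get(&self, prefix: &str, tip: &Said) -> Option<Arc<OrgKelSnapshot>>;
    fn put(&self, prefix: &str, tip: Said, snapshot: Arc<OrgKelSnapshot>);
}

/// In-memory stand-in for the shared tier: one entry per org prefix.
#[derive(Default)]
pub struct InMemoryStore {
    entries: RwLock<HashMap<String, (Said, Arc<OrgKelSnapshot>)>>,
}

impl SnapshotStore for InMemoryStore {
    fn get(&self, prefix: &str, tip: &Said) -> Option<Arc<OrgKelSnapshot>> {
        let entries = self.entries.read().expect("store lock poisoned");
        // A hit requires the stored tip to equal the caller's current tip.
        let hit = match entries.get(prefix) {
            Some((cached_tip, snapshot)) if cached_tip == tip => Some(Arc::clone(snapshot)),
            _ => None,
        };
        hit
    }

    fn put(&self, prefix: &str, tip: Said, snapshot: Arc<OrgKelSnapshot>) {
        // Overwrite in place: the previous tip's entry is dropped, so memory
        // stays bounded at one snapshot per org without any expiry.
        self.entries
            .write()
            .expect("store lock poisoned")
            .insert(prefix.to_string(), (tip, snapshot));
    }
}
```

Keying the map by prefix and storing the tip in the value (rather than putting the SAID into the key string) is a design choice of this sketch: it avoids accumulating dead entries that a TTL would otherwise have cleaned up.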

Acceptance

  • Launch-scale assumption documented (agents/org, KEL events) with this issue linked
  • Step 1 only: single-entry tip-keyed snapshot cache on AppState; kel_replay_count.rs budget updated to 1 tip read + ≤1 full replay (cold) and still member-count-independent
  • A revocation test: revoke an agent, immediately list — the revoked agent reports revoked: true on the very next request (no stale window)
  • Step 2 tracked separately if/when multi-process serving lands
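
The revocation and budget criteria can be sketched against a toy model before wiring up the real backend. Everything here is hypothetical: `ToyKel`, `ToyCache`, and `Event` are illustrative, and the event count stands in for the tip SAID (a real tip is a content digest).

```rust
use std::cell::Cell;
use std::collections::BTreeMap;

#[derive(Clone, Debug)]
pub enum Event {
    Add(String),
    Revoke(String),
}

/// Toy append-only KEL that counts full replays, like the counting test double.
#[derive(Default)]
pub struct ToyKel {
    events: Vec<Event>,
    pub full_replays: Cell<usize>,
}

impl ToyKel {
    pub fn append(&mut self, event: Event) {
        self.events.push(event);
    }

    /// Stand-in for get_tip(): changes whenever an event is appended.
    pub fn tip(&self) -> usize {
        self.events.len()
    }

    /// Full replay: agent name -> revoked flag.
    pub fn replay(&self) -> BTreeMap<String, bool> {
        self.full_replays.set(self.full_replays.get() + 1);
        let mut roster = BTreeMap::new();
        for event in &self.events {
            match event {
                Event::Add(agent) => roster.insert(agent.clone(), false),
                Event::Revoke(agent) => roster.insert(agent.clone(), true),
            };
        }
        roster
    }
}

/// Single-entry, tip-keyed cache of the replayed roster.
#[derive(Default)]
pub struct ToyCache {
    slot: Option<(usize, BTreeMap<String, bool>)>,
}

impl ToyCache {
    pub fn list(&mut self, kel: &ToyKel) -> &BTreeMap<String, bool> {
        let tip = kel.tip();
        let stale = match &self.slot {
            Some((cached_tip, _)) => *cached_tip != tip,
            None => true,
        };
        if stale {
            self.slot = Some((tip, kel.replay()));
        }
        &self.slot.as_ref().expect("slot was just filled").1
    }
}

#[test]
fn revoked_agent_is_visible_on_next_list() {
    let mut kel = ToyKel::default();
    let mut cache = ToyCache::default();
    kel.append(Event::Add("agent-1".to_string()));

    assert!(!cache.list(&kel)["agent-1"]); // cold: one full replay
    assert!(!cache.list(&kel)["agent-1"]); // warm: tip read only
    assert_eq!(kel.full_replays.get(), 1);

    kel.append(Event::Revoke("agent-1".to_string()));
    assert!(cache.list(&kel)["agent-1"]); // no stale window
    assert_eq!(kel.full_replays.get(), 2);
}
```

The real test would presumably drive list_agents / list_fleet through the counting double on RegistryBackend::visit_events instead of this toy, but the assertions keep the same shape: revoked on the very next list, and at most one full replay per tip change.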

References

  • crates/auths-api/src/control_plane.rs (list_agents, list_fleet)
  • crates/auths-sdk/src/domains/org/delegation.rs (OrgKelSnapshot, OrgSnapshotCache)
  • crates/auths-api/tests/cases/kel_replay_count.rs (regression guard + budget constant)
  • docs/plans/audit.md — finding D1, done-signal 5, Open Question 5
  • _archived/auths-cloud/crates/auths-cache/ (salvage source for Step 2)
