Context
The N+1 KEL replay in the list endpoints is fixed: list_agents / list_fleet now load one shared OrgKelSnapshot per request instead of one per member, and the invariant is pinned by a counting test double on RegistryBackend::visit_events (crates/auths-api/tests/cases/kel_replay_count.rs — full org-KEL replays must be member-count-independent and within a small constant budget, currently 3: the list() roster scan plus OrgKelSnapshot::load's roster + event walk).
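The counting-double idea can be illustrated with a minimal sketch. The `EventSource` trait, `CountingSource`, and `list_members` are hypothetical stand-ins, not the real `RegistryBackend` API or the actual test in `kel_replay_count.rs`; the point is only that the replay count is asserted to be independent of roster size.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the registry backend; only the replay entry point is modeled.
trait EventSource {
    /// Visits every event in the KEL for `prefix`, in order.
    fn visit_events(&self, prefix: &str, visit: &mut dyn FnMut(&str));
}

/// Test double that counts full replays while serving a fixed event list.
struct CountingSource {
    events: Vec<String>,
    replays: Cell<usize>,
}

impl EventSource for CountingSource {
    fn visit_events(&self, _prefix: &str, visit: &mut dyn FnMut(&str)) {
        self.replays.set(self.replays.get() + 1);
        for event in &self.events {
            visit(event);
        }
    }
}

/// Toy list handler: one shared walk for the whole roster, not one walk per member.
fn list_members(source: &dyn EventSource, prefix: &str) -> Vec<String> {
    let mut members = Vec::new();
    source.visit_events(prefix, &mut |event: &str| {
        if let Some(name) = event.strip_prefix("add:") {
            members.push(name.to_string());
        }
    });
    members
}

/// Number of full replays one list call costs for an org of `n` members.
fn replays_for(n: usize) -> usize {
    let source = CountingSource {
        events: (0..n).map(|i| format!("add:agent-{}", i)).collect(),
        replays: Cell::new(0),
    };
    let listed = list_members(&source, "org");
    assert_eq!(listed.len(), n);
    source.replays.get()
}
```

A guard in this style compares `replays_for` at two very different roster sizes and checks both against the constant budget.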
What remains is cross-request cost: every request still replays the org KEL from Git 3 times. Cost per request is O(KEL events), and KEL events grow with member churn (~2-3 events per add/revoke/rotate). Rough shape:
| Org scale | KEL events (approx) | Event reads per list_fleet request |
| --- | --- | --- |
| 100 agents | ~300 | ~900 |
| 1,000 agents | ~3,000 | ~9,000 |
| 10,000 agents + churn | ~30,000 | ~90,000 |
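The arithmetic behind these rows is simple enough to sketch. The function names are illustrative, and 3 events per agent is an assumption taken from the upper end of the ~2-3 estimate above.

```rust
/// Full org-KEL replays per list request today (the budget pinned by the regression test).
const REPLAYS_PER_REQUEST: u64 = 3;

/// Approximate KEL length, assuming a fixed number of events per agent over its lifetime.
fn estimated_kel_events(agents: u64, events_per_agent: u64) -> u64 {
    agents * events_per_agent
}

/// Event reads for one uncached list_fleet request: every replay walks the whole KEL.
fn event_reads_per_request(kel_events: u64) -> u64 {
    REPLAYS_PER_REQUEST * kel_events
}
```

For example, `event_reads_per_request(estimated_kel_events(1_000, 3))` gives 9,000, matching the middle row.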
Decision needed first (do not skip)
Write down the launch-scale assumption (max agents per org, max KEL events) in docs/architecture/. This issue's work is gated on a real org approaching that number — building the cache speculatively adds a correctness surface for no measured benefit.
Security constraint (non-negotiable)
Authority is decided by KEL replay, fail-closed. Any cache of KEL-derived state MUST be invalidated by KEL tip SAID (content-addressed), never by TTL. A TTL'd entry is a revocation-latency hole: a revoked agent would stay authorized until expiry. Cache key = (prefix, tip.said); every request does one cheap get_tip(); SAID mismatch → rebuild. This makes the cache correctness-neutral by construction.
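To make the revocation-latency hole concrete, here is a small sketch contrasting the two freshness checks. `CachedSnapshot` and both functions are hypothetical and exist only to illustrate the constraint; SAIDs are modeled as plain strings.

```rust
use std::time::{Duration, Instant};

/// Hypothetical cache entry: the tip SAID it was built from and when it was stored.
struct CachedSnapshot {
    built_from_tip: String,
    stored_at: Instant,
}

/// TTL check (the pattern to avoid): freshness depends only on wall-clock age.
fn fresh_by_ttl(entry: &CachedSnapshot, now: Instant, ttl: Duration) -> bool {
    now.duration_since(entry.stored_at) < ttl
}

/// Tip-SAID check: the entry is usable only if the KEL tip has not moved.
fn fresh_by_tip(entry: &CachedSnapshot, current_tip: &str) -> bool {
    entry.built_from_tip == current_tip
}
```

If a revocation event lands five seconds after the entry was stored, the tip SAID changes. With a 60-second TTL, `fresh_by_ttl` keeps returning true and the revoked agent stays authorized, while `fresh_by_tip` rejects the entry on the very next request.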
Implementation ladder
Step 1 — in-process, no infra (~40 lines). The control plane serves a single org (AppState::ensure_org pins one org_prefix), so the cache is one entry:
// on AppState
snapshot_cache: Arc<RwLock<Option<(Said /* tip */, Arc<OrgKelSnapshot>)>>>
Per request: get_tip() → if SAID matches, reuse the snapshot; else rebuild and swap. Hot-path cost drops from 3 full replays to 1 tip read. Update the budget constant + comment in kel_replay_count.rs deliberately (the test is designed to make this change visible).
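A minimal sketch of the per-request flow, assuming placeholder types: `Said` is modeled as a `String`, `OrgKelSnapshot` as a bare roster, and `KelSource` is a hypothetical trait standing in for whatever the backend actually exposes for the tip read and the snapshot load.

```rust
use std::sync::{Arc, RwLock};

/// Placeholder types; the real Said and OrgKelSnapshot live in the auths crates.
type Said = String;

#[derive(Debug)]
struct OrgKelSnapshot {
    roster: Vec<String>,
}

/// Hypothetical view of the backend: a cheap tip read and an expensive full load.
trait KelSource {
    fn get_tip(&self) -> Said;
    fn load_snapshot(&self) -> OrgKelSnapshot;
}

/// Single-entry cache: the control plane serves one org, so one slot is enough.
#[derive(Default)]
struct SnapshotCache {
    slot: RwLock<Option<(Said, Arc<OrgKelSnapshot>)>>,
}

impl SnapshotCache {
    fn get(&self, source: &dyn KelSource) -> Arc<OrgKelSnapshot> {
        // Every request pays for exactly one tip read.
        let tip = source.get_tip();
        {
            // Fast path: shared read lock, reuse the snapshot if the tip is unchanged.
            let guard = self.slot.read().unwrap();
            if let Some((cached_tip, snapshot)) = guard.as_ref() {
                if *cached_tip == tip {
                    return Arc::clone(snapshot);
                }
            }
        }
        // Cold or stale: rebuild from the KEL and swap the entry in.
        let fresh = Arc::new(source.load_snapshot());
        *self.slot.write().unwrap() = Some((tip, Arc::clone(&fresh)));
        fresh
    }
}
```

In this sketch, if the KEL advances between the tip read and the load, the newer snapshot is stored under the older tip. The next request then sees a mismatch and rebuilds, so the race costs one extra replay rather than a stale answer.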
Step 2 — multi-process/multi-tenant only. Port the Tier 0 plumbing from the archived _archived/auths-cloud/crates/auths-cache/src/redis_cache.rs (bb8 pool, TierZeroCache trait shape; the current RegistryBackend::write_key_state docs still describe this Redis-Tier-0/Git-Tier-1 write-through). Two required changes when porting:
- Replace set_ex TTL expiry with tip-SAID keying (see constraint above).
- Cache OrgKelSnapshot (roster + events), not KeyState — the fleet replays are raw event walks, so KeyState caching removes none of them.
Explicitly do not import the archived stack's "Redis as source of truth" principle — that was the deleted bearer-token model; here the KEL is the source of truth and the cache is strictly subordinate.
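One possible shape for the ported cache is sketched below, under loud assumptions: `SnapshotStore` is a hypothetical trait (the archived `TierZeroCache` trait may look different), `MemoryStore` stands in for the Redis-backed implementation, the key format is invented for illustration, and snapshots are treated as opaque bytes.

```rust
use std::collections::HashMap;

/// Hypothetical shared-cache interface; the archived TierZeroCache trait may differ.
trait SnapshotStore {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: String, bytes: Vec<u8>);
}

/// The key embeds the tip SAID, so a new KEL event changes the key
/// and entries built from an older tip simply stop matching.
fn snapshot_key(org_prefix: &str, tip_said: &str) -> String {
    format!("org-snapshot:{}:{}", org_prefix, tip_said)
}

/// In-memory stand-in for the Redis-backed store.
#[derive(Default)]
struct MemoryStore {
    entries: HashMap<String, Vec<u8>>,
}

impl SnapshotStore for MemoryStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }
    fn put(&mut self, key: String, bytes: Vec<u8>) {
        self.entries.insert(key, bytes);
    }
}

/// The cache is subordinate: on a miss the value comes from KEL replay
/// and is only then written back to the store.
fn load_snapshot_bytes(
    store: &mut dyn SnapshotStore,
    org_prefix: &str,
    tip_said: &str,
    replay_kel: impl FnOnce() -> Vec<u8>,
) -> Vec<u8> {
    let key = snapshot_key(org_prefix, tip_said);
    if let Some(bytes) = store.get(&key) {
        return bytes;
    }
    let bytes = replay_kel();
    store.put(key, bytes.clone());
    bytes
}
```

The caller passes the tip SAID it just read, so a store hit can never outlive a KEL change. The store is never consulted for authority on its own.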
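Acceptance
- … AppState; kel_replay_count.rs budget updated to 1 tip read + ≤1 full replay (cold) and still member-count-independent
- … revoked: true on the very next request (no stale window)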
References
crates/auths-api/src/control_plane.rs (list_agents, list_fleet)
crates/auths-sdk/src/domains/org/delegation.rs (OrgKelSnapshot, OrgSnapshotCache)
crates/auths-api/tests/cases/kel_replay_count.rs (regression guard + budget constant)
docs/plans/audit.md — finding D1, done-signal 5, Open Question 5
_archived/auths-cloud/crates/auths-cache/ (salvage source for Step 2)