Control plane: cross-request OrgKelSnapshot cache (tip-SAID keyed), gated on a written fleet-scale target #254

## Context

The N+1 KEL replay in the list endpoints is fixed: `list_agents` / `list_fleet` now load one shared `OrgKelSnapshot` per request instead of one per member. The invariant is pinned by a counting test double on `RegistryBackend::visit_events` in `crates/auths-api/tests/cases/kel_replay_count.rs`: full org-KEL replays must be member-count-independent and stay within a small constant budget, currently 3 (the `list()` roster scan plus the roster walk and event walk in `OrgKelSnapshot::load`).
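As a rough illustration of how such a guard can work, the sketch below wraps a toy event source in a counter and checks that one listing costs the same number of replays regardless of member count. `EventSource`, `MemoryKel`, `CountingSource`, and `list_members` are hypothetical stand-ins, not the real `RegistryBackend` API, and the toy listing needs a single replay rather than the real budget of 3.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the slice of `RegistryBackend` the guard watches.
trait EventSource {
    /// Walks every event of the KEL for `prefix`, oldest first.
    fn visit_events(&self, prefix: &str, visit: &mut dyn FnMut(&str));
}

/// Toy in-memory KEL: one string per event, e.g. "add:agent0".
struct MemoryKel {
    events: Vec<String>,
}

impl EventSource for MemoryKel {
    fn visit_events(&self, _prefix: &str, visit: &mut dyn FnMut(&str)) {
        for event in &self.events {
            visit(event);
        }
    }
}

/// Test double: delegates to `inner` and counts each full replay.
struct CountingSource<S> {
    inner: S,
    replays: Cell<usize>,
}

impl<S: EventSource> EventSource for CountingSource<S> {
    fn visit_events(&self, prefix: &str, visit: &mut dyn FnMut(&str)) {
        self.replays.set(self.replays.get() + 1);
        self.inner.visit_events(prefix, visit);
    }
}

/// Shared-snapshot style listing: one replay yields the whole roster.
fn list_members(source: &dyn EventSource, prefix: &str) -> Vec<String> {
    let mut members = Vec::new();
    source.visit_events(prefix, &mut |event: &str| {
        if let Some(name) = event.strip_prefix("add:") {
            members.push(name.to_string());
        }
    });
    members
}

/// Builds a KEL with `n` members and reports how many replays one listing cost.
fn replays_for(n: usize) -> usize {
    let kel = MemoryKel {
        events: (0..n).map(|i| format!("add:agent{i}")).collect(),
    };
    let counting = CountingSource { inner: kel, replays: Cell::new(0) };
    let members = list_members(&counting, "org");
    assert_eq!(members.len(), n);
    counting.replays.get()
}
```

Comparing `replays_for(10)` with `replays_for(1000)` is the member-count-independence check; comparing either against a constant is the budget check.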

What remains is *cross-request* cost: every request still replays the org KEL from Git 3 times. Cost per request is O(KEL events), and KEL events grow with member churn (~2-3 events per add/revoke/rotate). Rough shape, assuming ~3 events per agent and 3 replays per request:

| Org scale | KEL events (approx) | Event reads per `list_fleet` request |
|---|---|---|
| 100 agents | ~300 | ~900 |
| 1,000 agents | ~3,000 | ~9,000 |
| 10,000 agents + churn | ~30,000 | ~90,000 |
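The table's arithmetic can be reproduced with a small model. It assumes ~3 KEL events per agent (the upper end of the per-operation estimate) and the 3 replays per request noted above; the constants and function names are illustrative only.

```rust
/// Full org-KEL replays per `list_fleet` request (the current test budget).
const REPLAYS_PER_REQUEST: u64 = 3;

/// Assumed KEL events per agent; the issue estimates ~2-3 per add/revoke/rotate.
const EVENTS_PER_AGENT: u64 = 3;

/// Approximate KEL length for an org with `agents` members.
fn approx_kel_events(agents: u64) -> u64 {
    agents * EVENTS_PER_AGENT
}

/// Event reads for one uncached request over a KEL of `kel_events` events.
fn event_reads_per_request(kel_events: u64) -> u64 {
    kel_events * REPLAYS_PER_REQUEST
}
```

For example, `event_reads_per_request(approx_kel_events(1_000))` gives 9,000, matching the middle row.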

## Decision needed first (do not skip)

Write down the launch-scale assumption (max agents per org, max KEL events) in `docs/architecture/`. This issue's work is **gated on a real org approaching that number** — building the cache speculatively adds a correctness surface for no measured benefit.

## Security constraint (non-negotiable)

Authority is decided by KEL replay, fail-closed. Any cache of KEL-derived state MUST be invalidated by **KEL tip SAID (content-addressed)**, never by TTL. A TTL'd entry is a revocation-latency hole: a revoked agent would stay authorized until expiry. Cache key = `(prefix, tip.said)`; every request does one cheap `get_tip()`; SAID mismatch → rebuild. This makes the cache correctness-neutral by construction.
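To make the difference concrete, here is a minimal sketch contrasting the two validity checks. `CachedEntry`, `ttl_valid`, and `tip_valid` are hypothetical names, and SAIDs are modelled as plain strings.

```rust
use std::time::{Duration, Instant};

/// Hypothetical cached authority entry.
struct CachedEntry {
    /// SAID of the KEL tip the entry was built from.
    tip_said: String,
    /// When the entry was built (only the TTL policy looks at this).
    built_at: Instant,
}

/// TTL policy (the rejected design): valid until the clock runs out,
/// no matter what has been appended to the KEL in the meantime.
fn ttl_valid(entry: &CachedEntry, now: Instant, ttl: Duration) -> bool {
    now.duration_since(entry.built_at) < ttl
}

/// Tip-SAID policy: valid only while the KEL tip is unchanged, so any
/// appended event (including a revocation) invalidates the entry at once.
fn tip_valid(entry: &CachedEntry, current_tip: &str) -> bool {
    entry.tip_said == current_tip
}
```

If a revocation moves the tip one second after the entry was built, `ttl_valid` with a 60-second TTL still returns `true` (the stale authorization is served), while `tip_valid` returns `false` and forces a rebuild.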

## Implementation ladder

**Step 1 — in-process, no infra (~40 lines).** The control plane serves a single org (`AppState::ensure_org` pins one `org_prefix`), so the cache is one entry:

```rust
// on AppState
snapshot_cache: Arc<RwLock<Option<(Said /* tip */, Arc<OrgKelSnapshot>)>>>
```

Per request: `get_tip()` → if SAID matches, reuse the snapshot; else rebuild and swap. Hot-path cost drops from 3 full replays to 1 tip read. Update the budget constant + comment in `kel_replay_count.rs` deliberately (the test is designed to make this change visible).
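A minimal sketch of that per-request flow, under stated assumptions: `Said` and `OrgKelSnapshot` are reduced to stand-ins, the lock is `std::sync::RwLock` (the real `AppState` may use an async lock), and `SnapshotCache` / `get_or_rebuild` are hypothetical names. The caller is expected to pass the result of a fresh `get_tip()` as `tip`.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for the real SAID type.
type Said = String;

/// Stand-in for the real snapshot (roster + events).
#[derive(Debug)]
struct OrgKelSnapshot {
    members: Vec<String>,
}

/// Single-entry cache: at most one (tip SAID, snapshot) pair.
struct SnapshotCache {
    slot: RwLock<Option<(Said, Arc<OrgKelSnapshot>)>>,
}

impl SnapshotCache {
    fn new() -> Self {
        Self { slot: RwLock::new(None) }
    }

    /// Returns the snapshot built at `tip`, calling `rebuild` only on a miss.
    fn get_or_rebuild<E, F>(&self, tip: &str, rebuild: F) -> Result<Arc<OrgKelSnapshot>, E>
    where
        F: FnOnce() -> Result<OrgKelSnapshot, E>,
    {
        // Hot path: shared lock, compare SAIDs, clone the Arc.
        {
            let slot = self.slot.read().expect("snapshot cache lock poisoned");
            if let Some((cached_tip, snapshot)) = slot.as_ref() {
                if cached_tip.as_str() == tip {
                    return Ok(Arc::clone(snapshot));
                }
            }
        }
        // Miss: replay outside the lock. On error nothing is stored and
        // the error propagates, so a failed replay never yields authority.
        let fresh = Arc::new(rebuild()?);
        let mut slot = self.slot.write().expect("snapshot cache lock poisoned");
        *slot = Some((tip.to_string(), Arc::clone(&fresh)));
        Ok(fresh)
    }
}
```

Two concurrent misses may both rebuild, and the slower one may overwrite the slot with an older tip. That costs an extra replay on the next request but cannot serve stale state, because every read compares the stored SAID against the caller's freshly read tip.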

**Step 2 — multi-process/multi-tenant only.** Port the Tier 0 plumbing from the archived `_archived/auths-cloud/crates/auths-cache/src/redis_cache.rs` (bb8 pool, `TierZeroCache` trait shape; the current `RegistryBackend::write_key_state` docs still describe this Redis-Tier-0/Git-Tier-1 write-through). Two required changes when porting:
- Replace `set_ex` TTL expiry with tip-SAID keying (see constraint above).
- Cache `OrgKelSnapshot` (roster + events), not `KeyState` — the fleet replays are raw event walks, so `KeyState` caching removes none of them.

Explicitly **do not** import the archived stack's "Redis as source of truth" principle — that was the deleted bearer-token model; here the KEL is the source of truth and the cache is strictly subordinate.
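For the shared-cache case, tip-SAID keying can be sketched as below. Everything here is an assumption for illustration: `SnapshotStore` is a hypothetical interface (not the archived `TierZeroCache` trait), `MemoryStore` stands in for a Redis-backed store, the `org-snapshot:{prefix}:{tip}` key layout is invented, and the snapshot is treated as opaque serialized bytes. The layout also assumes prefixes and SAIDs never contain `:`.

```rust
use std::collections::HashMap;

/// Hypothetical shared-cache interface.
trait SnapshotStore {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: String, value: Vec<u8>);
}

/// In-memory stand-in for a networked store.
#[derive(Default)]
struct MemoryStore {
    entries: HashMap<String, Vec<u8>>,
}

impl SnapshotStore for MemoryStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }

    fn put(&mut self, key: String, value: Vec<u8>) {
        self.entries.insert(key, value);
    }
}

/// The tip SAID is part of the key, so a request holding a newer tip
/// can never hit an entry built from an older KEL state.
fn snapshot_key(prefix: &str, tip_said: &str) -> String {
    format!("org-snapshot:{prefix}:{tip_said}")
}

/// Returns the serialized snapshot for (`prefix`, `tip_said`), calling
/// `rebuild` (a KEL replay in the real system) only on a miss.
fn load_snapshot_bytes<F>(
    store: &mut dyn SnapshotStore,
    prefix: &str,
    tip_said: &str,
    rebuild: F,
) -> Vec<u8>
where
    F: FnOnce() -> Vec<u8>,
{
    let key = snapshot_key(prefix, tip_said);
    if let Some(bytes) = store.get(&key) {
        return bytes;
    }
    let bytes = rebuild();
    store.put(key, bytes.clone());
    bytes
}
```

Entries for superseded tips are never read again under this scheme; how and when to evict them is left out of the sketch.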

## Acceptance

- [ ] Launch-scale assumption documented (agents/org, KEL events) with this issue linked
- [ ] Step 1 only: single-entry tip-keyed snapshot cache on `AppState`; `kel_replay_count.rs` budget updated to 1 tip read + ≤1 full replay (cold) and still member-count-independent
- [ ] A revocation test: revoke an agent, immediately list — the revoked agent reports `revoked: true` on the very next request (no stale window)
- [ ] Step 2 tracked separately if/when multi-process serving lands

## References

- `crates/auths-api/src/control_plane.rs` (`list_agents`, `list_fleet`)
- `crates/auths-sdk/src/domains/org/delegation.rs` (`OrgKelSnapshot`, `OrgSnapshotCache`)
- `crates/auths-api/tests/cases/kel_replay_count.rs` (regression guard + budget constant)
- `docs/plans/audit.md` — finding D1, done-signal 5, Open Question 5
- `_archived/auths-cloud/crates/auths-cache/` (salvage source for Step 2)
