oauth: H-2 — refresh-token reuse detection (gating mode) #106

Closed

BorisTyshkevich wants to merge 4 commits into main from feature/oauth-refresh-reuse-detection

Conversation

BorisTyshkevich (Collaborator) commented May 7, 2026

v1 scope (2026-05-09): deferred, not dropped. Product direction for v1 is pure resource-only OAuth — MCP advertises an external AS (Auth0 / Authentik / Keycloak) via RFC 9728 resource metadata; the AS owns DCR, /authorize, /token, refresh rotation, reuse detection, and revocation. claude.ai and ChatGPT perform DCR against the AS directly; MCP validates incoming AS-issued JWTs (signature + RFC 8707 audience byte-equality + expiry) and authorizes per-tool scopes.

Under that v1 architecture, the custom token machinery this issue/PR introduces is unnecessary — the AS handles it. The work here becomes load-bearing again the moment MCP re-enters the AS business: to add a role-picker consent screen (the headline UX from #107 that Auth0's stock consent doesn't provide), to support deployments whose IdP lacks DCR (Google direct, basic-tier Auth0), or to layer MCP-side grant lifecycle management on top of the AS (#108).

Treat this as v2/v3 follow-up: the design is sound, the security analysis stands, the implementation will be needed when v1's trade-offs (no consent-time role binding, AS-tier DCR requirement, AS-bound revocation latency) become product-relevant.


Summary

Closes #103.

H-2 from the OAuth security review: refresh-token reuse detection in gating mode. Embeds jti + family_id into MCP-issued refresh JWEs and persists consumed jtis + revoked families in two ClickHouse tables. When a previously-redeemed jti is replayed, the entire token family is invalidated and the user re-authenticates.
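For orientation, the claim set carried by the rotated refresh JWE looks roughly like this. A minimal sketch; the Go field names are illustrative, not the PR's actual identifiers:

```go
// Sketch only: the claim shape carried inside the refresh JWE.
type refreshClaims struct {
	JTI      string `json:"jti"`       // fresh per issuance; consumed on redemption
	FamilyID string `json:"family_id"` // fixed at code→token exchange; shared across the rotation chain
	Exp      int64  `json:"exp"`       // unchanged refresh TTL (default 30 days)
}
```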

This is the third of a 3-PR stack:

  1. #104 — oauth: spec-compliance hardening + HKDF + C-1 forward-mode validation
  2. #105 — oauth: H-1 — refuse to start when gating + cluster_secret without require_email_verified
  3. This PR (PR-C) — H-2 refresh-token reuse detection

Note on PR base: Stacked on top of #105 (which is stacked on #104). Until both predecessors merge, the diff shown is against feature/oauth-require-email-verified. Once they land, GitHub auto-rebases this onto main and the diff narrows to the H-2 commit alone.

Threat closed

Without H-2, a captured refresh JWE can be redeemed many times in parallel — each redemption mints a fresh access+refresh pair until the JWE's exp (default 30 days). An attacker who briefly captures a refresh token (leaked log, intermediate proxy, browser-extension compromise) gets a silent 30-day window of access against the legitimate user's identity. The legitimate user has no signal.

With H-2, the moment the legitimate client refreshes after the attacker (or vice-versa), the loser's jti is in the consumed-set, the family is revoked, both parties are rejected on subsequent attempts, and the user re-auths once. The auth event is a clear signal.

OAuth 2.1 §4.13.2 and MCP authorization 2025-11-25 require this. RFC 6749 §10.4 frames the same requirement at SHOULD level for OAuth 2.0.

Approach

| Claim | Lifecycle |
| --- | --- |
| jti | Fresh 16-byte random hex per issuance. Different every refresh. |
| family_id | Fresh 16-byte random hex at code→token exchange. Stable across the entire rotation chain. |
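Both claims are plain CSPRNG output. A minimal sketch of the generation; `newRandomHex` is a hypothetical helper name, not the PR's actual function:

```go
import (
	"crypto/rand"
	"encoding/hex"
)

// newRandomHex returns 16 random bytes as a 32-char hex string,
// the shape used for both jti and family_id above.
func newRandomHex() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}
```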

Two ClickHouse tables in a new altinity database:

| Table | Purpose | Lookup key |
| --- | --- | --- |
| altinity.oauth_refresh_consumed_jtis | Every redeemed refresh-token jti | jti |
| altinity.oauth_refresh_revoked_families | Families flagged after reuse detection | family_id |

Both tables are [Replicated]MergeTree with `TTL ... + INTERVAL 35 DAY` to bound storage. The Go code is engine-agnostic: SELECT and INSERT work the same on both engines.
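For concreteness, a hedged sketch of the consumed-jtis DDL in its single-node MergeTree flavor, kept as a Go constant. The authoritative DDL ships in docs/sql/oauth-state.sql; column names other than consumed_at are assumptions, and a later commit in this stack moves this table to KeeperMap:

```go
// Sketch of the DDL shape described above; see docs/sql/oauth-state.sql
// for the real statements. family_id as a column here is an assumption.
const consumedJtisDDL = `
CREATE TABLE IF NOT EXISTS altinity.oauth_refresh_consumed_jtis (
    jti         String,
    family_id   String,
    consumed_at DateTime DEFAULT now()
) ENGINE = MergeTree
ORDER BY jti
TTL consumed_at + INTERVAL 35 DAY`
```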

Per refresh: 1 combined SELECT count() (two subqueries) + 1 INSERT. At realistic load (~hundreds of clients, refresh ~1×/h) ≈ 0.5 qps cluster-wide. Lookups happen only on grant_type=refresh_token — never on regular MCP requests (those validate access tokens locally via HMAC).
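Under these assumptions (hypothetical function and column names), the per-refresh check sketches out as:

```go
import (
	"context"
	"database/sql"
	"errors"
)

var ErrRefreshReused = errors.New("refresh token reuse detected")

// checkAndConsume sketches the SELECT-then-INSERT flow described above:
// one combined count over both tables, then the consume INSERT. On a
// replay hit the handler also revokes the family. (Later commits in
// this PR replace this flow with an atomic KeeperMap claim.)
func checkAndConsume(ctx context.Context, db *sql.DB, jti, familyID string) error {
	var hits uint64
	err := db.QueryRowContext(ctx, `
		SELECT
		    (SELECT count() FROM altinity.oauth_refresh_consumed_jtis   WHERE jti = ?)
		  + (SELECT count() FROM altinity.oauth_refresh_revoked_families WHERE family_id = ?)`,
		jti, familyID,
	).Scan(&hits)
	if err != nil {
		return err // hard fail; the handler answers 500 server_error
	}
	if hits > 0 {
		return ErrRefreshReused // replayed jti or already-revoked family
	}
	_, err = db.ExecContext(ctx,
		`INSERT INTO altinity.oauth_refresh_consumed_jtis (jti, family_id) VALUES (?, ?)`,
		jti, familyID)
	return err
}
```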

Out of scope (intentionally):

  • Forward mode. Auth0 itself rotates and detects reuse upstream; our wrapper stays stateless to keep horizontal scaling cheap. Startup validation refuses mode: forward + the flag.
  • Single-node KV stores. EmbeddedRocksDB doesn't replicate across MCP pods → reuse window opens between instances. KeeperMap couples OAuth correctness to Keeper ensemble health. Plain [Replicated]MergeTree is the right tool at this query volume.

Operator prerequisite

Run docs/sql/oauth-state.sql (clustered or single-node flavor) as a CH admin user before flipping oauth.refresh_revokes_tracking: true in Helm values. The script creates the altinity database and both tables, and runs GRANT INSERT, SELECT ON altinity.* TO mcp_service.

When the flag is on:

  • Startup refuses to boot if mode: forward or clickhouse.read_only: true
  • The CH-side mcp_service user cannot have a READONLY=1 profile (operator's responsibility — verify with system.users / system.settings_profiles)

The cluster name all-replicated is the convention for a single logical shard spanning all replicas (sharding-compatible). Documented in docs/oauth-refresh-reuse-detection.md.

Legacy-token policy

Refresh tokens issued before deploy lack family_id/jti → rejected with invalid_grant on first redemption. Clients re-authenticate once. Auto-promotion was rejected because it would let a captured pre-deploy token be replayed exactly once before tracking starts.
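The gate itself is a two-field presence check. A sketch, reusing the refreshClaims shape from the summary above (the error value is illustrative):

```go
import "errors"

var errInvalidGrant = errors.New("invalid_grant")

// requireRotationClaims rejects pre-deploy tokens that carry no
// rotation claims; no auto-promotion, the client re-auths once.
func requireRotationClaims(c refreshClaims) error {
	if c.JTI == "" || c.FamilyID == "" {
		return errInvalidGrant
	}
	return nil
}
```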

Failure modes — hard fail with ERR

Every CH-state failure path returns HTTP 500 server_error and emits an ERR-level zerolog line. No silent fallthrough. Detailed table in docs/oauth-refresh-reuse-detection.md § Failure modes.
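The shape of that failure path, as a sketch (zerolog plus the standard OAuth server_error body; the function name and log message are illustrative):

```go
import (
	"net/http"

	"github.com/rs/zerolog/log"
)

// failRefresh: every CH-state failure becomes an ERR log plus HTTP 500
// server_error; a token pair is never minted on a failed state lookup.
func failRefresh(w http.ResponseWriter, err error) {
	log.Error().Err(err).Msg("oauth: refresh-state lookup failed; refusing to mint")
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusInternalServerError)
	_, _ = w.Write([]byte(`{"error":"server_error"}`))
}
```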

Live deployment

Running on otel-mcp.demo.altinity.cloud since this commit landed (image wip-oauth-h2-81b69b3):

  • ✅ Pod healthy, H-2 startup INF log fires
  • ✅ State tables created with INSERT, SELECT grants on mcp_service (NOT readonly profile)
  • ✅ Tool calls work end-to-end through claude.ai connector — currentUser() returns the impersonated email, confirming H-1 + H-2 + cluster_secret chain intact
  • ✅ Legacy-token rejection observed (the documented "Anthropic Proxy: Invalid content from server" surfacing when pre-H-2 connectors first hit the new endpoint)
  • ⏳ First consumed_jti row will land naturally when the otel connector's current 1h-TTL access token expires (token issued before H-2 deploy, exp ~17:28 UTC)

Test plan

  • pkg/oauth_state unit tests cover happy-path / replay-revokes-family / legacy-token-rejected / state-unreachable / config-rejection (forward+flag, read_only+flag, empty database+flag)
  • In-memory fake oauth_state.Store exercises the refresh-handler control flow without standing up a CH harness
  • Live deployment on otel-mcp end-to-end smoke
  • CI green on the full stacked diff
  • Reviewer eyes on the consumed-jti / revoked-family TTL window (35 days vs 30-day refresh-token TTL — buffer for clock skew + replay windows)
  • Negative replay test against /oauth/token with a captured refresh JWE (deferred — easy to drive once anyone wants to extract a refresh JWE from claude.ai's local cache)

See also

  • docs/oauth-refresh-reuse-detection.md — full design + threat model + DDL examples + grants + failure modes + rollback
  • docs/sql/oauth-state.sql — operator DDL (clustered + single-node flavors)
  • /Users/Workspaces/acm/mcp/.wiki/mcp-oauth-debugging.md § H-2 — operator wiki rollout checklist + per-deployment ops

🤖 Generated with Claude Code

BorisTyshkevich force-pushed the feature/oauth-require-email-verified branch from 804528d to 5302328 on May 7, 2026 at 17:38
Refs #103.

Embed jti + family_id into MCP-issued refresh JWEs in gating mode;
persist consumed jtis + revoked families in two ClickHouse tables in
the new `altinity` database; revoke the entire family when a previously
-redeemed jti is replayed. This closes the silent 30-day replay window
for captured gating-mode refresh tokens (OAuth 2.1 §4.13.2, MCP auth
2025-11-25).

Opt-in via oauth.refresh_revokes_tracking (default false). Operators
must run docs/sql/oauth-state.sql + GRANT INSERT,SELECT on altinity.*
to mcp_service before flipping the flag — startup refuses to boot if
the flag is on with mode: forward, clickhouse.read_only: true, or an
empty clickhouse.database.

Hard fail with ERR-level zerolog on every CH error path; never
silently mints a new pair when state lookup fails.

Legacy refresh tokens (lacking family_id/jti) are rejected with
invalid_grant on first redemption after deploy; clients re-auth once.
Auto-promotion was rejected because it would let a captured pre-deploy
token be replayed exactly once before the family starts tracking.

Forward mode is intentionally out of scope: Auth0 itself rotates and
detects reuse upstream, and our wrapper stays stateless to keep
horizontal scaling cheap.

Tests cover: HappyPath, ReplayRevokesFamily, LegacyTokenRejected,
StateUnreachable, plus three startup-validation rejections (forward
mode, read_only=true, empty database). All run with an in-memory
fake oauth_state.Store; live-CH SQL roundtrip is validated by the
otel-mcp deployment's negative replay test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BorisTyshkevich force-pushed the feature/oauth-refresh-reuse-detection branch from 6e6598c to 0f3318d on May 7, 2026 at 17:38
BorisTyshkevich and others added 3 commits May 8, 2026 13:44
Pre-merge external review of #106 (H-2) flagged that the previous
SELECT-then-INSERT in pkg/oauth_state/store.go is not race-safe: two
parallel redemptions of the same captured refresh JWE both observe
count()=0 on the SELECT, both succeed at the INSERT, and the family
forks before any reuse is detected. From that point an attacker who
stole one token has their own valid branch of the family chain and
never needs the original token again.

RFC 9700's refresh-token-rotation guidance requires an atomic
*consume-before-mint* claim: the second redeemer must never be able to
observe "not consumed" while the first redemption is still uncommitted;
that is the only model that prevents the parallel-replay fork. The fix
uses ClickHouse KeeperMap
with `keeper_map_strict_mode=1` — a Raft-linearizable KV that throws
a duplicate-key exception on collision. Concurrent INSERTs are
serialised through the Keeper leader; exactly one wins, the rest
receive `KEEPER_EXCEPTION: Transaction failed (Node exists)`
(Code 999). The losing pods record the family revocation and reject
the request.
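A sketch of the atomic claim under these semantics; the helper name is illustrative, and the error-string match follows the observed Keeper message described below:

```go
import (
	"context"
	"database/sql"
	"errors"
	"strings"
)

var ErrRefreshReused = errors.New("refresh token reuse detected")

// claimJTI: consume-before-mint. The SETTINGS clause must ride on the
// statement; table-level SETTINGS is silently ignored.
func claimJTI(ctx context.Context, db *sql.DB, jti, familyID string) error {
	_, err := db.ExecContext(ctx, `
		INSERT INTO altinity.oauth_refresh_consumed_jtis (jti, family_id)
		SETTINGS keeper_map_strict_mode = 1
		VALUES (?, ?)`,
		jti, familyID)
	if err != nil && strings.Contains(err.Error(), "Transaction failed (Node exists)") {
		return ErrRefreshReused // exactly one redemption wins; this one lost
	}
	return err
}
```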

The earlier H-2 plan rejected KeeperMap with the rationale "couples
OAuth correctness to Keeper". That rationale was wrong: ReplicatedMergeTree
is already on Keeper for replication coordination. KeeperMap doesn't
add coupling — it exposes the linearizable primitive Keeper already
provides.

Implementation:

* pkg/oauth_state/store.go reshape: cheap revoked-family check first
  (point lookup on ReplicatedMergeTree), then atomic INSERT into the
  KeeperMap consumed-jtis table with `SETTINGS keeper_map_strict_mode=1`
  on the statement (table-level SETTINGS is silently ignored — verified
  empirically against CH 26.1.6). On duplicate-key exception, record
  the family revocation best-effort and return ErrRefreshReused. The
  revoke-table INSERT is non-fatal: KeeperMap atomicity alone is what
  enforces single-claim; revoked_families is the audit trail.

* isKeeperMapDuplicateKeyError matches the exact Keeper error string
  ("Transaction failed (Node exists)") observed against 26.1.6 stock
  and 26.1.11.20001.altinityantalya. The phrase is specific to Keeper
  Raft create-if-absent rejections; other Keeper errors (timeout,
  leader-loss) produce different messages.

* New pkg/oauth_state/cleanup.go runs an in-process goroutine on an
  hourly ticker issuing `ALTER TABLE altinity.oauth_refresh_consumed_jtis
  DELETE WHERE consumed_at < now() - INTERVAL 35 DAY`. KeeperMap doesn't
  support CH-native TTL, so cleanup is application-side. Multi-pod
  deployments all run their own loops — duplicate ALTER DELETEs are
  harmless. Failures log at WARN, never panic; cleanup failure does not
  affect the security control (KeeperMap atomicity does). The loop's
  shape is sketched after this list.

* pkg/server/server.go starts the cleanup loop when
  RefreshRevokesTracking is enabled. serverCleanupRunner adapter reads
  s.refreshStateStore on each tick so SetRefreshStateStore (test-only)
  takes effect on the cleanup loop too.

* docs/sql/oauth-state.sql switches consumed_jtis to
  ENGINE = KeeperMap('/altinity_mcp/oauth_refresh_consumed_jtis')
  PRIMARY KEY jti. revoked_families stays ReplicatedMergeTree (INSERTs
  are idempotent for revocation; CH-native TTL handles cleanup). Adds
  ALTER DELETE to the mcp_service grants for the cleanup goroutine.

* docs/oauth-refresh-reuse-detection.md and the operator wiki
  (acm/mcp/.wiki/mcp-oauth-debugging.md) document the new
  keeper_map_path_prefix CH config drop-in prerequisite, the
  query-level SETTINGS load-bearing constraint, and the engine-mixed
  storage shape.
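The cleanup loop's shape, sketched under the same assumptions (runCleanupLoop is a hypothetical name; the real code lives in pkg/oauth_state/cleanup.go):

```go
import (
	"context"
	"database/sql"
	"time"

	"github.com/rs/zerolog/log"
)

// runCleanupLoop deletes expired consumed-jti rows hourly. KeeperMap has
// no CH-native TTL; duplicate ALTER DELETEs from other pods are harmless.
func runCleanupLoop(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if _, err := db.ExecContext(ctx, `
				ALTER TABLE altinity.oauth_refresh_consumed_jtis
				DELETE WHERE consumed_at < now() - INTERVAL 35 DAY`); err != nil {
				log.Warn().Err(err).Msg("oauth: consumed-jti cleanup failed") // WARN, never panic
			}
		}
	}
}
```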

Tests:

* TestOAuthRefreshReuseDetection_AtomicConcurrentClaim hammers the
  refresh handler with 50 goroutines redeeming the SAME refresh JWE
  in parallel via a barrier release. Asserts exactly 1 success and 49
  invalid_grant with reuse_detected. The fake store synchronises via
  sync.Mutex (faithful to KeeperMap's exactly-one-winner semantics).
  This is the regression test for the parallel-replay race; its barrier
  shape is sketched after this list.

* fakeRefreshStateStore extended with no-op Cleanup to satisfy the
  Store interface change.
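The barrier shape of the concurrency test, sketched; redeem stands in for the refresh handler under test and is not the PR's actual helper:

```go
import (
	"sync"
	"sync/atomic"
	"testing"
)

// redeem is hypothetical: wired to the refresh handler under test.
var redeem func(jwe string) error

func TestAtomicConcurrentClaim(t *testing.T) {
	const n = 50
	start := make(chan struct{})
	var wins int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // barrier: all n redemptions fire together
			if err := redeem("same-captured-refresh-jwe"); err == nil {
				atomic.AddInt64(&wins, 1)
			}
		}()
	}
	close(start) // release the barrier
	wg.Wait()
	if wins != 1 {
		t.Fatalf("want exactly 1 successful redemption, got %d", wins)
	}
}
```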

Verified against otel CH cluster 9572 during development:
* keeper_map_path_prefix added via acmctl setting create + push
* DROP+CREATE migrated consumed_jtis to KeeperMap
* GRANT INSERT, SELECT, ALTER DELETE applied
* Smoke INSERT-twice produced KEEPER_EXCEPTION as expected

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eplay)

Two scripts + README covering refresh-token reuse detection end-to-end
against a live deployment, complementing the unit test
TestOAuthRefreshReuseDetection_AtomicConcurrentClaim:

- scripts/qa/h2-replay-test.sh — drives the OAuth dance interactively
  (DCR + Auth0 browser login captured by an inline python http.server
  on localhost:8910), then exercises the sequential reuse-detection
  path: refresh R0 (expect 200 + R1), replay R0 (expect 400
  invalid_grant reuse_detected), bonus replay R1 (expect 400 too —
  family-wide revocation per RFC 9700).

- scripts/qa/h2-parallel-test.sh — same OAuth dance, then fires N
  concurrent redemptions of the same R0 via backgrounded curl + wait.
  Asserts exactly 1 × 200 OK + N-1 × 400 invalid_grant; non-zero exit
  on anomaly. This is the test that proves KeeperMap strict mode's
  atomicity: the SELECT-then-INSERT design would fail this; the new
  code shouldn't. Verified passing against otel-mcp during
  development.

- scripts/qa/README.md — when to run, prerequisites (jq + openssl +
  python3 + browser), state-table verification SQL, failure-mode
  triage.

Verified against otel-mcp (cd65134, oauth-h2-atomic image):
- replay-test → consumed_jtis row (R0's jti) + revoked_families row
  (family_id, reason=reuse_detected); R1 also rejected
- parallel-test (N=50) → exactly 1 × 200, 49 × 400 reuse_detected;
  consumed_jtis +1, revoked_families +1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…im re-check

Second-pass review of #106 flagged that the consumed-jti claim was made
atomic via KeeperMap, but the revoked-families table stayed
ReplicatedMergeTree and was checked on every refresh. ReplicatedMergeTree
replication is asynchronous: a recent revoke INSERT on pod A can be
invisible to pod B's read on a different CH replica for the duration of
the replication lag. That window lets the winner of a forked family keep
refreshing on
different pods/replicas while the revoke catches up — the same parallel-
replay window the KeeperMap consumed-jti claim was supposed to close.

Three fixes:

1. revoked_families becomes KeeperMap, no strict mode (idempotent
   overwrite — parallel losers writing the same family_id is fine).
   KeeperMap reads are linearizable through Keeper Raft, so a revoke
   write on pod A is visible to pod B's next pre-check regardless of CH
   replica routing.

2. Revoke INSERT failures hard-fail instead of WARN-and-return. The
   previous code logged WARN and returned ErrRefreshReused even if the
   revoke row didn't land — leaving the winning fork alive on every
   subsequent refresh. RFC 9700 family-wide revocation must be
   authoritative, not best-effort. On revoke INSERT failure we now
   return a generic error which the handler turns into HTTP 500
   server_error; operators see an ERR-level "SECURITY: reuse detected
   but revoke INSERT failed" log and page on the underlying Keeper or
   grant issue.

3. Post-claim revoked re-check. Step 1 (pre-check) and step 2 (atomic
   claim) are not a single Keeper transaction — there's a microsecond
   TOCTOU window where another pod can revoke between them. After a
   successful claim we re-check revocation; if the family was revoked
   during the window we return ErrRefreshReused (the consumed-jti slot
   is spent but no token is minted). This closes the residual window
   that the pre-check alone could not close, even with KeeperMap-backed
   revoked_families; the full sequence is sketched below.
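Putting the three fixes together, the redemption sequence sketches out as below. Store is a hypothetical narrowing of the PR's oauth_state.Store; method names are illustrative:

```go
import (
	"context"
	"errors"
	"fmt"
)

var ErrRefreshReused = errors.New("refresh token reuse detected")

// Store is a hypothetical narrowing of pkg/oauth_state's interface.
type Store interface {
	IsFamilyRevoked(ctx context.Context, familyID string) (bool, error)
	ClaimJTI(ctx context.Context, jti, familyID string) error // strict-mode KeeperMap INSERT
	RevokeFamily(ctx context.Context, familyID string) error
}

func redeemRefresh(ctx context.Context, s Store, jti, familyID string) error {
	// 1. Pre-check: linearizable KeeperMap read of revoked_families.
	revoked, err := s.IsFamilyRevoked(ctx, familyID)
	if err != nil {
		return err // handler answers 500 server_error
	}
	if revoked {
		return ErrRefreshReused
	}
	// 2. Atomic consume-before-mint claim of the jti.
	if err := s.ClaimJTI(ctx, jti, familyID); err != nil {
		if errors.Is(err, ErrRefreshReused) {
			// Reuse detected: the revoke must persist, or we hard-fail.
			if rerr := s.RevokeFamily(ctx, familyID); rerr != nil {
				return fmt.Errorf("SECURITY: reuse detected but revoke INSERT failed: %w", rerr)
			}
			return ErrRefreshReused
		}
		return err
	}
	// 3. Post-claim re-check closes the TOCTOU window between 1 and 2.
	if revoked, err = s.IsFamilyRevoked(ctx, familyID); err != nil {
		return err
	} else if revoked {
		return ErrRefreshReused // jti slot spent, but no token minted
	}
	return nil // safe to mint the new access+refresh pair
}
```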

Schema migration applied to otel CHI 9572: DROP +
ReplicatedMergeTree → CREATE + KeeperMap. 2 existing audit rows
discarded; the affected families' refresh tokens are still rejected
by the consumed-jti pre-check on the new path.

Cleanup goroutine now runs ALTER TABLE … DELETE on both KeeperMap
tables on the same hourly ticker. Failures on either are wrapped and
returned together so observability isn't masked.

Tests:

* TestOAuthRefreshReuseDetection_RevokeInsertFailureHardFails pins
  the new hard-fail behavior. The fake store grew a `failRevoke`
  knob: when armed, the duplicate-detected branch returns a generic
  error instead of ErrRefreshReused, simulating a Keeper write
  failure on the revoke. The handler must respond with 500
  server_error, NOT 400 invalid_grant — otherwise the regression
  silently restores the "winning fork stays alive" hole.

* Existing TestOAuthRefreshReuseDetection_AtomicConcurrentClaim
  (50 concurrent redeemers, exactly 1 winner) still passes 10× runs
  + race detector. The fake's reuse-detected branch now also
  defaults to recording the revoke before returning ErrRefreshReused,
  matching the new production semantics.

* Documentation updated in docs/oauth-refresh-reuse-detection.md
  ("Why KeeperMap (for both tables)", "Revoke must persist", "Post-
  claim revocation re-check") and the operator wiki to call out the
  KeeperMap-on-both-tables shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>