oauth: H-2 — refresh-token reuse detection (gating mode) #106

Closed

BorisTyshkevich wants to merge 4 commits into main from feature/oauth-refresh-reuse-detection

Conversation

BorisTyshkevich (Collaborator) commented May 7, 2026

v1 scope (2026-05-09): deferred, not dropped. Product direction for v1 is pure resource-only OAuth — MCP advertises an external AS (Auth0 / Authentik / Keycloak) via RFC 9728 resource metadata; the AS owns DCR, /authorize, /token, refresh rotation, reuse detection, and revocation. claude.ai and ChatGPT perform DCR against the AS directly; MCP validates incoming AS-issued JWTs (signature + RFC 8707 audience byte-equality + expiry) and authorizes per-tool scopes.

Under that v1 architecture, the custom token machinery this issue/PR introduces is unnecessary — the AS handles it. The work here becomes load-bearing again the moment MCP re-enters the AS business: to add a role-picker consent screen (the headline UX from #107 that Auth0's stock consent doesn't provide), to support deployments whose IdP lacks DCR (Google direct, basic-tier Auth0), or to layer MCP-side grant lifecycle management on top of the AS (#108).

Treat this as v2/v3 follow-up: the design is sound, the security analysis stands, the implementation will be needed when v1's trade-offs (no consent-time role binding, AS-tier DCR requirement, AS-bound revocation latency) become product-relevant.


Summary

Closes #103.

H-2 from the OAuth security review: refresh-token reuse detection in gating mode. Embeds jti + family_id into MCP-issued refresh JWEs and persists consumed jtis + revoked families in two ClickHouse tables. When a previously-redeemed jti is replayed, the entire token family is invalidated and the user re-authenticates.
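For orientation, the claim set carried by the rotated refresh JWE looks roughly like this. A minimal sketch; the Go field names are illustrative, not the PR's actual identifiers:

```go
// Sketch only: the claim shape carried inside the refresh JWE.
type refreshClaims struct {
	JTI      string `json:"jti"`       // fresh per issuance; consumed on redemption
	FamilyID string `json:"family_id"` // fixed at code→token exchange; shared across the rotation chain
	Exp      int64  `json:"exp"`       // unchanged refresh TTL (default 30 days)
}
```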

This is the third of a 3-PR stack:

  1. #104 — oauth: spec-compliance hardening + HKDF + C-1 forward-mode validation
  2. #105 — oauth: H-1 — refuse to start when gating + cluster_secret without require_email_verified
  3. This PR (PR-C) — H-2 refresh-token reuse detection

Note on PR base: Stacked on top of #105 (which is stacked on #104). Until both predecessors merge, the diff shown is against feature/oauth-require-email-verified. Once they land, GitHub auto-rebases this onto main and the diff narrows to the H-2 commit alone.

Threat closed

Without H-2, a captured refresh JWE can be redeemed many times in parallel — each redemption mints a fresh access+refresh pair until the JWE's exp (default 30 days). An attacker who briefly captures a refresh token (leaked log, intermediate proxy, browser-extension compromise) gets a silent 30-day window of access against the legitimate user's identity. The legitimate user has no signal.

With H-2, the moment the legitimate client refreshes after the attacker (or vice-versa), the loser's jti is in the consumed-set, the family is revoked, both parties are rejected on subsequent attempts, and the user re-auths once. The auth event is a clear signal.

OAuth 2.1 §4.13.2 and MCP authorization 2025-11-25 require this. RFC 6749 §10.4 frames the same requirement at SHOULD level for OAuth 2.0.

Approach

| Claim | Lifecycle |
| --- | --- |
| jti | Fresh 16-byte random hex per issuance. Different every refresh. |
| family_id | Fresh 16-byte random hex at code→token exchange. Stable across the entire rotation chain. |
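Both claims are plain CSPRNG output. A minimal sketch of the generation; `newRandomHex` is a hypothetical helper name, not the PR's actual function:

```go
import (
	"crypto/rand"
	"encoding/hex"
)

// newRandomHex returns 16 random bytes as a 32-char hex string,
// the shape used for both jti and family_id above.
func newRandomHex() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}
```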

Two ClickHouse tables in a new altinity database:

| Table | Purpose | Lookup key |
| --- | --- | --- |
| altinity.oauth_refresh_consumed_jtis | Every redeemed refresh-token jti | jti |
| altinity.oauth_refresh_revoked_families | Families flagged after reuse detection | family_id |

Both tables are [Replicated]MergeTree with `TTL ... + INTERVAL 35 DAY` to bound storage. The Go code is engine-agnostic: SELECT and INSERT work the same on both engines.
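For concreteness, a hedged sketch of the consumed-jtis DDL in its single-node MergeTree flavor, kept as a Go constant. The authoritative DDL ships in docs/sql/oauth-state.sql; column names other than consumed_at are assumptions, and a later commit in this stack moves this table to KeeperMap:

```go
// Sketch of the DDL shape described above; see docs/sql/oauth-state.sql
// for the real statements. family_id as a column here is an assumption.
const consumedJtisDDL = `
CREATE TABLE IF NOT EXISTS altinity.oauth_refresh_consumed_jtis (
    jti         String,
    family_id   String,
    consumed_at DateTime DEFAULT now()
) ENGINE = MergeTree
ORDER BY jti
TTL consumed_at + INTERVAL 35 DAY`
```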

Per refresh: 1 combined SELECT count() (two subqueries) + 1 INSERT. At realistic load (~hundreds of clients, refresh ~1×/h) ≈ 0.5 qps cluster-wide. Lookups happen only on grant_type=refresh_token — never on regular MCP requests (those validate access tokens locally via HMAC).
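Under these assumptions (hypothetical function and column names), the per-refresh check sketches out as:

```go
import (
	"context"
	"database/sql"
	"errors"
)

var ErrRefreshReused = errors.New("refresh token reuse detected")

// checkAndConsume sketches the SELECT-then-INSERT flow described above:
// one combined count over both tables, then the consume INSERT. On a
// replay hit the handler also revokes the family. (Later commits in
// this PR replace this flow with an atomic KeeperMap claim.)
func checkAndConsume(ctx context.Context, db *sql.DB, jti, familyID string) error {
	var hits uint64
	err := db.QueryRowContext(ctx, `
		SELECT
		    (SELECT count() FROM altinity.oauth_refresh_consumed_jtis   WHERE jti = ?)
		  + (SELECT count() FROM altinity.oauth_refresh_revoked_families WHERE family_id = ?)`,
		jti, familyID,
	).Scan(&hits)
	if err != nil {
		return err // hard fail; the handler answers 500 server_error
	}
	if hits > 0 {
		return ErrRefreshReused // replayed jti or already-revoked family
	}
	_, err = db.ExecContext(ctx,
		`INSERT INTO altinity.oauth_refresh_consumed_jtis (jti, family_id) VALUES (?, ?)`,
		jti, familyID)
	return err
}
```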

Out of scope (intentionally):

  • Forward mode. Auth0 itself rotates and detects reuse upstream; our wrapper stays stateless to keep horizontal scaling cheap. Startup validation refuses mode: forward + the flag.
  • Single-node KV stores. EmbeddedRocksDB doesn't replicate across MCP pods → reuse window opens between instances. KeeperMap couples OAuth correctness to Keeper ensemble health. Plain [Replicated]MergeTree is the right tool at this query volume.

Operator prerequisite

Run docs/sql/oauth-state.sql (clustered or single-node flavor) as a CH admin user before flipping oauth.refresh_revokes_tracking: true in Helm values. The script creates the altinity database and both tables, and runs GRANT INSERT, SELECT ON altinity.* TO mcp_service.

When the flag is on:

  • Startup refuses to boot if mode: forward or clickhouse.read_only: true
  • The CH-side mcp_service user cannot have a READONLY=1 profile (operator's responsibility — verify with system.users / system.settings_profiles)

The cluster name all-replicated is the convention for a single logical shard spanning all replicas (sharding-compatible). Documented in docs/oauth-refresh-reuse-detection.md.

Legacy-token policy

Refresh tokens issued before deploy lack family_id/jti → rejected with invalid_grant on first redemption. Clients re-authenticate once. Auto-promotion was rejected because it would let a captured pre-deploy token be replayed exactly once before tracking starts.
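The gate itself is a two-field presence check. A sketch, reusing the refreshClaims shape from the summary above (the error value is illustrative):

```go
import "errors"

var errInvalidGrant = errors.New("invalid_grant")

// requireRotationClaims rejects pre-deploy tokens that carry no
// rotation claims; no auto-promotion, the client re-auths once.
func requireRotationClaims(c refreshClaims) error {
	if c.JTI == "" || c.FamilyID == "" {
		return errInvalidGrant
	}
	return nil
}
```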

Failure modes — hard fail with ERR

Every CH-state failure path returns HTTP 500 server_error and emits an ERR-level zerolog line. No silent fallthrough. Detailed table in docs/oauth-refresh-reuse-detection.md § Failure modes.
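The shape of that failure path, as a sketch (zerolog plus the standard OAuth server_error body; the function name and log message are illustrative):

```go
import (
	"net/http"

	"github.com/rs/zerolog/log"
)

// failRefresh: every CH-state failure becomes an ERR log plus HTTP 500
// server_error; a token pair is never minted on a failed state lookup.
func failRefresh(w http.ResponseWriter, err error) {
	log.Error().Err(err).Msg("oauth: refresh-state lookup failed; refusing to mint")
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusInternalServerError)
	_, _ = w.Write([]byte(`{"error":"server_error"}`))
}
```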

Live deployment

Running on otel-mcp.demo.altinity.cloud since this commit landed (image wip-oauth-h2-81b69b3):

  • ✅ Pod healthy, H-2 startup INF log fires
  • ✅ State tables created with INSERT, SELECT grants on mcp_service (NOT readonly profile)
  • ✅ Tool calls work end-to-end through claude.ai connector — currentUser() returns the impersonated email, confirming H-1 + H-2 + cluster_secret chain intact
  • ✅ Legacy-token rejection observed (the documented "Anthropic Proxy: Invalid content from server" surfacing when pre-H-2 connectors first hit the new endpoint)
  • ⏳ First consumed_jti row will land naturally when the otel connector's current 1h-TTL access token expires (token issued before H-2 deploy, exp ~17:28 UTC)

Test plan

  • pkg/oauth_state unit tests cover happy-path / replay-revokes-family / legacy-token-rejected / state-unreachable / config-rejection (forward+flag, read_only+flag, empty database+flag)
  • In-memory fake oauth_state.Store exercises the refresh-handler control flow without standing up a CH harness
  • Live deployment on otel-mcp end-to-end smoke
  • CI green on the full stacked diff
  • Reviewer eyes on the consumed-jti / revoked-family TTL window (35 days vs 30-day refresh-token TTL — buffer for clock skew + replay windows)
  • Negative replay test against /oauth/token with a captured refresh JWE (deferred — easy to drive once anyone wants to extract a refresh JWE from claude.ai's local cache)

See also

  • docs/oauth-refresh-reuse-detection.md — full design + threat model + DDL examples + grants + failure modes + rollback
  • docs/sql/oauth-state.sql — operator DDL (clustered + single-node flavors)
  • /Users/Workspaces/acm/mcp/.wiki/mcp-oauth-debugging.md § H-2 — operator wiki rollout checklist + per-deployment ops

🤖 Generated with Claude Code

BorisTyshkevich force-pushed the feature/oauth-require-email-verified branch from 804528d to 5302328 on May 7, 2026 at 17:38
Refs #103.

Embed jti + family_id into MCP-issued refresh JWEs in gating mode;
persist consumed jtis + revoked families in two ClickHouse tables in
the new `altinity` database; revoke the entire family when a previously
-redeemed jti is replayed. This closes the silent 30-day replay window
for captured gating-mode refresh tokens (OAuth 2.1 §4.13.2, MCP auth
2025-11-25).

Opt-in via oauth.refresh_revokes_tracking (default false). Operators
must run docs/sql/oauth-state.sql + GRANT INSERT,SELECT on altinity.*
to mcp_service before flipping the flag — startup refuses to boot if
the flag is on with mode: forward, clickhouse.read_only: true, or an
empty clickhouse.database.

Hard fail with ERR-level zerolog on every CH error path; never
silently mints a new pair when state lookup fails.

Legacy refresh tokens (lacking family_id/jti) are rejected with
invalid_grant on first redemption after deploy; clients re-auth once.
Auto-promotion was rejected because it would let a captured pre-deploy
token be replayed exactly once before the family starts tracking.

Forward mode is intentionally out of scope: Auth0 itself rotates and
detects reuse upstream, and our wrapper stays stateless to keep
horizontal scaling cheap.

Tests cover: HappyPath, ReplayRevokesFamily, LegacyTokenRejected,
StateUnreachable, plus three startup-validation rejections (forward
mode, read_only=true, empty database). All run with an in-memory
fake oauth_state.Store; live-CH SQL roundtrip is validated by the
otel-mcp deployment's negative replay test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BorisTyshkevich force-pushed the feature/oauth-refresh-reuse-detection branch from 6e6598c to 0f3318d on May 7, 2026 at 17:38
BorisTyshkevich and others added 3 commits May 8, 2026 13:44
Pre-merge external review of #106 (H-2) flagged that the previous
SELECT-then-INSERT in pkg/oauth_state/store.go is not race-safe: two
parallel redemptions of the same captured refresh JWE both observe
count()=0 on the SELECT, both succeed at the INSERT, and the family
forks before any reuse is detected. From that point an attacker who
stole one token has their own valid branch of the family chain and
never needs the original token again.

RFC 9700's refresh-token-rotation guidance requires an atomic
*consume-before-mint* claim: the second redeemer must never be able to
observe "not consumed" while the first redemption is still uncommitted;
that is the only model that prevents the parallel-replay fork. The fix
uses ClickHouse KeeperMap
with `keeper_map_strict_mode=1` — a Raft-linearizable KV that throws
a duplicate-key exception on collision. Concurrent INSERTs are
serialised through the Keeper leader; exactly one wins, the rest
receive `KEEPER_EXCEPTION: Transaction failed (Node exists)`
(Code 999). The losing pods record the family revocation and reject
the request.
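A sketch of the atomic claim under these semantics; the helper name is illustrative, and the error-string match follows the observed Keeper message described below:

```go
import (
	"context"
	"database/sql"
	"errors"
	"strings"
)

var ErrRefreshReused = errors.New("refresh token reuse detected")

// claimJTI: consume-before-mint. The SETTINGS clause must ride on the
// statement; table-level SETTINGS is silently ignored.
func claimJTI(ctx context.Context, db *sql.DB, jti, familyID string) error {
	_, err := db.ExecContext(ctx, `
		INSERT INTO altinity.oauth_refresh_consumed_jtis (jti, family_id)
		SETTINGS keeper_map_strict_mode = 1
		VALUES (?, ?)`,
		jti, familyID)
	if err != nil && strings.Contains(err.Error(), "Transaction failed (Node exists)") {
		return ErrRefreshReused // exactly one redemption wins; this one lost
	}
	return err
}
```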

The earlier H-2 plan rejected KeeperMap with the rationale "couples
OAuth correctness to Keeper". That rationale was wrong: ReplicatedMergeTree
is already on Keeper for replication coordination. KeeperMap doesn't
add coupling — it exposes the linearizable primitive Keeper already
provides.

Implementation:

* pkg/oauth_state/store.go reshape: cheap revoked-family check first
  (point lookup on ReplicatedMergeTree), then atomic INSERT into the
  KeeperMap consumed-jtis table with `SETTINGS keeper_map_strict_mode=1`
  on the statement (table-level SETTINGS is silently ignored — verified
  empirically against CH 26.1.6). On duplicate-key exception, record
  the family revocation best-effort and return ErrRefreshReused. The
  revoke-table INSERT is non-fatal: KeeperMap atomicity alone is what
  enforces single-claim; revoked_families is the audit trail.

* isKeeperMapDuplicateKeyError matches the exact Keeper error string
  ("Transaction failed (Node exists)") observed against 26.1.6 stock
  and 26.1.11.20001.altinityantalya. The phrase is specific to Keeper
  Raft create-if-absent rejections; other Keeper errors (timeout,
  leader-loss) produce different messages.

* New pkg/oauth_state/cleanup.go runs an in-process goroutine on an
  hourly ticker issuing `ALTER TABLE altinity.oauth_refresh_consumed_jtis
  DELETE WHERE consumed_at < now() - INTERVAL 35 DAY`. KeeperMap doesn't
  support CH-native TTL, so cleanup is application-side. Multi-pod
  deployments all run their own loops — duplicate ALTER DELETEs are
  harmless. Failures log at WARN, never panic; cleanup failure does not
  affect the security control (KeeperMap atomicity does). The loop's
  shape is sketched after this list.

* pkg/server/server.go starts the cleanup loop when
  RefreshRevokesTracking is enabled. serverCleanupRunner adapter reads
  s.refreshStateStore on each tick so SetRefreshStateStore (test-only)
  takes effect on the cleanup loop too.

* docs/sql/oauth-state.sql switches consumed_jtis to
  ENGINE = KeeperMap('/altinity_mcp/oauth_refresh_consumed_jtis')
  PRIMARY KEY jti. revoked_families stays ReplicatedMergeTree (INSERTs
  are idempotent for revocation; CH-native TTL handles cleanup). Adds
  ALTER DELETE to the mcp_service grants for the cleanup goroutine.

* docs/oauth-refresh-reuse-detection.md and the operator wiki
  (acm/mcp/.wiki/mcp-oauth-debugging.md) document the new
  keeper_map_path_prefix CH config drop-in prerequisite, the
  query-level SETTINGS load-bearing constraint, and the engine-mixed
  storage shape.
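The cleanup loop's shape, sketched under the same assumptions (runCleanupLoop is a hypothetical name; the real code lives in pkg/oauth_state/cleanup.go):

```go
import (
	"context"
	"database/sql"
	"time"

	"github.com/rs/zerolog/log"
)

// runCleanupLoop deletes expired consumed-jti rows hourly. KeeperMap has
// no CH-native TTL; duplicate ALTER DELETEs from other pods are harmless.
func runCleanupLoop(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if _, err := db.ExecContext(ctx, `
				ALTER TABLE altinity.oauth_refresh_consumed_jtis
				DELETE WHERE consumed_at < now() - INTERVAL 35 DAY`); err != nil {
				log.Warn().Err(err).Msg("oauth: consumed-jti cleanup failed") // WARN, never panic
			}
		}
	}
}
```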

Tests:

* TestOAuthRefreshReuseDetection_AtomicConcurrentClaim hammers the
  refresh handler with 50 goroutines redeeming the SAME refresh JWE
  in parallel via a barrier release. Asserts exactly 1 success and 49
  invalid_grant with reuse_detected. The fake store synchronises via
  sync.Mutex (faithful to KeeperMap's exactly-one-winner semantics).
  This is the regression test for the parallel-replay race; its barrier
  shape is sketched after this list.

* fakeRefreshStateStore extended with no-op Cleanup to satisfy the
  Store interface change.
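The barrier shape of the concurrency test, sketched; redeem stands in for the refresh handler under test and is not the PR's actual helper:

```go
import (
	"sync"
	"sync/atomic"
	"testing"
)

// redeem is hypothetical: wired to the refresh handler under test.
var redeem func(jwe string) error

func TestAtomicConcurrentClaim(t *testing.T) {
	const n = 50
	start := make(chan struct{})
	var wins int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // barrier: all n redemptions fire together
			if err := redeem("same-captured-refresh-jwe"); err == nil {
				atomic.AddInt64(&wins, 1)
			}
		}()
	}
	close(start) // release the barrier
	wg.Wait()
	if wins != 1 {
		t.Fatalf("want exactly 1 successful redemption, got %d", wins)
	}
}
```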

Verified against otel CH cluster 9572 during development:
* keeper_map_path_prefix added via acmctl setting create + push
* DROP+CREATE migrated consumed_jtis to KeeperMap
* GRANT INSERT, SELECT, ALTER DELETE applied
* Smoke INSERT-twice produced KEEPER_EXCEPTION as expected

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eplay)

Two scripts + README covering refresh-token reuse detection end-to-end
against a live deployment, complementing the unit test
TestOAuthRefreshReuseDetection_AtomicConcurrentClaim:

- scripts/qa/h2-replay-test.sh — drives the OAuth dance interactively
  (DCR + Auth0 browser login captured by an inline python http.server
  on localhost:8910), then exercises the sequential reuse-detection
  path: refresh R0 (expect 200 + R1), replay R0 (expect 400
  invalid_grant reuse_detected), bonus replay R1 (expect 400 too —
  family-wide revocation per RFC 9700).

- scripts/qa/h2-parallel-test.sh — same OAuth dance, then fires N
  concurrent redemptions of the same R0 via backgrounded curl + wait.
  Asserts exactly 1 × 200 OK + N-1 × 400 invalid_grant; non-zero exit
  on anomaly. This is the test that proves KeeperMap strict mode's
  atomicity: the SELECT-then-INSERT design would fail this; the new
  code shouldn't. Verified passing against otel-mcp during
  development.

- scripts/qa/README.md — when to run, prerequisites (jq + openssl +
  python3 + browser), state-table verification SQL, failure-mode
  triage.

Verified against otel-mcp (cd65134, oauth-h2-atomic image):
- replay-test → consumed_jtis row (R0's jti) + revoked_families row
  (family_id, reason=reuse_detected); R1 also rejected
- parallel-test (N=50) → exactly 1 × 200, 49 × 400 reuse_detected;
  consumed_jtis +1, revoked_families +1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…im re-check

Second-pass review of #106 flagged that the consumed-jti claim was made
atomic via KeeperMap, but the revoked-families table stayed
ReplicatedMergeTree and was checked on every refresh. ReplicatedMergeTree
replication is asynchronous: a recent revoke INSERT on pod A can be
invisible to pod B's read on a different CH replica for the duration of
the replication lag. That window lets the winner of a forked family keep
refreshing on
different pods/replicas while the revoke catches up — the same parallel-
replay window the KeeperMap consumed-jti claim was supposed to close.

Three fixes:

1. revoked_families becomes KeeperMap, no strict mode (idempotent
   overwrite — parallel losers writing the same family_id is fine).
   KeeperMap reads are linearizable through Keeper Raft, so a revoke
   write on pod A is visible to pod B's next pre-check regardless of CH
   replica routing.

2. Revoke INSERT failures hard-fail instead of WARN-and-return. The
   previous code logged WARN and returned ErrRefreshReused even if the
   revoke row didn't land — leaving the winning fork alive on every
   subsequent refresh. RFC 9700 family-wide revocation must be
   authoritative, not best-effort. On revoke INSERT failure we now
   return a generic error which the handler turns into HTTP 500
   server_error; operators see an ERR-level "SECURITY: reuse detected
   but revoke INSERT failed" log and page on the underlying Keeper or
   grant issue.

3. Post-claim revoked re-check. Step 1 (pre-check) and step 2 (atomic
   claim) are not a single Keeper transaction — there's a microsecond
   TOCTOU window where another pod can revoke between them. After a
   successful claim we re-check revocation; if the family was revoked
   during the window we return ErrRefreshReused (the consumed-jti slot
   is spent but no token is minted). This closes the residual window
   that the pre-check alone could not close, even with KeeperMap-backed
   revoked_families; the full sequence is sketched below.
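Putting the three fixes together, the redemption sequence sketches out as below. Store is a hypothetical narrowing of the PR's oauth_state.Store; method names are illustrative:

```go
import (
	"context"
	"errors"
	"fmt"
)

var ErrRefreshReused = errors.New("refresh token reuse detected")

// Store is a hypothetical narrowing of pkg/oauth_state's interface.
type Store interface {
	IsFamilyRevoked(ctx context.Context, familyID string) (bool, error)
	ClaimJTI(ctx context.Context, jti, familyID string) error // strict-mode KeeperMap INSERT
	RevokeFamily(ctx context.Context, familyID string) error
}

func redeemRefresh(ctx context.Context, s Store, jti, familyID string) error {
	// 1. Pre-check: linearizable KeeperMap read of revoked_families.
	revoked, err := s.IsFamilyRevoked(ctx, familyID)
	if err != nil {
		return err // handler answers 500 server_error
	}
	if revoked {
		return ErrRefreshReused
	}
	// 2. Atomic consume-before-mint claim of the jti.
	if err := s.ClaimJTI(ctx, jti, familyID); err != nil {
		if errors.Is(err, ErrRefreshReused) {
			// Reuse detected: the revoke must persist, or we hard-fail.
			if rerr := s.RevokeFamily(ctx, familyID); rerr != nil {
				return fmt.Errorf("SECURITY: reuse detected but revoke INSERT failed: %w", rerr)
			}
			return ErrRefreshReused
		}
		return err
	}
	// 3. Post-claim re-check closes the TOCTOU window between 1 and 2.
	if revoked, err = s.IsFamilyRevoked(ctx, familyID); err != nil {
		return err
	} else if revoked {
		return ErrRefreshReused // jti slot spent, but no token minted
	}
	return nil // safe to mint the new access+refresh pair
}
```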

Schema migration applied to otel CHI 9572: DROP +
ReplicatedMergeTree → CREATE + KeeperMap. 2 existing audit rows
discarded; the affected families' refresh tokens are still rejected
by the consumed-jti pre-check on the new path.

Cleanup goroutine now runs ALTER TABLE … DELETE on both KeeperMap
tables on the same hourly ticker. Failures on either are wrapped and
returned together so observability isn't masked.

Tests:

* TestOAuthRefreshReuseDetection_RevokeInsertFailureHardFails pins
  the new hard-fail behavior. The fake store grew a `failRevoke`
  knob: when armed, the duplicate-detected branch returns a generic
  error instead of ErrRefreshReused, simulating a Keeper write
  failure on the revoke. The handler must respond with 500
  server_error, NOT 400 invalid_grant — otherwise the regression
  silently restores the "winning fork stays alive" hole.

* Existing TestOAuthRefreshReuseDetection_AtomicConcurrentClaim
  (50 concurrent redeemers, exactly 1 winner) still passes 10× runs
  + race detector. The fake's reuse-detected branch now also
  defaults to recording the revoke before returning ErrRefreshReused,
  matching the new production semantics.

* Documentation updated in docs/oauth-refresh-reuse-detection.md
  ("Why KeeperMap (for both tables)", "Revoke must persist", "Post-
  claim revocation re-check") and the operator wiki to call out the
  KeeperMap-on-both-tables shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>