HA-safe OAuth state + Stateless StreamableHTTP transport (closes #113) #114
Merged
Replace the per-pod oauthStateStore (in-memory maps for pending-auth and issued auth codes) with stateless JWE tokens, using the existing encodeOAuthJWE/decodeOAuthJWE + HKDF infrastructure already proven for DCR client_id and refresh tokens.

Why: in forward mode altinity-mcp is the OAuth AS. /.well-known points clients at MCP's own /authorize and /token, so with replicas>=2 and no sticky sessions the legs of the OAuth dance land on different pods and the in-memory state lookup fails (~75% of the time). Encoding the state into the Google `state` parameter and the MCP auth `code` makes any replica with the shared signing_secret able to decrypt either side.

Single-use enforcement on auth codes is intentionally not done server-side: codes are bound to the client's PKCE verifier (RFC 7636) and live 60s, so replay within the TTL is limited to whoever holds the verifier. Trading strict RFC 6749 §4.1.2 single-use for zero shared state.

New HKDF labels:
  altinity-mcp/oauth/pending-auth/v1
  altinity-mcp/oauth/auth-code/v1

Whitelist additions in jwe_auth: resource, upstream_pkce_verifier.

Removed: oauthStateStore, its mutex, eviction logic, maxOAuthStateEntries, randomToken, application.oauthState/oauthStateMu fields, getOAuthStateStore.

Replaced TestOAuthStateStore*/TestOAuthStateStoreEviction with TestOAuthStateJWERoundTrip covering round-trip, cross-pod portability, mismatched-secret rejection, expiry, tamper, and missing secret.

Affects forward-mode deployments only (antalya, billing, otel-google). Gating mode (otel, github via Auth0 CIMD) was already HA-safe: Auth0 owns the OAuth surface there.
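The stateless auth-code idea described above can be illustrated with a minimal, stdlib-only sketch. It is not the PR's implementation: it uses raw AES-GCM as a stand-in for the JWE envelope, a hand-rolled single-block HKDF-SHA256 instead of `jwe_auth.DeriveKey`, and a hypothetical `authCode` claim set whose fields are assumptions. It only shows the shape of the technique: any replica holding the shared secret can open the code, and redemption still requires the PKCE verifier and an unexpired timestamp.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"encoding/json"
	"errors"
)

// Label taken from the commit message; it separates auth-code keys from other uses.
const authCodeLabel = "altinity-mcp/oauth/auth-code/v1"

// authCode is a hypothetical claim set; the real oauthIssuedCode may differ.
type authCode struct {
	ClientID      string `json:"client_id"`
	CodeChallenge string `json:"code_challenge"` // PKCE S256 challenge
	ExpiresAt     int64  `json:"exp"`            // Unix seconds
}

// deriveKey is single-block HKDF-SHA256 (RFC 5869) with an all-zero salt.
// One block is 32 bytes, which is exactly an AES-256 key.
func deriveKey(secret []byte, label string) []byte {
	extract := hmac.New(sha256.New, make([]byte, sha256.Size))
	extract.Write(secret)
	expand := hmac.New(sha256.New, extract.Sum(nil))
	expand.Write([]byte(label)) // HKDF "info"
	expand.Write([]byte{1})     // block counter
	return expand.Sum(nil)
}

func newAEAD(secret []byte) (cipher.AEAD, error) {
	if len(secret) == 0 {
		return nil, errors.New("signing secret is empty")
	}
	block, err := aes.NewCipher(deriveKey(secret, authCodeLabel))
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// pkceChallenge computes the S256 challenge for a verifier (RFC 7636).
func pkceChallenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// sealAuthCode encrypts the claims into an opaque, URL-safe token.
func sealAuthCode(secret []byte, c authCode) (string, error) {
	aead, err := newAEAD(secret)
	if err != nil {
		return "", err
	}
	plain, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	// The random nonce is prepended to the ciphertext.
	return base64.RawURLEncoding.EncodeToString(aead.Seal(nonce, nonce, plain, nil)), nil
}

// redeemAuthCode opens the token, then checks expiry and the PKCE verifier.
// nowUnix is passed in so callers and tests control the clock.
func redeemAuthCode(secret []byte, token, verifier string, nowUnix int64) (authCode, error) {
	aead, err := newAEAD(secret)
	if err != nil {
		return authCode{}, err
	}
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil || len(raw) < aead.NonceSize() {
		return authCode{}, errors.New("malformed code")
	}
	plain, err := aead.Open(nil, raw[:aead.NonceSize()], raw[aead.NonceSize():], nil)
	if err != nil {
		return authCode{}, errors.New("code failed authentication")
	}
	var c authCode
	if err := json.Unmarshal(plain, &c); err != nil {
		return authCode{}, err
	}
	if nowUnix >= c.ExpiresAt {
		return authCode{}, errors.New("code expired")
	}
	if subtle.ConstantTimeCompare([]byte(pkceChallenge(verifier)), []byte(c.CodeChallenge)) != 1 {
		return authCode{}, errors.New("PKCE verifier mismatch")
	}
	return c, nil
}
```

In this sketch a code minted with `sealAuthCode` on one process redeems on any other process given the same secret, and nothing is stored between the two calls. Replaying the code within its lifetime still succeeds for a caller who holds the verifier, which is the trade-off the commit message accepts.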
NewStreamableHTTPHandler defaults to session-tracked mode where each pod issues and validates its own Mcp-Session-Id. Under replicas>=2 with non-sticky load balancing, the MCP `initialize` call lands on whichever pod the LB picks, the client picks ONE returned session-id, and any subsequent tool call that lands on the OTHER pod is rejected with code 32600 "Session terminated".

Switch both NewStreamableHTTPHandler call sites to Stateless: true. Each request becomes self-contained, no per-pod session table required.

Trade-off: server-initiated requests (sampling, roots/list, log notifications outside an active request) are not supported. altinity-mcp only handles client-initiated tool calls, so this is safe today.

Pairs with the JWE OAuth-state refactor in 9f16fd3. Together they make forward-mode and gating+broker_upstream deployments HA-safe.
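The failure mode above can be reproduced with a toy model. This is a sketch only: the `pod` type and its methods are hypothetical and do not use the MCP SDK. It models each replica as an object with a private session table, and shows that a session id issued by one replica is unknown to the other unless the replicas run in a stateless mode.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
)

// pod models one replica. When stateless is false it keeps a private
// session table, as a session-tracked transport would.
type pod struct {
	stateless bool
	sessions  map[string]bool
}

func newPod(stateless bool) *pod {
	return &pod{stateless: stateless, sessions: map[string]bool{}}
}

// initialize returns a fresh session id in session-tracked mode,
// and an empty string in stateless mode.
func (p *pod) initialize() string {
	if p.stateless {
		return ""
	}
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	id := hex.EncodeToString(buf)
	p.sessions[id] = true
	return id
}

// callTool rejects session ids this pod did not issue, unless it is stateless.
func (p *pod) callTool(sessionID string) error {
	if !p.stateless && !p.sessions[sessionID] {
		return errors.New("session terminated")
	}
	return nil
}
```

With two session-tracked pods, `initialize` on one followed by `callTool` on the other returns an error; with two stateless pods the same sequence succeeds. In the real code the equivalent switch is the `Stateless: true` option named in the commit message.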
Closes #113.
Makes altinity-mcp safe to run with `replicas >= 2` behind a non-sticky load balancer by removing the two pieces of per-pod in-memory state that broke under HA: the OAuth pending-auth / auth-code store, and the streamable HTTP session table.

Full reasoning, the alternatives I rejected, and the explicit single-use trade-off are in #113. This description sticks to what changed and how to verify.
Two commits
9f16fd3: stateless OAuth state via JWE
- `cmd/altinity-mcp/oauth_server.go`:
  - New HKDF labels: `altinity-mcp/oauth/pending-auth/v1` and `altinity-mcp/oauth/auth-code/v1`.
  - `oauthPendingAuth` and `oauthIssuedCode` now round-trip through `encodeOAuthJWE` / `decodeOAuthJWE` (the same helpers and `jwe_auth.DeriveKey` HKDF primitive already used for the stateless DCR `client_id` / `client_secret` / refresh-token paths).
  - `oauthStateStore` (the two in-memory maps, mutex, eviction, 10k cap) is deleted. `randomToken` is removed (no callers).
  - New `application` methods: `encodePendingAuth`, `decodePendingAuth`, `encodeAuthCode`, `decodeAuthCode`.
- `cmd/altinity-mcp/main.go`: `application.oauthState`, `oauthStateMu`, and `getOAuthStateStore` removed.
- `pkg/jwe_auth/jwe_auth.go`: whitelist additions `resource` and `upstream_pkce_verifier`. Required so `decodeOAuthJWE` accepts the new claim keys.

058f43d: Stateless StreamableHTTP transport
- `cmd/altinity-mcp/main.go`: both `mcp.NewStreamableHTTPHandler` call sites now pass `&mcp.StreamableHTTPOptions{Stateless: true}`. The transport stops issuing and validating per-pod `Mcp-Session-Id`s; each request is self-contained.
- Trade-off: server-initiated requests (sampling, `roots/list`, log notifications outside an active request) are not supported. altinity-mcp only handles client-initiated tool calls today; see #113 (Stateless OAuth state + Stateless MCP transport for HA) for the reasoning behind accepting this.

Test changes
Added:
`TestOAuthStateJWERoundTrip` in `cmd/altinity-mcp/oauth_server_test.go` with subtests:
- `pending_auth_round_trip`: encode + decode preserves every field
- `auth_code_round_trip`
- `cross_pod_portable_with_shared_secret`: two `application` instances sharing only the signing secret; token minted by one decodes on the other
- `cross_pod_rejected_with_different_secret`
- `expired_auth_code_rejected`, `expired_pending_auth_rejected`
- `tampered_token_rejected`: single-byte flip in the ciphertext
- `decode_missing_secret_fails_cleanly`

Removed:
`TestOAuthStateStoreSizeCap`, `TestOAuthStateStore`, `TestOAuthStateStoreEviction`: the in-memory store no longer exists.

Untouched canaries (still pass):
`TestOAuthForwardModeBrowserLoginUsesUpstreamBearerToken`, `TestOAuthForwardModeNoRefreshToken`, `TestOAuthE2EWithMockOIDC`, the negative-path tests at lines 1603–1753.

Hard requirements after merge
- `MCP_OAUTH_SIGNING_SECRET` must be a shared k8s Secret across replicas. All production deployments already source from `<deployment>-mcp-signing-secret`.