oauth: CIMD inbound + DCR removal + HA replay (#115)#116
Conversation
Replace Dynamic Client Registration with OAuth Client ID Metadata Documents as the only inbound MCP OAuth client mechanism. Aligns altinity-mcp with the MCP authorization spec direction (DCR retired 2025-11-25 in favor of CIMD) and lets the upstream IdP be the cross-replica replay oracle. CIMD resolver (cmd/altinity-mcp/cimd.go, new): - HTTPS-only URL validation: no userinfo/fragment/query, port 443 only, dot-segment + encoded-slash + IDN normalization rejection. - SSRF-safe fetcher: custom DialContext explicitly resolves DNS, blocks loopback / RFC1918 / link-local / multicast / IPv6 ULA / CGNAT / 0.0.0.0/8 / 192.0.0.0/24, pins dial to a validated IP, post-dial address re-check, no env proxy, no redirects, 3s timeout, 5 KiB body limit, JSON-only. - Schema validation: client_id must equal request URL, token_endpoint_auth_method must equal "none", redirect_uris bounded and deduped + https-only, refresh_token tolerated in grant_types (but unused), client_secret/private_key_jwt rejected. - In-memory LRU cache with Cache-Control: max-age (capped at 1h), no-store honored, negative-cache 30s, never overrides a positive entry. DCR removal (cmd/altinity-mcp/oauth_server.go): - handleOAuthRegister and its route deleted; /oauth/register now returns 404. - /.well-known/oauth-authorization-server drops registration_endpoint and refresh_token, advertises token_endpoint_auth_methods_supported=["none"] and client_id_metadata_document_supported: true. - handleOAuthTokenRefreshDispatch / handleOAuthTokenRefreshForward / mintForwardRefreshToken deleted; refresh_token grant returns unsupported_grant_type. CIMD clients re-authorize in v1. - parseStatelessRegisteredClient + authenticateClientSecret + hex import deleted as unused. HA replay model (#115 § HA replay): - /oauth/callback no longer POSTs to upstream /token. Instead it wraps the upstream auth code + upstream PKCE verifier + redirect_uri + code_challenge + scope/resource in a new 60s downstream JWE auth code and 302s back to the MCP client. - /oauth/token now does the upstream exchange. Upstream invalid_grant maps to downstream invalid_grant — this is the cross-replica replay verdict. No pod-local replay cache. Upstream IdP (Google or Auth0) is the sole used-codes oracle, eliminating the JWE auth-code replay window that the previous design accepted as "PKCE-bound only". - hkdfInfoOAuthAuthCode bumped to /v2 so any v1 codes in flight at the cutover decrypt as garbage (60s TTL means this is harmless). - oauthIssuedCode struct shed UpstreamBearerToken / UpstreamRefreshToken / UpstreamTokenType / Subject / Email / Name / HostedDomain / EmailVerified / AccessTokenExpiry; added UpstreamAuthCode + UpstreamPKCEVerifier. Tests: - cimd_test.go (new): URL validation, SSRF rejection table, schema validation, fetch safety (oversize body, non-JSON, redirect rejected), cache (max-age, no-store, TTL cap, negative cache, key exactness), ssrfSafeDial direct. - oauth_ha_replay_test.go (new): fake upstream that invalid_grants on the second redemption; asserts first /token returns access_token, second returns downstream invalid_grant, and no refresh_token is ever issued. - DCR-dependent tests deleted from oauth_server_test.go (TestOAuthHTTPDiscoveryAndRegistration, TestOAuthRegistrationNegative, TestOAuthForwardModeRefresh, TestOAuthForwardModeNoRefreshToken, TestOAuthAuthorizeOfflineAccessScope, TestOAuthAuthorizeNegative, TestOAuthCallbackNegative, TestOAuthTokenExchangeNegative, TestOAuthMetadataAdvertisesRefreshToken, TestAuthenticateClientSecret, TestParseStatelessRegisteredClient, TestOAuthForwardModeBrowserLogin*, TestOAuthForwardModeTokenResourceMismatch, TestOAuthMCPAuthInjectorForwardModeValidatesJWT, TestOAuthJWEHKDFRoundtripAndLegacyFallback, TestOAuthStateJWERoundTrip, TestRegisterOAuthHTTPRoutesAliases) along with their now-dead helpers. CIMD coverage is in cimd_test.go; HA replay in oauth_ha_replay_test.go. pkg/jwe_auth/jwe_auth.go: add upstream_auth_code to the JWE claim whitelist so the new downstream code claims pass validation. go.mod: add golang.org/x/net/idna (IDNA hostname normalization). v1 explicitly out of scope (see #115 § Non-goals): - Resource indicators / RFC 8707 audience binding - Scope binding and consent UI - Refresh tokens for CIMD clients - DPoP, private_key_jwt, optional display assets (logo_uri etc.), operator hostname/port allowlists. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E verification — both modes pass with
|
Self-review follow-up to #116. No behaviour change; CIMD wire shape, JWE constants, HA replay semantics all unchanged. E2E reverified on otel-google-mcp with btyshkevich@gmail.com. - Adopt go-sdk's oauthex.ClientRegistrationResponse in parseCIMDMetadata instead of an inline anonymous struct. CIMD docs are the same JSON shape as RFC 7591 client registration; the SDK already types them with field- level documentation. Extra SDK fields (logo_uri, tos_uri, etc.) are ignored, same as before. Saves ~10 LOC; nothing on the wire changes. - Move resolveCIMDClient from package-level var-with-test-stub-hatch to a method on *application backed by a cimdResolver field. Drops the sync.Once singleton and the test-time var swap. Tests that need a fake resolver construct one via the existing testResolver helper and inject it through the application struct (oauth_ha_replay_test.go). - Trim three dead fields from statelessRegisteredClient (GrantType, ExpiresAt, ClientSecret) — all DCR-era; nothing reads them post-DCR. Now a two-field struct. - Simplify validateCIMDPath: collapse three overlapping dot-segment checks (raw segment / decoded segment / %2E expansion) into a single path.Clean(decoded) != decoded comparison, the exact formulation the issue calls for. Encoded slash / backslash check kept separate (path.Clean can't see them). Same reject-set, fewer code paths. - DRY the two .well-known handlers behind one oauthASMetadata helper; handleOAuthOpenIDConfiguration only tacks on its OIDC-specific extras. Eliminates field-by-field drift between RFC 8414 and OIDC discovery documents. - Drop the redundant isCIMDClientID prefix check at /authorize and /token. validateCIMDClientIDURL already enforces scheme==https inside the resolver, so the standalone helper was dead weight. - Replace the parallel ip.IsPrivate() + manual range arithmetic in isBlockedIP with a single audit-friendly []net.IPNet (ssrfBlockedCIDRs) parsed via net.ParseCIDR. One list to read, one list to maintain; same blocklist semantics. Adds an explicit comment naming the RFCs. - Small: apply truncateForLog(clientID, 80) to the three client_id log fields under /authorize, /callback, /token; drop _=resource in handleOAuthTokenAuthCode (replaced by a focused RFC 8707 mismatch check); thread the cache's logical clock through cimdCache.put for test consistency. Verification: go test ./... and go vet ./... green. Local image ghcr.io/altinity/altinity-mcp:cimd-115-refactor-dac3961-arm64 deployed to otel-google-mcp (forward mode); claude.ai connector cimd-rfx-703658 → whoami=btyshkevich@gmail.com, execute_query SELECT currentUser(), 1+1 → u=btyshkevich@gmail.com, two=2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor pass (commit
|
Self-review notes returned nine concrete findings; tightening each. 1. **max-age=0 bug (real)**. extractMaxAge returned 0 for max-age=0, which was then treated as "directive absent" and fell through to the 5-minute default TTL — opposite of RFC 7234 semantics. Replaced with a proper directive-aware parser cacheTTLFromHeader: max-age=0 (and negative) correctly returns ttl=0, treated identically to no-store / no-cache. Regression test TestCIMDResolve_MaxAgeZeroSkipsCache plus a TestCacheTTLFromHeader matrix. 2. **Cache-Control parsing too loose**. strings.Contains on the lowercased header would have matched x-custom-no-storage as no-store. Same new parser does directive-level matching: it splits on ',', trims, and compares each directive exactly. Test rows in TestCacheTTLFromHeader cover x-custom-no-storage and "no-storage" alone — both fall through to the default TTL as expected. 3. **_ = identityClaims looked like dead code**. Replaced with a comment explaining that the validation has a 502-side-effect that's the whole point, and that audience binding is deferred to a follow-up — so the line must not be pruned without re-introducing claim binding. 4. **FIFO/LRU misnomer**. Cache evicts oldest-inserted and never reorders on get — that's FIFO. Comment + the cap field renamed accordingly; added a sentence on why FIFO is fine here (cap >> unique CIMD URLs). 5. **_ = host vestige** in testResolver removed. 6. **Post-dial recheck comment** clarified — explicit-IP dial means the recheck is defense against future refactors, not active rebinding protection in the current code path. 7. **cap → capacity rename** on cimdCache to stop shadowing the builtin. 8. **refresh_token tolerate-but-ignore** now has a comment in parseCIMDMetadata warning future refresh-token implementers that the CIMD grant_types array is NOT authoritative for what we issue — the .well-known AS metadata is. 9. **SSRF blocklist extended** to the IANA Special-Purpose Address registries (RFC 6890): added 192.0.2.0/24 (TEST-NET-1), 198.18.0.0/15 (benchmarking), 198.51.100.0/24 (TEST-NET-2), 203.0.113.0/24 (TEST-NET-3), 240.0.0.0/4 (reserved, includes 255.255.255.255), 2001:db8::/32 (IPv6 docs), 64:ff9b::/96 (IPv4/IPv6 translation), 100::/64 (IPv6 discard). Each entry annotated with its RFC. Tests in TestIsBlockedIP extended to cover all new ranges. Also: TestCIMDResolve_DefaultTTLWhenNoDirectives pins the 5-minute default-TTL path with a Cache-Control: private header (neither no-store nor max-age). All tests + go vet green. No behaviour change beyond the max-age=0 bug fix; no API or wire change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review nits addressed (commit
|
| # | Issue | Fix |
|---|---|---|
| 1 | max-age=0 fell through to 5-min default (real bug) | New cacheTTLFromHeader: max-age=0 and negatives → ttl=0. Regression test + 14-row matrix. |
| 2 | strings.Contains on Cache-Control too loose (would match x-custom-no-storage as no-store) |
Same new parser does exact directive matching, splitting on , and trimming. |
| 3 | _ = identityClaims looked like dead code |
Replaced with a comment explaining the 502-side-effect intent and warning future cleanup not to prune. |
| 4 | Cache called itself LRU but was FIFO | Renamed comments + cap field; cap → capacity (also fixes issue #7 builtin shadow). Kept FIFO honestly. |
| 5 | _ = host in testResolver |
Dropped. |
| 6 | Post-dial IP recheck comment over-claimed | Clarified: explicit-IP dial means it's defense against future refactors, not active rebinding mitigation in the current code path. |
| 7 | cap field shadowed cap() builtin |
Renamed to capacity (folded into #4). |
| 8 | refresh_token tolerated-then-ignored without breadcrumb for future implementers | Comment in parseCIMDMetadata says explicitly: CIMD grant_types is NOT authoritative; AS metadata is the source of truth. |
| 9 | SSRF blocklist gap | Extended to the full IANA Special-Purpose registry: 192.0.2.0/24, 198.18.0.0/15, 198.51.100.0/24, 203.0.113.0/24, 240.0.0.0/4, 2001:db8::/32, 64:ff9b::/96, 100::/64. Each annotated with its RFC. TestIsBlockedIP extended. |
New tests: TestCacheTTLFromHeader (14 rows), TestCIMDResolve_MaxAgeZeroSkipsCache, TestCIMDResolve_DefaultTTLWhenNoDirectives. All green; go vet clean. No wire-shape change beyond the max-age=0 bug fix.
+198 / -57 LOC.
Self-review #3 turned up four real bugs, three documentation drifts, and three dead-code / consistency issues. No wire-shape change beyond the bug fixes; all behaviour-affecting fixes have regression tests. Bugs: - **identityClaims comment lied** (#5-dup). The "intentionally unused" comment + bare `_ = identityClaims` next to a block that DOES read identityClaims.ExpiresAt (used to derive `expires_in`) was an invitation for a future "clean up unused vars" PR to silently drop the upstream identity validation. Comment rewritten to name the three jobs the validation actually does; underscore drop assigned. - **cacheTTLFromHeader int64 overflow** (#7). `time.Duration(n) * time.Second` overflows when n > ~9.22e9, wrapping to a negative Duration that hit the "directive absent" sentinel branch and silently returned cimdDefaultCacheTTL (5m) instead of the intended cimdMaxCacheTTL (1h). Clamp n against int(cimdMaxCacheTTL/time.Second) before the multiply. Test rows added in TestCacheTTLFromHeader for max-age=9999999999 and max-age=int64.max. - **upstream 200 OK with RFC 6749 error body mapped to 502 server_error** (#8). Non-RFC-compliant IdPs and test stubs that return HTTP 200 + {"error":"invalid_grant"} hit our "no usable token" branch → 502 server_error. Replay-oracle contract assumes downstream sees invalid_grant. tokenResp struct gains Error / ErrorDescription fields; non-empty Error → downstream invalid_grant regardless of status. New TestOAuthTokenUpstream200WithErrorBody covers this. - **Content-Type prefix-match too permissive** (#9). strings.HasPrefix accepted application/json-ld, application/jsonpatch+json, etc. Replaced with isApplicationJSON helper that splits on ";", trims, case-folds, and exact-matches "application/json". TestIsApplicationJSON covers the new behaviour. UX / contract: - **/oauth/register 404 → 410 Gone + JSON** (#11). Legacy DCR clients hitting the route now get an RFC 7591 §3.2.2-shaped error body pointing them at CIMD, not Go's bare text/plain 404. New handleOAuthRegisterRemoved handler. TestOAuthRegisterGone. - **upstream_offline_access description rewritten** (#6). Previous desc tag claimed the flag "issues JWE-wrapped refresh tokens" — post-#115 it doesn't, and can't. Rewritten to explain the flag is upstream-scope-only and v1 issues no downstream refresh tokens regardless of its value. Dead code: - **#12 removals**: defaultRefreshTokenTTLSeconds (oauth_server.go), oauthRegistrationPath() method, OAuthConfig.RegistrationPath + YAML fixture + 2 test assertions, decodeStringSlice + TestDecodeStringSlice, statelessRegisteredClient.TokenEndpointAuthMethod (write-only field). statelessRegisteredClient is now one field. - **#14 dec.UseNumber no-op + string copy**: oauthex.ClientRegistrationResponse has a custom UnmarshalJSON that bypasses outer-decoder settings, so UseNumber() was a no-op and strings.NewReader(string(body)) was a wasted copy. json.Unmarshal(body, &doc) directly. Style: - **Named upstream HTTP timeout** (#1). oauthUpstreamHTTPTimeout = 10 * time.Second constant; both call sites use it. Docs / comments: - **oauthKidV1 comment** (#13). Said "client_id / refresh-token JWE artifacts" and "30-day window" — neither exists post-#115. Rewritten to name pending-auth (10 min) as the longest live legacy artifact and to indicate the SHA256(secret) fallback can be deleted in a follow-up after one rolling restart. - **mustJWESecret error string**: said "OAuth client registration and forward-mode token wrapping" → now "JWE-wrapped pending-auth state and downstream auth-code minting". - **oauth_server_test.go stale tombstone** removed. - **oidcScopesForAdvertisement** doc no longer mentions DCR responses. - **docs/oauth_authorization.md** banner explains the #115 cutover (DCR removed → CIMD only, no downstream refresh tokens, HA replay). Tests added or strengthened: TestCacheTTLFromHeader (+2 overflow rows), TestIsApplicationJSON (8 cases), TestValidateCIMDClientIDURL_Reject (+data:/javascript:/file: scheme rejects), TestOAuthRegisterGone, TestOAuthTokenRefreshGrantUnsupported, TestOAuthASMetadataShape, TestOAuthTokenUpstream200WithErrorBody. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E verified on commit `85f468f`Built locally (`scripts/build-mcp-image.sh cimd-115-nits` → `ghcr.io/altinity/altinity-mcp:cimd-115-nits-85f468f-arm64`), deployed to `otel-google-mcp` (forward mode), reran the same flow we used for the prior two commits. Endpoint shape``` POST /oauth/token (grant_type=refresh_token) Both new contract guarantees from this commit confirmed live. claude.ai chat with btyshkevich@gmail.com (connector `cimd-nits-658c91`)
CH user dropped + recreated around the flow as required by the documented forward-mode `replicated`-shadow workaround. Connector removed afterwards. PR is ready for final review. |
Summary
Implements #115: replace Dynamic Client Registration with OAuth Client ID Metadata Documents as the only inbound MCP OAuth client mechanism. Aligns altinity-mcp with the MCP authorization spec direction (DCR retired 2025-11-25 in favor of CIMD). Bundles the §HA replay model so the upstream IdP becomes the cross-replica auth-code single-use oracle.
cmd/altinity-mcp/cimd.go, new): HTTPS-only URL validation (no userinfo/fragment/query, port 443, dot-segment + IDN normalisation), SSRF-safe fetcher (explicit DNS resolution + IP blocklist + post-dial recheck, no env proxy, no redirects, 3s timeout, 5 KiB body limit, JSON-only), schema validation (client_idbyte-match,token_endpoint_auth_method: none, bounded + deduped + https-onlyredirect_uris, rejectsclient_secret/private_key_jwt), in-memory LRU cache honouringCache-Control: max-age(capped 1h),no-store, negative-cache 30s./oauth/register404s;.well-known/oauth-authorization-serverdropsregistration_endpointandrefresh_token, advertisestoken_endpoint_auth_methods_supported=["none"]andclient_id_metadata_document_supported: true. Refresh-token grant returnsunsupported_grant_type. v1 clients re-authorize./oauth/callbackno longer redeems the upstream auth code — it wraps it (with the upstream PKCE verifier) in a 60s downstream JWE./oauth/tokendoes the upstream POST now; upstreaminvalid_grant→ downstreaminvalid_grant. No pod-local replay cache. Google / Auth0 invalid_grant is the cross-replica replay verdict.v1 deliberately out of scope: resource indicators (RFC 8707), scope binding, consent UI, refresh tokens, DPoP, private_key_jwt, operator allowlists. Each gets a follow-up issue.
Test plan
go test ./...— all green (cimd_test.go, oauth_ha_replay_test.go, surviving oauth_server_test.go cases).go vet ./...clean.otel-google-mcp); runtest-mcp-connectoragainst claude.ai + chatgpt.com withbtyshkevich@gmail.com; confirmwhoami+execute_query SELECT 1+1.otel-google-gating-mcp); same probe./authorizethrough pod A and/tokenthrough pod B; replay/token, confirminvalid_grant.client_id=https://localhost/x.jsonandclient_id=https://169.254.169.254/x.jsonfail before any outbound fetch.🤖 Generated with Claude Code