Skip to content

Retry OAuth token refresh on infrastructure 4xx#5170

Merged
jhrozek merged 1 commit into
stacklok:mainfrom
gkatz2:fix/retry-4xx-token-refresh-5169
May 12, 2026
Merged

Retry OAuth token refresh on infrastructure 4xx#5170
jhrozek merged 1 commit into
stacklok:mainfrom
gkatz2:fix/retry-4xx-token-refresh-5169

Conversation

@gkatz2
Copy link
Copy Markdown
Contributor

@gkatz2 gkatz2 commented May 3, 2026

Summary

  • The OAuth token refresh endpoint returns 4xx for infrastructure events (WAF/firewall blocks, rate limits, transient bad-config deploys) as well as for OAuth protocol failures. Today every 4xx is treated as a permanent auth failure, killing the workload until manual re-authentication. A momentary VPN flap that routes a refresh request through a non-allowlisted IP is enough to leave the workload dead.
  • Treat 4xx responses that lack a structured RFC 6749 error code as transient, so they enter the existing retry loop added in Retry OAuth token refresh on server errors #4513. Only 4xx with a populated error code (invalid_grant, invalid_client, etc.) remains permanent. 429 is always transient per HTTP standard.

Fixes #5169

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Added TestIsTransientNetworkError_AgainstRealOAuth2Library, which exercises the classification function end-to-end through golang.org/x/oauth2.Config.TokenSource against an httptest.Server returning each response shape (HTML 403, HTML 401, JSON invalid_grant, JSON invalid_client, 429, 503). This pins the assumption that RetrieveError.ErrorCode is populated only for parseable RFC 6749 error responses, so a future oauth2 upgrade that changes that behavior would surface here, not in production.

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Does this introduce a user-facing change?

Remote MCP servers with OAuth authentication now survive transient 4xx token-endpoint events (WAF blocks, rate limits, brief upstream config errors) instead of permanently going dark on the first occurrence.

Special notes for reviewers

Classification rule

golang.org/x/oauth2 populates RetrieveError.ErrorCode only when the response body is parseable JSON containing an RFC 6749 error field. An empty ErrorCode therefore signals an infrastructure-level response (HTML page from a WAF, CDN, or reverse proxy), not an OAuth protocol failure. The new classifyOAuthRetrieveError helper applies this rule:

  • 5xx: transient (existing behavior, from Retry OAuth token refresh on server errors #4513)
  • 429: transient regardless of body (HTTP standard)
  • 4xx with empty ErrorCode: transient (infrastructure error)
  • 4xx with populated ErrorCode: permanent (OAuth protocol failure)

Scope

Classification only — no changes to retry parameters, monitor lifecycle, or workload state machine. Outages that span longer than a single refresh attempt (e.g. a multi-hour VPN outage that crosses a token expiry boundary) still result in the workload being marked unauthenticated; that's a separable architectural change requiring a new "transiently failing" state and is not part of this fix.

Interaction with #5044

PR #5044 (in flight) emits a DCR remediation hint when classification returns permanent 4xx. This change reduces false triggers of that hint: the hint will fire for invalid_grant/invalid_client (where DCR creds genuinely may be stale) but no longer for WAF blocks (where the hint would have been misleading). Pure mechanical merge conflict on the same file, no semantic conflict.

Generated with Claude Code

@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label May 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.82%. Comparing base (9572f6a) to head (10b8e5e).
⚠️ Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5170      +/-   ##
==========================================
+ Coverage   67.65%   67.82%   +0.16%     
==========================================
  Files         607      610       +3     
  Lines       61982    62407     +425     
==========================================
+ Hits        41937    42325     +388     
- Misses      16883    16910      +27     
- Partials     3162     3172      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jhrozek
jhrozek previously approved these changes May 5, 2026
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Classification logic is RFC 6749 / 6585 compliant, the real-library contract test is the right way to defend against future golang.org/x/oauth2 upgrades changing ErrorCode semantics, and singleflight already prevents the cross-replica refresh-rotation race from getting worse. PR scope is tight (classification only, ~206 LOC).

Two non-blocking comments below — both are suggestions for follow-up scope and a small test-design refinement, not anything that should hold this up.

Comment thread pkg/auth/monitored_token_source.go
Comment thread pkg/auth/monitored_token_source_test.go
The OAuth token refresh endpoint returns 4xx for infrastructure
events (WAF/firewall blocks, rate limits, transient bad-config
deploys) as well as for OAuth protocol failures. Today every 4xx
is treated as a permanent auth failure, killing the workload
until manual re-authentication. A momentary VPN flap that routes
a refresh request through a non-allowlisted IP is enough to
leave the workload dead.

Treat 4xx responses that lack a structured RFC 6749 error code
as transient; only 4xx with a populated error code
(invalid_grant, invalid_client, etc.) remains permanent. 429 is
always transient per HTTP standard.

Fixes stacklok#5169

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@gkatz2 gkatz2 force-pushed the fix/retry-4xx-token-refresh-5169 branch from 9de8e4f to 10b8e5e Compare May 7, 2026 18:06
@gkatz2
Copy link
Copy Markdown
Contributor Author

gkatz2 commented May 7, 2026

I force-pushed an amended version with the rebase plus a few additional refinements. Headline:

  1. Rebased onto current main (clean conflict resolution against Wire authserver DCR resolver and add structured logs #5044 and Set User-Agent on OAuth token refresh requests #5168).

  2. Tightened isPermanentTokenEndpointError to be the strict inverse of isTransientRetrieveError on the *oauth2.RetrieveError branch. The DCR remediation Warn now fires only when the response carries a populated RFC 6749 error code — not on 4xx-with-empty-ErrorCode (HTML pages from a WAF, CDN, or reverse proxy). The reasoning: a non-spec-compliant response means the OAuth server didn't render a verdict, so recommending "delete cached credentials" based on it would frequently mislead operators whose real problem is upstream of the OAuth server. This is a behavior tweak to code originally added in Wire authserver DCR resolver and add structured logs #5044 — flagging this for @tgrunnagle in case you want to weigh in.

  3. Renamed classifyOAuthRetrieveErrorisTransientRetrieveError for naming symmetry with isPermanentTokenEndpointError and to match the file's is<Adjective><Noun>Error convention.

  4. Added TestIsPermanentTokenEndpointError to assert the new inverse behavior directly. The function had 100% line coverage before, but no test asserted on its return value, so the semantic change wouldn't have failed any existing test.

  5. Trimmed TestIsTransientNetworkError_AgainstRealOAuth2Library per your comment.

Please re-review when you have a chance, @jhrozek. The diff between the previous approval state and current HEAD is centered on pkg/auth/monitored_token_source.go around isPermanentTokenEndpointError / isTransientRetrieveError.

@github-actions github-actions Bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels May 7, 2026
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both review items from the previous pass are resolved. The Retry-After point was non-blocking and explicitly a follow-up.

Also a nice bonus: tightening isPermanentTokenEndpointError to require a populated ErrorCode means the DCR remediation Warn no longer fires on WAF/CDN HTML pages, which would have misled operators. Worth flagging @tgrunnagle since this touches the Warn from #5044.

@jhrozek
Copy link
Copy Markdown
Contributor

jhrozek commented May 12, 2026

I apologize for the delay @gkatz2 this somehow slipped in the review queue madness

@jhrozek jhrozek merged commit 84f606a into stacklok:main May 12, 2026
73 of 75 checks passed
@gkatz2
Copy link
Copy Markdown
Contributor Author

gkatz2 commented May 12, 2026

I apologize for the delay @gkatz2 this somehow slipped in the review queue madness

All good, @jhrozek! Thanks for the review. I can't imagine what it is like to handle the PR load on an open source project these days. 🫡

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Token refresh treats transient 4xx as permanent auth failures

2 participants