Skip to content

Stop fabricating partial identity in RestoreSession#5650

Open
tgrunnagle wants to merge 7 commits into
mainfrom
vmcp-restoresession-identity-issue_5336
Open

Stop fabricating partial identity in RestoreSession#5650
tgrunnagle wants to merge 7 commits into
mainfrom
vmcp-restoresession-identity-issue_5336

Conversation

@tgrunnagle

@tgrunnagle tgrunnagle commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Two related bugs caused authenticated backend calls to fail after cross-pod session restore (Redis + `replicas > 1`).

Bug 1 — partial identity fabrication: `RestoreSession` constructed `&auth.Identity{Subject, Claims}` with `Token` and `UpstreamTokens` always empty, violating the field-completeness invariant on `*auth.Identity`. This struct was captured as `identityRoundTripper.fallbackIdentity` in every backend connector, so it was a latent landmine for any future consumer that assumes a non-nil identity is fully populated. Fix: pass `nil` to `makeBaseSession` — live requests carry a fully-populated identity on `req.Context()` from `TokenValidator.Middleware`, so the fallback is never needed.

Bug 2 — backend Initialize unauthenticated during restore: `loadSession` (the cache-miss restore path) called `RestoreSession` with `context.Background()`, discarding the incoming request's identity. Backend MCP `Initialize` handshakes during restore therefore ran with no auth headers — backends requiring `upstreamInject` / `tokenExchange` / `aws_sts` on `Initialize` silently lost their capabilities from the restored session. Fix: thread `context.Context` from `GetMultiSession` → `loadSession` → `RestoreSession`, using `context.WithoutCancel(ctx)` to carry values (identity) without propagating the caller's cancellation signal (required because `singleflight` coalesces concurrent restores).

Fixes #5336

Type of change

  • Bug fix
  • Refactor (context propagation through cache + session manager)

Test plan

  • Unit tests (`task test`)
  • Linting (`task lint-fix`)

Changes

File Change
`pkg/auth/identity.go` Add field-completeness contract to `Identity` godoc
`pkg/vmcp/session/factory.go` Pass `nil` identity to `makeBaseSession` in `RestoreSession`; validate stored binding without fabricating partial struct
`pkg/vmcp/session/identity_binding_test.go` Assert connector receives `nil` identity on restore; rename stale test
`pkg/cache/validating_cache.go` `Get`, `load`, and `check` accept `context.Context`; document singleflight winner-takes-context semantic
`pkg/vmcp/server/session_manager_interface.go` `GetMultiSession(ctx, sessionID)`; doc describes contract for implementors, not Manager internals
`pkg/vmcp/server/sessionmanager/session_manager.go` `GetMultiSession`, `loadSession`, `checkSession` accept context; `loadSession` uses `context.WithoutCancel(ctx)` for both storage load and `RestoreSession`; explain `context.Background()` in `DecorateSession`
`pkg/vmcp/server/sessionmanager/session_manager_test.go` `TestGetMultiSession_PropagatesIdentityToRestoreSession` — pins that `*auth.Identity` and `UpstreamTokens` reach `RestoreSession`; fails if `WithoutCancel` is reverted to `Background`
`pkg/vmcp/server/serve_handlers.go` `enforceSessionBinding` passes handler `ctx` to `GetMultiSession`
`pkg/vmcp/server/serve_optimizer.go` Same for optimizer tool/find handlers
`test/e2e/thv-operator/virtualmcp/virtualmcp_redis_session_test.go` Add `TODO(#5336)` for upstream-auth E2E (needs OIDC provider in test env)

Does this introduce a user-facing change?

No visible change for single-replica deployments (never hit `RestoreSession`). For multi-replica Redis deployments with upstream-auth strategies: backend Initialize handshakes during cross-pod restore are now authenticated, so restored sessions correctly reflect all backend capabilities rather than silently dropping backends that require auth on Initialize.

Special notes for reviewers

  • `context.WithoutCancel`: used in `loadSession` and `checkSession` to copy the caller's identity without propagating its cancellation. This is necessary because `ValidatingCache.Get` deduplicates concurrent requests via `singleflight` — if the triggering caller cancels mid-restore, the other coalesced waiters must still receive a result. Independent per-operation timeouts (`restoreStorageTimeout` / `restoreSessionTimeout`) still apply. The regression guard is `TestGetMultiSession_PropagatesIdentityToRestoreSession`.
  • `DecorateSession` calls `GetMultiSession(context.Background(), sessionID)` — it's a session-setup operation, not a live authenticated request, so there is no caller identity to propagate. Explained by an inline comment.
  • AC2 (`MetadataKeyIdentitySubject`): This constant does not exist in the codebase — it references a code pattern from an older iteration of the issue description. The subject is encoded inside `MetadataKeyIdentityBinding` (format: `iss\x00sub`) and is preserved via `SetMetadata` at `factory.go:471`. AC2 is satisfied; `TestRestoreSession_PreservesIdentityBindingAndAcceptsMatchingCaller` (line 185) confirms the binding round-trips correctly.
  • The nil-identity path through `makeBaseSession` was already safe: all dereferences of `identity` in `factory.go`, `identityRoundTripper.RoundTrip`, and `RestoreHijackPrevention` already guard against nil.

Generated with Claude Code

RestoreSession constructed &auth.Identity{Subject, Claims} with Token
and UpstreamTokens unset, then passed it to makeBaseSession where it
propagated to the identityRoundTripper fallback in each backend
connector. This violated the field-completeness contract on *auth.Identity
and was a latent bug for any future consumer (audit log, outgoing-auth
strategy) that assumes a non-nil *Identity is fully populated.

Implements changes for issue #5336:
- Remove partial identity construction from RestoreSession; pass nil to
  makeBaseSession instead. Binding validation (binding.Parse) is kept to
  catch corrupted stored bindings before any backend work is attempted.
- Add field-completeness contract to auth.Identity godoc: a non-nil
  *Identity is always complete; nil represents the anonymous / no-identity
  state, not a partially-initialised struct.
- Replace the test that pinned the old partial-identity behavior with one
  that asserts nil is passed to the backend connector on restore.

Live tool calls are unaffected: TokenValidator.Middleware places a
fully-populated identity on req.Context() before requests reach the
backend connectors, and identityRoundTripper already only injects its
captured fallback when the request context carries none (see #5323).
- Update RestoreSession implementation doc comment to reflect that
  connectors now receive nil identity (the old comment said identity was
  reconstructed from the stored iss/sub binding, which is no longer true)
- Add TODO comment to the cross-pod restore E2E test marking the
  upstream-auth extension from AC4 of issue #5336 as dependency-gated
  on a live OIDC provider in the test environment
@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label Jun 25, 2026
@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.34%. Comparing base (5310b0c) to head (533301f).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
pkg/vmcp/server/serve_handlers.go 75.00% 0 Missing and 1 partial ⚠️
pkg/vmcp/session/factory.go 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5650   +/-   ##
=======================================
  Coverage   70.34%   70.34%           
=======================================
  Files         649      649           
  Lines       66101    66175   +74     
=======================================
+ Hits        46500    46553   +53     
- Misses      16253    16271   +18     
- Partials     3348     3351    +3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tgrunnagle tgrunnagle left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Agent Consensus Review

Agents consulted: security/auth, test-coverage, go-code-quality, architecture, general-quality (5 agents)

Consensus Summary

# Finding Consensus Severity Action
F1 Stale test name TestRestoreSession_PopulatesBothSubjectFieldAndClaims 7/10 MEDIUM Fix
F2 MetadataKeyIdentitySubject AC — connection to fix not documented 9/10 MEDIUM Discuss

Overall

The production-side fix is correct and complete. Passing nil to makeBaseSession eliminates the partial *auth.Identity fabrication without introducing nil dereferences — identityRoundTripper.RoundTrip already guards with i.fallbackIdentity != nil, makeBaseSession/initOneBackend propagate nil safely to the connector, and RestoreHijackPrevention takes no identity argument. The field-completeness contract added to auth.Identity godoc is a well-placed marker against this bug class recurring. The renamed test (TestRestoreSession_PassesNilIdentityToConnector) correctly pins the new contract and adds a connectorCalled guard to prevent a false-pass nil assertion when the connector is never invoked. This is a clean, minimal fix.

Two findings survived consensus scoring — both documentation/naming issues, not correctness problems:

F2 — Issue #5336 acceptance criterion 2 says "MetadataKeyIdentitySubject is preserved across restore." The constant MetadataKeyIdentitySubject does not exist in the codebase; the subject is encoded in MetadataKeyIdentityBinding and is preserved at factory.go:471. The fix is correct, but a reviewer auditing the issue's AC list against the diff cannot close this item without significant investigation. One sentence in the "Special notes for reviewers" section resolves this — e.g., "AC2's reference to MetadataKeyIdentitySubject describes a code pattern from the pre-PR code, not a real constant. The subject is encoded in MetadataKeyIdentityBinding, which is preserved at factory.go:471 — confirmed by TestRestoreSession_PopulatesBothSubjectFieldAndClaims line 185."

F1 — see inline comment.


Generated with Claude Code

Comment thread pkg/vmcp/session/identity_binding_test.go
TestRestoreSession_PopulatesBothSubjectFieldAndClaims described behavior
removed by the partial-identity fix: RestoreSession no longer populates
Subject or Claims on a connector identity. The test body verifies that
the stored identity binding round-trips through session metadata and
that the security decorator accepts a matching caller.

Addresses #5650 review comments:
- MEDIUM pkg/vmcp/session/identity_binding_test.go (3477948930): rename stale test name
@tgrunnagle

Copy link
Copy Markdown
Contributor Author

F2 (review body): Updated the PR description's "Special notes for reviewers" to explicitly state that MetadataKeyIdentitySubject is not a real constant — the subject is encoded in MetadataKeyIdentityBinding and preserved at factory.go:471, confirmed by TestRestoreSession_PreservesIdentityBindingAndAcceptsMatchingCaller (line 185).

loadSession previously created context.Background() before calling
RestoreSession, discarding the incoming request's identity. Backend
connectors capture an identityRoundTripper whose fallback is nil (after
#5336), so the Initialize handshake during cross-pod restore ran
completely unauthenticated — backends requiring upstreamInject /
tokenExchange / aws_sts auth on Initialize silently lost their
capabilities from the restored session.

The fix threads context.Context through the call chain:

  ValidatingCache.Get(ctx, key)
    → load func(ctx, key) / check func(ctx, key, val)
    → loadSession(ctx, sessionID)
    → RestoreSession(context.WithoutCancel(ctx), ...)

context.WithoutCancel strips the caller's cancellation signal while
preserving its values (including *auth.Identity). This is necessary
because ValidatingCache deduplicates concurrent Get calls via
singleflight: if the triggering request cancels mid-restore, other
waiters must still receive a result. Independent per-operation timeouts
(restoreStorageTimeout / restoreSessionTimeout) still apply on top.

Changes:
- pkg/cache: ValidatingCache.Get, load, and check accept context.Context
- pkg/vmcp/server: SessionManager.GetMultiSession(ctx, id)
- pkg/vmcp/server/sessionmanager: GetMultiSession, loadSession, and
  checkSession accept context; use context.WithoutCancel for storage ops
- pkg/vmcp/server: enforceSessionBinding passes handler ctx through;
  DecorateSession (session setup, not live request) uses Background()

Implements part of issue #5336.
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed size/M Medium PR: 300-599 lines changed labels Jun 25, 2026
jhrozek
jhrozek previously approved these changes Jun 26, 2026

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look at this despite the draft status — it's in good shape.

Traced the context propagation chain all the way through the mark3labs SDK to confirm Bug 2 is actually fixed: c.Initialize(ctx) does propagate the restore ctx to http.NewRequestWithContext for both streamable-HTTP and SSE transports, so the caller's identity reaches authRoundTripper.RoundTrip during cross-pod restore. The nil-identity fix for Bug 1 is clean and well-tested.

The honest caveat is already captured in the e2e TODO: end-to-end authenticated restore for upstreamInject also needs the sibling UpstreamTokens work from #5323 to land. But that's a follow-up, not a gap in this PR.

LGTM.

@tgrunnagle tgrunnagle left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Agent Consensus Review

Agents consulted: Security & Authentication, Concurrency & Context Propagation, API & Interface Design, Test Coverage, General Code Quality

Consensus Summary

# Finding Consensus Severity Action
F1 DecorateSession: no comment explaining context.Background() is intentional 9/10 LOW Add comment
F2 ValidatingCache.Get: context param + singleflight winner semantics undocumented 8/10 LOW Add godoc
F3 No test verifying identity propagation through loadSession to RestoreSession 7/10 MEDIUM Add test
F4 SessionManager interface doc leaks context.WithoutCancel implementation detail 7/10 LOW Rephrase

Overall

This is a focused, two-layer bug fix for cross-pod vMCP session restore failures with upstream-auth strategies. Layer A removes a partial-identity footgun: RestoreSession was constructing &auth.Identity{Subject, Claims} with empty Token/UpstreamTokens, violating the field-completeness invariant and silently supplying empty credentials to outgoing-auth strategies. The fix — passing nil — is correct. Layer B fixes the structural root cause: GetMultiSession had no context.Context, so loadSession's RestoreSession call always ran with context.Background(), losing the incoming request's identity. The context-threading approach is sound, and the use of context.WithoutCancel to preserve identity values while isolating the restore from caller cancellation correctly handles the singleflight coalescing concern.

The one substantive gap is a missing unit test for Layer B's core mechanism: the context.WithoutCancel identity-preservation path has no automated coverage. Issue #5336's acceptance criterion — "Backend Initialize handshakes during RestoreSession are authenticated when a live identity is available" — is satisfied by code logic but not verified by any test. Reversing the WithoutCancel fix in loadSession would cause zero test failures. The three LOW findings are documentation improvements to explain non-obvious design decisions that all reviewers independently flagged.

Documentation

  • pkg/cache/validating_cache.go Get godoc: should describe what ctx is used for and the singleflight winner-takes-context semantic (inline comment F2)
  • pkg/vmcp/server/session_manager_interface.go line 33: interface comment describes context.WithoutCancel — an implementation detail of Manager, not a contract for implementors (inline comment F4)
  • pkg/vmcp/server/sessionmanager/session_manager.go line 768: DecorateSession's context.Background() should be explained (inline comment F1)
  • docs/arch/13-vmcp-scalability.md: worth a quick scan to confirm it still accurately reflects session restore behavior after these changes

Generated with Claude Code

Comment thread pkg/vmcp/server/sessionmanager/session_manager.go
Comment thread pkg/cache/validating_cache.go
Comment thread pkg/vmcp/server/sessionmanager/session_manager_test.go
Comment thread pkg/vmcp/server/session_manager_interface.go Outdated
Addresses #5650 review comments:
- LOW pkg/vmcp/server/sessionmanager/session_manager.go (3482265055): add comment explaining context.Background() in DecorateSession is intentional — session setup, not a live authenticated request
- LOW pkg/cache/validating_cache.go (3482265071): document that only the first singleflight caller's context is forwarded to load/check
- LOW pkg/vmcp/server/session_manager_interface.go (3482265078): rewrite GetMultiSession doc as caller/implementor contract, not an implementation detail of Manager
…ession

Addresses #5650 review comments:
- MEDIUM pkg/vmcp/server/sessionmanager/session_manager_test.go (3482265074): reversing context.WithoutCancel(ctx) to context.Background() in loadSession previously caused zero test failures; this test catches that regression
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Jun 26, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Jun 26, 2026
@tgrunnagle tgrunnagle marked this pull request as ready for review June 26, 2026 19:32
@tgrunnagle

Copy link
Copy Markdown
Contributor Author

Created #5658 to track adding e2e validation of upstream inject after cross-pod restore

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[vMCP] RestoreSession reconstructs a partial auth.Identity (Subject only), breaks cross-pod failover with upstream-auth strategies

2 participants