Stop fabricating partial identity in RestoreSession#5650
Conversation
RestoreSession constructed &auth.Identity{Subject, Claims} with Token
and UpstreamTokens unset, then passed it to makeBaseSession where it
propagated to the identityRoundTripper fallback in each backend
connector. This violated the field-completeness contract on *auth.Identity
and was a latent bug for any future consumer (audit log, outgoing-auth
strategy) that assumes a non-nil *Identity is fully populated.
Implements changes for issue #5336:
- Remove partial identity construction from RestoreSession; pass nil to
makeBaseSession instead. Binding validation (binding.Parse) is kept to
catch corrupted stored bindings before any backend work is attempted.
- Add field-completeness contract to auth.Identity godoc: a non-nil
*Identity is always complete; nil represents the anonymous / no-identity
state, not a partially-initialised struct.
- Replace the test that pinned the old partial-identity behavior with one
that asserts nil is passed to the backend connector on restore.
Live tool calls are unaffected: TokenValidator.Middleware places a
fully-populated identity on req.Context() before requests reach the
backend connectors, and identityRoundTripper already only injects its
captured fallback when the request context carries none (see #5323).
- Update RestoreSession implementation doc comment to reflect that connectors now receive nil identity (the old comment said identity was reconstructed from the stored iss/sub binding, which is no longer true) - Add TODO comment to the cross-pod restore E2E test marking the upstream-auth extension from AC4 of issue #5336 as dependency-gated on a live OIDC provider in the test environment
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5650 +/- ##
=======================================
Coverage 70.34% 70.34%
=======================================
Files 649 649
Lines 66101 66175 +74
=======================================
+ Hits 46500 46553 +53
- Misses 16253 16271 +18
- Partials 3348 3351 +3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
tgrunnagle
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review
Agents consulted: security/auth, test-coverage, go-code-quality, architecture, general-quality (5 agents)
Consensus Summary
| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| F1 | Stale test name TestRestoreSession_PopulatesBothSubjectFieldAndClaims |
7/10 | MEDIUM | Fix |
| F2 | MetadataKeyIdentitySubject AC — connection to fix not documented | 9/10 | MEDIUM | Discuss |
Overall
The production-side fix is correct and complete. Passing nil to makeBaseSession eliminates the partial *auth.Identity fabrication without introducing nil dereferences — identityRoundTripper.RoundTrip already guards with i.fallbackIdentity != nil, makeBaseSession/initOneBackend propagate nil safely to the connector, and RestoreHijackPrevention takes no identity argument. The field-completeness contract added to auth.Identity godoc is a well-placed marker against this bug class recurring. The renamed test (TestRestoreSession_PassesNilIdentityToConnector) correctly pins the new contract and adds a connectorCalled guard to prevent a false-pass nil assertion when the connector is never invoked. This is a clean, minimal fix.
Two findings survived consensus scoring — both documentation/naming issues, not correctness problems:
F2 — Issue #5336 acceptance criterion 2 says "MetadataKeyIdentitySubject is preserved across restore." The constant MetadataKeyIdentitySubject does not exist in the codebase; the subject is encoded in MetadataKeyIdentityBinding and is preserved at factory.go:471. The fix is correct, but a reviewer auditing the issue's AC list against the diff cannot close this item without significant investigation. One sentence in the "Special notes for reviewers" section resolves this — e.g., "AC2's reference to MetadataKeyIdentitySubject describes a code pattern from the pre-PR code, not a real constant. The subject is encoded in MetadataKeyIdentityBinding, which is preserved at factory.go:471 — confirmed by TestRestoreSession_PopulatesBothSubjectFieldAndClaims line 185."
F1 — see inline comment.
Generated with Claude Code
TestRestoreSession_PopulatesBothSubjectFieldAndClaims described behavior removed by the partial-identity fix: RestoreSession no longer populates Subject or Claims on a connector identity. The test body verifies that the stored identity binding round-trips through session metadata and that the security decorator accepts a matching caller. Addresses #5650 review comments: - MEDIUM pkg/vmcp/session/identity_binding_test.go (3477948930): rename stale test name
|
F2 (review body): Updated the PR description's "Special notes for reviewers" to explicitly state that |
loadSession previously created context.Background() before calling RestoreSession, discarding the incoming request's identity. Backend connectors capture an identityRoundTripper whose fallback is nil (after #5336), so the Initialize handshake during cross-pod restore ran completely unauthenticated — backends requiring upstreamInject / tokenExchange / aws_sts auth on Initialize silently lost their capabilities from the restored session. The fix threads context.Context through the call chain: ValidatingCache.Get(ctx, key) → load func(ctx, key) / check func(ctx, key, val) → loadSession(ctx, sessionID) → RestoreSession(context.WithoutCancel(ctx), ...) context.WithoutCancel strips the caller's cancellation signal while preserving its values (including *auth.Identity). This is necessary because ValidatingCache deduplicates concurrent Get calls via singleflight: if the triggering request cancels mid-restore, other waiters must still receive a result. Independent per-operation timeouts (restoreStorageTimeout / restoreSessionTimeout) still apply on top. Changes: - pkg/cache: ValidatingCache.Get, load, and check accept context.Context - pkg/vmcp/server: SessionManager.GetMultiSession(ctx, id) - pkg/vmcp/server/sessionmanager: GetMultiSession, loadSession, and checkSession accept context; use context.WithoutCancel for storage ops - pkg/vmcp/server: enforceSessionBinding passes handler ctx through; DecorateSession (session setup, not live request) uses Background() Implements part of issue #5336.
jhrozek
left a comment
There was a problem hiding this comment.
Took a look at this despite the draft status — it's in good shape.
Traced the context propagation chain all the way through the mark3labs SDK to confirm Bug 2 is actually fixed: c.Initialize(ctx) does propagate the restore ctx to http.NewRequestWithContext for both streamable-HTTP and SSE transports, so the caller's identity reaches authRoundTripper.RoundTrip during cross-pod restore. The nil-identity fix for Bug 1 is clean and well-tested.
The honest caveat is already captured in the e2e TODO: end-to-end authenticated restore for upstreamInject also needs the sibling UpstreamTokens work from #5323 to land. But that's a follow-up, not a gap in this PR.
LGTM.
tgrunnagle
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review
Agents consulted: Security & Authentication, Concurrency & Context Propagation, API & Interface Design, Test Coverage, General Code Quality
Consensus Summary
| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| F1 | DecorateSession: no comment explaining context.Background() is intentional |
9/10 | LOW | Add comment |
| F2 | ValidatingCache.Get: context param + singleflight winner semantics undocumented |
8/10 | LOW | Add godoc |
| F3 | No test verifying identity propagation through loadSession to RestoreSession |
7/10 | MEDIUM | Add test |
| F4 | SessionManager interface doc leaks context.WithoutCancel implementation detail |
7/10 | LOW | Rephrase |
Overall
This is a focused, two-layer bug fix for cross-pod vMCP session restore failures with upstream-auth strategies. Layer A removes a partial-identity footgun: RestoreSession was constructing &auth.Identity{Subject, Claims} with empty Token/UpstreamTokens, violating the field-completeness invariant and silently supplying empty credentials to outgoing-auth strategies. The fix — passing nil — is correct. Layer B fixes the structural root cause: GetMultiSession had no context.Context, so loadSession's RestoreSession call always ran with context.Background(), losing the incoming request's identity. The context-threading approach is sound, and the use of context.WithoutCancel to preserve identity values while isolating the restore from caller cancellation correctly handles the singleflight coalescing concern.
The one substantive gap is a missing unit test for Layer B's core mechanism: the context.WithoutCancel identity-preservation path has no automated coverage. Issue #5336's acceptance criterion — "Backend Initialize handshakes during RestoreSession are authenticated when a live identity is available" — is satisfied by code logic but not verified by any test. Reversing the WithoutCancel fix in loadSession would cause zero test failures. The three LOW findings are documentation improvements to explain non-obvious design decisions that all reviewers independently flagged.
Documentation
pkg/cache/validating_cache.goGetgodoc: should describe whatctxis used for and the singleflight winner-takes-context semantic (inline comment F2)pkg/vmcp/server/session_manager_interface.goline 33: interface comment describescontext.WithoutCancel— an implementation detail ofManager, not a contract for implementors (inline comment F4)pkg/vmcp/server/sessionmanager/session_manager.goline 768:DecorateSession'scontext.Background()should be explained (inline comment F1)docs/arch/13-vmcp-scalability.md: worth a quick scan to confirm it still accurately reflects session restore behavior after these changes
Generated with Claude Code
Addresses #5650 review comments: - LOW pkg/vmcp/server/sessionmanager/session_manager.go (3482265055): add comment explaining context.Background() in DecorateSession is intentional — session setup, not a live authenticated request - LOW pkg/cache/validating_cache.go (3482265071): document that only the first singleflight caller's context is forwarded to load/check - LOW pkg/vmcp/server/session_manager_interface.go (3482265078): rewrite GetMultiSession doc as caller/implementor contract, not an implementation detail of Manager
…ession Addresses #5650 review comments: - MEDIUM pkg/vmcp/server/sessionmanager/session_manager_test.go (3482265074): reversing context.WithoutCancel(ctx) to context.Background() in loadSession previously caused zero test failures; this test catches that regression
|
Created #5658 to track adding e2e validation of upstream inject after cross-pod restore |
Summary
Two related bugs caused authenticated backend calls to fail after cross-pod session restore (Redis + `replicas > 1`).
Bug 1 — partial identity fabrication: `RestoreSession` constructed `&auth.Identity{Subject, Claims}` with `Token` and `UpstreamTokens` always empty, violating the field-completeness invariant on `*auth.Identity`. This struct was captured as `identityRoundTripper.fallbackIdentity` in every backend connector, so it was a latent landmine for any future consumer that assumes a non-nil identity is fully populated. Fix: pass `nil` to `makeBaseSession` — live requests carry a fully-populated identity on `req.Context()` from `TokenValidator.Middleware`, so the fallback is never needed.
Bug 2 — backend Initialize unauthenticated during restore: `loadSession` (the cache-miss restore path) called `RestoreSession` with `context.Background()`, discarding the incoming request's identity. Backend MCP `Initialize` handshakes during restore therefore ran with no auth headers — backends requiring `upstreamInject` / `tokenExchange` / `aws_sts` on `Initialize` silently lost their capabilities from the restored session. Fix: thread `context.Context` from `GetMultiSession` → `loadSession` → `RestoreSession`, using `context.WithoutCancel(ctx)` to carry values (identity) without propagating the caller's cancellation signal (required because `singleflight` coalesces concurrent restores).
Fixes #5336
Type of change
Test plan
Changes
Does this introduce a user-facing change?
No visible change for single-replica deployments (never hit `RestoreSession`). For multi-replica Redis deployments with upstream-auth strategies: backend Initialize handshakes during cross-pod restore are now authenticated, so restored sessions correctly reflect all backend capabilities rather than silently dropping backends that require auth on Initialize.
Special notes for reviewers
Generated with Claude Code