Add E2E test for upstreamInject identity propagation after cross-pod Redis restore#5660
Conversation
Implements issue #5658 (Option B: Dex as in-cluster OIDC provider). - Add DexImage constant to centralized E2E image registry - Add deployDex/cleanupDex helpers to helpers.go: deploy Dex with mockCallback connector (auto-approves, no browser interaction) + NodePort service for external test-process access - Add InstrumentedMCPBackendScript: a Python Flask MCP server that correctly handles MCP JSON-RPC and logs every inbound Bearer token to a /stats endpoint - Add getEmbeddedASToken: performs the full OAuth2 PKCE authorization code flow from the test process against the embedded AS, rewriting in-cluster URLs to NodePort/port-forward addresses at each hop - Add "When cross-pod session restore preserves upstreamInject identity" Context in virtualmcp_redis_session_test.go that verifies the fix from #5650: context propagation through loadSession ensures the incoming identity (with UpstreamTokens from Dex, keyed by tsid in shared Redis) reaches RestoreSession, so upstreamInject can inject the Dex token into backend Initialize requests after cross-pod restore - Remove the TODO(#5336) comment; the test now covers this gap Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Mark InstrumentedBackendScript deprecated in favor of InstrumentedMCPBackendScript - Wrap Dex NodePort read in Eventually to handle async assignment - Fix rewriteURLBase to use exact hostname equality instead of substring Contains Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## vmcp-restoresession-identity-issue_5336 #5660 +/- ##
===========================================================================
- Coverage 70.34% 70.34% -0.01%
===========================================================================
Files 649 649
Lines 66175 66091 -84
===========================================================================
- Hits 46553 46490 -63
+ Misses 16271 16255 -16
+ Partials 3351 3346 -5 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
PR size has been reduced below the XL threshold. Thank you for splitting this up!
|
✅ PR size has been reduced below the XL threshold. The size review has been dismissed and this PR can now proceed with normal review. Thank you for splitting this up! |
1. Audience mismatch: the operator derives AllowedAudiences from oidcRef.ResourceURL; without an explicit ResourceURL it falls back to http://<vmcpName>.svc:4483 (the resource name, not the service name). Set ResourceURL = vmcpServiceURL explicitly so AllowedAudiences matches the Audience set in MCPOIDCConfigReference, satisfying the validator. 2. Backend container crashes: the proxy runner's configureContainer() calls WithArgs(MCPServer.Spec.Args), which overwrites the Args set in the PodTemplateSpec patch. Pass InstrumentedMCPBackendScript via v1beta1test.WithArgs() instead, leaving only the Command override (sh -c) in the patch where it is preserved. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Adding "net/url" to helpers.go caused the revive linter to flag existing "url" variable/parameter names as import-shadowing. Rename them to avoid the conflict: url -> healthURL in checkHTTPHealthReady, url -> rawURL in getAndDecodeJSON. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
tgrunnagle
left a comment
There was a problem hiding this comment.
Review Summary
| Finding | Severity | Score | Action |
|---|---|---|---|
F01: Pod B BearerTokenRequests >= 2 not load-bearing |
HIGH | 8/10 | Fix |
F02: rewriteURLBase allows port-mismatch bypass |
MEDIUM | 9/10 | Fix |
| F03: Intermediate bearer assertion satisfied by ListTools | MEDIUM | 7/10 | Fix |
F04: rand.Read return value discarded without comment |
MEDIUM | 7/10 | Fix |
| F05: HTTP response bodies not drained before close | LOW | 8/10 | Fix |
F06: registerOAuthClient errors not wrapped with %w |
LOW | 9/10 | Fix |
Holistic Assessment
The infrastructure design is solid — Dex with mockCallback, the 9-step PKCE flow, and InstrumentedMCPBackendScript are all well-executed and correctly model the out-of-cluster test process accessing in-cluster endpoints. The Dex helper mirrors deployRedis/cleanupRedis cleanly, and the Deprecated annotation on InstrumentedBackendScript correctly directs future tests to InstrumentedMCPBackendScript.
The critical issue (F01) is that the final assertion uses BearerTokenRequests >= 2, which can be satisfied by pod A's Initialize + ListTools before pod B's RestoreSession fires — meaning the test would pass even if context.WithoutCancel in loadSession were reverted to context.Background(). The fix is to assert on InitializeCalls >= 2 instead, which is incremented only on the initialize JSON-RPC method and is strictly load-bearing.
Generated with Claude Code
BearerTokenRequests is incremented by every backend request that carries a Bearer token — including ListTools, CallTool, and Initialize. On pod A, both the Initialize and the subsequent ListTools call inject the upstream token via upstreamInject, so BearerTokenRequests can reach 2 before pod B's RestoreSession fires. This means the final assertion was not load-bearing: the test passed even if context.WithoutCancel were reverted. InitializeCalls is incremented only on the "initialize" JSON-RPC method, so it reaches 2 only after both pod A's Initialize and pod B's RestoreSession- triggered Initialize succeed. Replace both assertions with InitializeCalls. Addresses #5660 review comments: - HIGH test/e2e/thv-operator/virtualmcp/virtualmcp_redis_session_test.go (3484255613): final assertion not load-bearing - MEDIUM test/e2e/thv-operator/virtualmcp/virtualmcp_redis_session_test.go (3484255624): intermediate assertion inflated by ListTools Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
F02 rewriteURLBase: the fallback path stripped ports and compared only hostnames, so same-hostname/different-port URLs would silently bypass the guard. Compare both hostname and port after splitting. F04 generatePKCEPair: document why rand.Read return values are discarded — crypto/rand.Read cannot return an error since Go 1.20. F05 getEmbeddedASToken: drain the three intermediate 302 response bodies (steps 3, 5, 7) before closing, per the go-style.md connection-reuse rule. F06 registerOAuthClient: wrap bare errors with fmt.Errorf so callers get call-site context (which DCR step failed). Addresses #5660 review comments: - MEDIUM helpers.go (3484255633): rewriteURLBase port-mismatch bypass - MEDIUM helpers.go (3484255648): rand.Read discard undocumented - LOW helpers.go (3484255652): response bodies not drained before close - LOW helpers.go (3484255643): bare errors in registerOAuthClient Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
ghcr.io/dexidp/dex:v2.42.0 is used by Context 5 (upstreamInject cross-pod restore test) but was not pre-pulled like the other test images. Add it to the parallel pull and kind load steps so the node doesn't go to the internet mid-test, consistent with every other image in the suite. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The proxy runner's configureContainer applies ReadOnlyRootFilesystem to the inner mcp container via the platform-aware security context. This causes pip install flask to fail — Python's tempfile module probes /tmp, /var/tmp, /usr/tmp, and / in order, and finds none writable. Mount a writable emptyDir volume at /tmp via the PodTemplateSpec patch so pip can create build-tracker temp directories during installation. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Summary
This PR is stacked on #5650 and adds only the E2E test infrastructure that was blocked by the lack of an in-cluster OIDC provider (see the `TODO(#5336)` removed here). All production code fixes are in #5650.
PR #5650 fixed two bugs in the cross-pod Redis session restore path for multi-replica vMCP deployments:
The existing cross-pod E2E tests exercised only anonymous sessions and could not catch regressions in the authenticated restore path. This PR adds the missing coverage.
Closes #5658
Type of change
Test plan
The new test (Context 5 of `virtualmcp_redis_session_test.go`) exercises the full authenticated restore path:
The `InitializeCalls` metric is incremented only on the `initialize` JSON-RPC method, making the threshold strictly load-bearing: it can reach 2 only after both pod A's Initialize and pod B's RestoreSession-triggered Initialize succeed with a valid upstream token. The test fails if `context.WithoutCancel` in `loadSession` is reverted to `context.Background()`.
Changes
Special notes for reviewers
Why Dex: The existing `DeployParameterizedOIDCServer` only serves JWKS/discovery for token validation. The embedded auth server also needs OAuth2 authorization and token endpoints (`/auth`, `/token`) to complete the upstream token exchange that drives `upstreamInject`. Dex with the `mockCallback` connector provides these without browser interaction.
OAuth2 flow from the test process: The embedded AS runs inside vMCP pods (in-cluster) and Dex is in-cluster. `getEmbeddedASToken` bridges this by combining port-forwards (for the embedded AS endpoints) and Dex's NodePort (for external reachability), rewriting in-cluster URLs to local addresses at each redirect hop of the PKCE flow.
Shared Redis: The test reuses the session-storage Redis instance for the embedded AS's upstream token store (different key prefix), avoiding a second Redis deployment.
CI: The workflow triggers automatically on PRs touching `test/e2e/thv-operator/**`. Ginkgo runs all tests in that tree with `--procs=8` and no label filter, so Context 5 runs as part of every CI check. The Dex image is now pre-loaded into Kind in the same parallel pull step as the other test images.
Generated with Claude Code