Skip to content

[GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for hedged PK-scoped queries.#48432

Draft
jeet1995 wants to merge 17 commits intoAzure:mainfrom
jeet1995:AzCosmos_FixCloneMissingPkDef
Draft

[GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for hedged PK-scoped queries.#48432
jeet1995 wants to merge 17 commits intoAzure:mainfrom
jeet1995:AzCosmos_FixCloneMissingPkDef

Conversation

@jeet1995
Copy link
Member

@jeet1995 jeet1995 commented Mar 16, 2026

Problem

When a partition-key-scoped query is hedged to a secondary region on Gateway V2 (thin client), the hedged request fails with:

NullPointerException: Cannot invoke "PartitionKeyDefinition.getPaths()" 
because "partitionKeyDefinition" is null
    at PartitionKeyInternalHelper.getEffectivePartitionKeyBytes
    at ThinClientStoreModel.wrapInHttpRequest

Trigger conditions (all must be true):

  • Gateway V2 (thin client) mode
  • Multi-region account with availability strategy / hedging active
  • Partition key set on query request options (setPartitionKey)
  • Hedged request dispatched to secondary region (primary region slow or unresponsive)

Why it happens:

executeFeedOperationWithAvailabilityStrategy clones the RxDocumentServiceRequest via req.clone() for each hedged region. clone() copies partitionKeyInternal (PK value) and hasFeedRangeFilteringBeenApplied = true, but does not copy partitionKeyDefinition (PK schema). The hasFeedRangeFilteringBeenApplied = true flag prevents downstream code (populateHeadersAsync) from re-resolving the collection to set PK def. When ThinClientStoreModel.wrapInHttpRequest computes EPK bytes, it needs both — and NPEs on the null PK def.

Why only Gateway V2: GW V1 never reads partitionKeyDefinition from the request — it serializes the PK as a JSON header and the server resolves EPK. GW V2 must compute EPK client-side for the RNTBD binary protocol.

Why not point operations: Point operation hedging (wrapPointOperationWithAvailabilityStrategy) clones RequestOptions and builds a fresh request per region — no req.clone().

Fix (defense in depth)

  1. Primary — ThinClientStoreModel.performRequestInternal: Overridden to reactively resolve partitionKeyDefinition from the collection cache when missing. The collection is already cached by this point, so this is a cheap cache hit.

  2. Secondary — RxDocumentServiceRequest.clone(): Now copies partitionKeyDefinition alongside partitionKeyInternal, preventing the field from being silently dropped.

Testing

Direct repro (verified fail → pass)

Test Type Without fix With fix
ThinClientStoreModelTest.cloneShouldPreservePartitionKeyDefinition Unit FAIL — PK def null PASS
FaultInjectionServerErrorRuleOnGatewayV2Tests.faultInjectionServerErrorRuleTests_PkScopedQueryWithHedging E2E (fi-thinclient-multi-master) NPE — identical to customer stack trace PASS

The E2E test fault-injects a response delay in the primary region, triggering hedging to the secondary region. The hedged PK-scoped query on GW V2 exercises exactly the clone()wrapInHttpRequest path.

Holistic availability strategy coverage for thin client

Wired the FaultInjectionWithAvailabilityStrategy test suite into thin-client CI groups:

Test class Groups added
FITests_readAfterCreation fi-thinclient-multi-region, fi-thinclient-multi-master
FITests_writeAfterCreate fi-thinclient-multi-master
FITests_queryAfterCreation fi-thinclient-multi-region, fi-thinclient-multi-master
FITests_readManyAfterCreation fi-thinclient-multi-region, fi-thinclient-multi-master
FITests_readAllAfterCreation fi-thinclient-multi-region, fi-thinclient-multi-master

Added thin client endpoint assertion in FaultInjectionWithAvailabilityStrategyTestsBase.execute() — when COSMOS.THINCLIENT_ENABLED and COSMOS.HTTP2_ENABLED are set with ConnectionMode.GATEWAY, validates that requests targeted the thin client proxy endpoint (:10250/).

Test utility consolidation

Moved assertThinClientEndpointUsed to TestSuiteBase as a shared protected static utility. Removed 3 duplicate copies from FaultInjectionServerErrorRuleOnGatewayV2Tests, CosmosNotFoundTests, and ClientRetryPolicyE2ETestsWithGatewayV2.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@jeet1995 jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch 2 times, most recently from a2c92e2 to e8efae4 Compare March 16, 2026 22:59
@jeet1995 jeet1995 changed the title [GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for h… [GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for hedged PK-scoped queries. Mar 16, 2026
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch from e8efae4 to fbf0b6f Compare March 17, 2026 00:02
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…edged PK-scoped queries

RxDocumentServiceRequest.clone() (used by hedging in executeFeedOperationWithAvailabilityStrategy)
was not copying partitionKeyDefinition. When a PK-scoped query is hedged to a secondary region,
the cloned request has partitionKeyInternal (PK value) but null partitionKeyDefinition (PK schema).
ThinClientStoreModel.wrapInHttpRequest needs both for client-side EPK computation, causing NPE:

  Cannot invoke PartitionKeyDefinition.getPaths() because partitionKeyDefinition is null
  at PartitionKeyInternalHelper.getEffectivePartitionKeyBytes
  at ThinClientStoreModel.wrapInHttpRequest

Root cause: RxDocumentServiceRequest.clone() did not copy partitionKeyDefinition.
This only manifests in GW V2 (thin client) because GW V1 never reads partitionKeyDefinition
from the request - it serializes the PK as a JSON header and lets the server resolve EPK.
Point operations are unaffected because they clone RequestOptions (not the request) for hedging.

Fix (defense in depth):
- Primary: ThinClientStoreModel overrides performRequestInternal to reactively resolve
  partitionKeyDefinition from the collection cache when missing on the request. Includes
  logger.warn for production observability when the fallback triggers.
- Secondary: RxDocumentServiceRequest.clone() now copies partitionKeyDefinition.

Test coverage:
- Unit test: cloneShouldPreservePartitionKeyDefinition (verified: fails without fix, passes with)
- E2E test: PK-scoped query with fault-injected response delay triggering hedging on GW V2
  (verified: NPE without fix, passes with fix)
- Wired FaultInjectionWithAvailabilityStrategy test suite (FITests_read/write/query/readMany/readAll)
  into fi-thinclient-multi-region and fi-thinclient-multi-master test groups for holistic coverage.
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 force-pushed the AzCosmos_FixCloneMissingPkDef branch from fbf0b6f to 12ada5e Compare March 17, 2026 01:01
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…ness

TenantWorkloadConfig.getConsistencyLevel() called valueOf(toUpperCase()) which
fails for wire-format values like 'BoundedStaleness' -> 'BOUNDEDSTALENESS'
(enum name is BOUNDED_STALENESS). Now falls back to matching the display name.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

FITests_queryAfterCreation test configs use ONE_SECOND_DURATION timeouts
designed for DIRECT mode. Thin client forces GATEWAY mode which has higher
latency through the proxy, causing 408 timeouts. Skip DIRECT configs when
thin client is enabled until proper GATEWAY variants are added.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@jeet1995
Copy link
Member Author

/azp run java - cosmos - ci

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 2 commits March 17, 2026 08:03
…strategy tests running

FI tests with availability strategy (hedging) must still run under thin client since
query/readAll hedging is core functionality being validated. Only skip DIRECT mode test
configs that have no availability strategy (baseline AllGood tests with tight timeouts).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ction modes

Availability strategy (hedging) is connection-mode agnostic. Tests should run under
thin client without skipping, regardless of whether the config specifies DIRECT or GATEWAY.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…kip condition

The ThinClient_MultiMaster job was missing -DCOSMOS.HTTP2_ENABLED=true in
tests.yml, so thin client was running without HTTP/2. Added the flag to match
the ThinClient base job. Restored the full 3-way skip condition:
isThinClientEnabled && isHttp2Enabled && connectionMode == DIRECT.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…_MultiRegion

Added detailed comment explaining why DIRECT mode tests are skipped under
thin client + HTTP/2: requests route through gateway proxy, DIRECT timeouts
are too tight, GATEWAY + availability strategy tests still run.

Also enabled -DCOSMOS.HTTP2_ENABLED=true for ThinClient_MultiRegion job
to match ThinClient and ThinClient_MultiMaster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…ng in thin client mode

RegionalRoutingContext objects were used as keys in CaseInsensitiveMap, which
hashes via toString(). Since toString() includes the mutable thinclientRegionalEndpoint,
two issues arose:

1. equals()/hashCode() included thinclientRegionalEndpoint which is set AFTER map
   insertion in addRoutingContexts(), corrupting map lookups for point operations.
   Fix: equals()/hashCode() now only use the immutable gatewayRegionalEndpoint.

2. getRegionName() creates new RegionalRoutingContext(gatewayEndpoint) for lookups,
   but these have thinclientRegionalEndpoint=null while stored keys have it set.
   CaseInsensitiveMap.convertKey() calls toString() producing different strings,
   so lookups fail and fallback returns first write region for ALL regions.
   This caused availability strategy hedging to target the same region twice
   (no actual cross-region hedging), and affected circuit breaker, session
   container, and diagnostics region identification.
   Fix: Use HashMap instead of CaseInsensitiveMap for RegionalRoutingContext keys,
   so lookups use equals()/hashCode() which only depend on gatewayRegionalEndpoint.

3. Deferred regionByEndpoint map insertion to after thinclient endpoints are set,
   and added null check on regionalRoutingContext before setThinclientRegionalEndpoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…it breaker and PPAF

- Revert HashMap/deferred-insertion changes in LocationCache.java (keep only RegionalRoutingContext equals/hashCode fix)
- Add fi-thinclient-multi-master group to PerPartitionCircuitBreakerE2ETests (miscellaneousDocumentOperationHitsTerminalExceptionAcrossKRegionsGateway)
- Add fi-thinclient-multi-region group to PerPartitionAutomaticFailoverE2ETests (all 3 test methods)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…oint

CaseInsensitiveMap.convertKey() calls toString().toLowerCase() on map keys.
When thinclientRegionalEndpoint is included in toString(), stored keys
(with thinclient set) differ from lookup keys (thinclient null), causing
getRegionName() to fail for all cross-region features in thin client mode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 2 commits March 17, 2026 19:17
…nabled

Thin client only supports GATEWAY mode. DIRECT mode tests use mock
TransportClient which is not applicable to thin client routing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yProxy

ThinClientStoreModel uses RNTBD binary encoding for request/response,
which is incompatible with the standard HTTP mock responses created by
setupHttpClientToReturnSuccessResponse(). Replace thinProxy with
gatewayProxy in PPAF GATEWAY mock paths so data requests use the same
mocked HttpClient with standard HTTP encoding. PPAF retry/failover
logic is transport-agnostic, so this correctly tests the failover
behavior without requiring RNTBD-encoded mock responses.

Also add COSMOS.THINCLIENT_ENABLED and COSMOS.HTTP2_ENABLED system
properties to fi-thinclient-multi-region and fi-thinclient-multi-master
Maven profiles to ensure Configs.isThinClientEnabled() returns true in
forked JVM.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Revert COSMOS.THINCLIENT_ENABLED and COSMOS.HTTP2_ENABLED from
pom.xml profiles — these come from tests.yml instead.

Add assertThinClientEndpointUsed validation to:
- PerPartitionCircuitBreakerE2ETests execute() failure and recovery
  flows to verify requests hit thin client proxy endpoint (:10250/)
- PerPartitionAutomaticFailoverE2ETests third test method (GATEWAY
  FaultInjection path) to verify thin client endpoint usage

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Exception responses during fault injection failure flow may not have
thin client endpoint in diagnostics. Only assert thin client endpoint
on success responses (cosmosException == null).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Apply same cosmosException == null guard to the recovery flow
assertion — recovery iterations can also have exception responses.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants