[GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for hedged PK-scoped queries.#48432
Draft
jeet1995 wants to merge 17 commits intoAzure:mainfrom
Draft
[GATEWAY V2]: Fix NPE in ThinClientStoreModel.wrapInHttpRequest for hedged PK-scoped queries.#48432jeet1995 wants to merge 17 commits intoAzure:mainfrom
jeet1995 wants to merge 17 commits intoAzure:mainfrom
Conversation
a2c92e2 to
e8efae4
Compare
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
e8efae4 to
fbf0b6f
Compare
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…edged PK-scoped queries RxDocumentServiceRequest.clone() (used by hedging in executeFeedOperationWithAvailabilityStrategy) was not copying partitionKeyDefinition. When a PK-scoped query is hedged to a secondary region, the cloned request has partitionKeyInternal (PK value) but null partitionKeyDefinition (PK schema). ThinClientStoreModel.wrapInHttpRequest needs both for client-side EPK computation, causing NPE: Cannot invoke PartitionKeyDefinition.getPaths() because partitionKeyDefinition is null at PartitionKeyInternalHelper.getEffectivePartitionKeyBytes at ThinClientStoreModel.wrapInHttpRequest Root cause: RxDocumentServiceRequest.clone() did not copy partitionKeyDefinition. This only manifests in GW V2 (thin client) because GW V1 never reads partitionKeyDefinition from the request - it serializes the PK as a JSON header and lets the server resolve EPK. Point operations are unaffected because they clone RequestOptions (not the request) for hedging. Fix (defense in depth): - Primary: ThinClientStoreModel overrides performRequestInternal to reactively resolve partitionKeyDefinition from the collection cache when missing on the request. Includes logger.warn for production observability when the fallback triggers. - Secondary: RxDocumentServiceRequest.clone() now copies partitionKeyDefinition. Test coverage: - Unit test: cloneShouldPreservePartitionKeyDefinition (verified: fails without fix, passes with) - E2E test: PK-scoped query with fault-injected response delay triggering hedging on GW V2 (verified: NPE without fix, passes with fix) - Wired FaultInjectionWithAvailabilityStrategy test suite (FITests_read/write/query/readMany/readAll) into fi-thinclient-multi-region and fi-thinclient-multi-master test groups for holistic coverage.
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
fbf0b6f to
12ada5e
Compare
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…ness TenantWorkloadConfig.getConsistencyLevel() called valueOf(toUpperCase()) which fails for wire-format values like 'BoundedStaleness' -> 'BOUNDEDSTALENESS' (enum name is BOUNDED_STALENESS). Now falls back to matching the display name. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
FITests_queryAfterCreation test configs use ONE_SECOND_DURATION timeouts designed for DIRECT mode. Thin client forces GATEWAY mode which has higher latency through the proxy, causing 408 timeouts. Skip DIRECT configs when thin client is enabled until proper GATEWAY variants are added. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
Member
Author
|
/azp run java - cosmos - ci |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
…strategy tests running FI tests with availability strategy (hedging) must still run under thin client since query/readAll hedging is core functionality being validated. Only skip DIRECT mode test configs that have no availability strategy (baseline AllGood tests with tight timeouts). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ction modes Availability strategy (hedging) is connection-mode agnostic. Tests should run under thin client without skipping, regardless of whether the config specifies DIRECT or GATEWAY. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…kip condition The ThinClient_MultiMaster job was missing -DCOSMOS.HTTP2_ENABLED=true in tests.yml, so thin client was running without HTTP/2. Added the flag to match the ThinClient base job. Restored the full 3-way skip condition: isThinClientEnabled && isHttp2Enabled && connectionMode == DIRECT. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…_MultiRegion Added detailed comment explaining why DIRECT mode tests are skipped under thin client + HTTP/2: requests route through gateway proxy, DIRECT timeouts are too tight, GATEWAY + availability strategy tests still run. Also enabled -DCOSMOS.HTTP2_ENABLED=true for ThinClient_MultiRegion job to match ThinClient and ThinClient_MultiMaster. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…ng in thin client mode RegionalRoutingContext objects were used as keys in CaseInsensitiveMap, which hashes via toString(). Since toString() includes the mutable thinclientRegionalEndpoint, two issues arose: 1. equals()/hashCode() included thinclientRegionalEndpoint which is set AFTER map insertion in addRoutingContexts(), corrupting map lookups for point operations. Fix: equals()/hashCode() now only use the immutable gatewayRegionalEndpoint. 2. getRegionName() creates new RegionalRoutingContext(gatewayEndpoint) for lookups, but these have thinclientRegionalEndpoint=null while stored keys have it set. CaseInsensitiveMap.convertKey() calls toString() producing different strings, so lookups fail and fallback returns first write region for ALL regions. This caused availability strategy hedging to target the same region twice (no actual cross-region hedging), and affected circuit breaker, session container, and diagnostics region identification. Fix: Use HashMap instead of CaseInsensitiveMap for RegionalRoutingContext keys, so lookups use equals()/hashCode() which only depend on gatewayRegionalEndpoint. 3. Deferred regionByEndpoint map insertion to after thinclient endpoints are set, and added null check on regionalRoutingContext before setThinclientRegionalEndpoint. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…it breaker and PPAF - Revert HashMap/deferred-insertion changes in LocationCache.java (keep only RegionalRoutingContext equals/hashCode fix) - Add fi-thinclient-multi-master group to PerPartitionCircuitBreakerE2ETests (miscellaneousDocumentOperationHitsTerminalExceptionAcrossKRegionsGateway) - Add fi-thinclient-multi-region group to PerPartitionAutomaticFailoverE2ETests (all 3 test methods) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…oint CaseInsensitiveMap.convertKey() calls toString().toLowerCase() on map keys. When thinclientRegionalEndpoint is included in toString(), stored keys (with thinclient set) differ from lookup keys (thinclient null), causing getRegionName() to fail for all cross-region features in thin client mode. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…nabled Thin client only supports GATEWAY mode. DIRECT mode tests use mock TransportClient which is not applicable to thin client routing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yProxy ThinClientStoreModel uses RNTBD binary encoding for request/response, which is incompatible with the standard HTTP mock responses created by setupHttpClientToReturnSuccessResponse(). Replace thinProxy with gatewayProxy in PPAF GATEWAY mock paths so data requests use the same mocked HttpClient with standard HTTP encoding. PPAF retry/failover logic is transport-agnostic, so this correctly tests the failover behavior without requiring RNTBD-encoded mock responses. Also add COSMOS.THINCLIENT_ENABLED and COSMOS.HTTP2_ENABLED system properties to fi-thinclient-multi-region and fi-thinclient-multi-master Maven profiles to ensure Configs.isThinClientEnabled() returns true in forked JVM. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Revert COSMOS.THINCLIENT_ENABLED and COSMOS.HTTP2_ENABLED from pom.xml profiles — these come from tests.yml instead. Add assertThinClientEndpointUsed validation to: - PerPartitionCircuitBreakerE2ETests execute() failure and recovery flows to verify requests hit thin client proxy endpoint (:10250/) - PerPartitionAutomaticFailoverE2ETests third test method (GATEWAY FaultInjection path) to verify thin client endpoint usage Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Exception responses during fault injection failure flow may not have thin client endpoint in diagnostics. Only assert thin client endpoint on success responses (cosmosException == null). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Apply same cosmosException == null guard to the recovery flow assertion — recovery iterations can also have exception responses. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a partition-key-scoped query is hedged to a secondary region on Gateway V2 (thin client), the hedged request fails with:
Trigger conditions (all must be true):
setPartitionKey)Why it happens:
executeFeedOperationWithAvailabilityStrategyclones theRxDocumentServiceRequestviareq.clone()for each hedged region.clone()copiespartitionKeyInternal(PK value) andhasFeedRangeFilteringBeenApplied = true, but does not copypartitionKeyDefinition(PK schema). ThehasFeedRangeFilteringBeenApplied = trueflag prevents downstream code (populateHeadersAsync) from re-resolving the collection to set PK def. WhenThinClientStoreModel.wrapInHttpRequestcomputes EPK bytes, it needs both — and NPEs on the null PK def.Why only Gateway V2: GW V1 never reads
partitionKeyDefinitionfrom the request — it serializes the PK as a JSON header and the server resolves EPK. GW V2 must compute EPK client-side for the RNTBD binary protocol.Why not point operations: Point operation hedging (
wrapPointOperationWithAvailabilityStrategy) clonesRequestOptionsand builds a fresh request per region — noreq.clone().Fix (defense in depth)
Primary —
ThinClientStoreModel.performRequestInternal: Overridden to reactively resolvepartitionKeyDefinitionfrom the collection cache when missing. The collection is already cached by this point, so this is a cheap cache hit.Secondary —
RxDocumentServiceRequest.clone(): Now copiespartitionKeyDefinitionalongsidepartitionKeyInternal, preventing the field from being silently dropped.Testing
Direct repro (verified fail → pass)
ThinClientStoreModelTest.cloneShouldPreservePartitionKeyDefinitionFaultInjectionServerErrorRuleOnGatewayV2Tests.faultInjectionServerErrorRuleTests_PkScopedQueryWithHedgingfi-thinclient-multi-master)The E2E test fault-injects a response delay in the primary region, triggering hedging to the secondary region. The hedged PK-scoped query on GW V2 exercises exactly the
clone()→wrapInHttpRequestpath.Holistic availability strategy coverage for thin client
Wired the
FaultInjectionWithAvailabilityStrategytest suite into thin-client CI groups:FITests_readAfterCreationfi-thinclient-multi-region,fi-thinclient-multi-masterFITests_writeAfterCreatefi-thinclient-multi-masterFITests_queryAfterCreationfi-thinclient-multi-region,fi-thinclient-multi-masterFITests_readManyAfterCreationfi-thinclient-multi-region,fi-thinclient-multi-masterFITests_readAllAfterCreationfi-thinclient-multi-region,fi-thinclient-multi-masterAdded thin client endpoint assertion in
FaultInjectionWithAvailabilityStrategyTestsBase.execute()— whenCOSMOS.THINCLIENT_ENABLEDandCOSMOS.HTTP2_ENABLEDare set withConnectionMode.GATEWAY, validates that requests targeted the thin client proxy endpoint (:10250/).Test utility consolidation
Moved
assertThinClientEndpointUsedtoTestSuiteBaseas a sharedprotected staticutility. Removed 3 duplicate copies fromFaultInjectionServerErrorRuleOnGatewayV2Tests,CosmosNotFoundTests, andClientRetryPolicyE2ETestsWithGatewayV2.All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines