Skip to content

Fix MutableJsonIndexImpl corrupting doc ID mapping on invalid UTF-16 strings#18738

Open
spapin wants to merge 1 commit into
apache:masterfrom
spapin:spapin/fix-json-mutable-index-corruption
Open

Fix MutableJsonIndexImpl corrupting doc ID mapping on invalid UTF-16 strings#18738
spapin wants to merge 1 commit into
apache:masterfrom
spapin:spapin/fix-json-mutable-index-corruption

Conversation

@spapin

@spapin spapin commented Jun 11, 2026

Copy link
Copy Markdown

Why

MutableJsonIndexImpl#add extends _docIdMapping for a document's flattened records, then builds the posting lists and increments _nextFlattenedDocId. While building those posting lists it maintains a running _bytesSize (a memory heuristic used to cap the mutable index) via Guava's Utf8.encodedLength(...).

Utf8.encodedLength throws IllegalArgumentException when a string contains an unpaired UTF-16 surrogate. Because it's called inside computeIfAbsent — after _docIdMapping has already been extended but before _nextFlattenedDocId is incremented — a single malformed value aborts the build loop and leaves _docIdMapping and _nextFlattenedDocId permanently out of sync. _nextDocId still advances in the finally block, so every subsequent document's json_match result is silently shifted by one flattened doc id.

This behavior lasts until the segment is flushed. The non-mutable json indices do not estimate size, so its ids are in sync.

Fix

Route the size accounting through an estimateLength helper that falls back to String.getBytes(UTF_8).length (which never throws — malformed input is replaced) when Utf8.encodedLength throws. _bytesSize is only a heuristic, so an approximate length for a malformed string is acceptable; aborting indexing is not.

Testing

Added testAddRecordWithUnpairedSurrogateDoesNotShiftDocIds: indexes three records where the middle value is an unpaired surrogate (\uD800) and asserts name='first' → doc 0 and name='third' → doc 2. Fails before this change (doc ids shifted), passes after.

Fixes #18737

… surrogates

Utf8.encodedLength throws on unpaired surrogates, aborting the posting-list
build loop after _docIdMapping has already been extended. This leaves the two
counters permanently out of sync, silently shifting all subsequent documents'
json_match results by one.

_bytesSize is only a memory heuristic, so a size calculation must never
abort indexing. Wraps the calls in estimateLength(), which falls back to
String.getBytes(UTF_8) — which never throws — for malformed strings.

Fixes apache#18737
Preconditions.checkState(_nextFlattenedDocId + numRecords >= 0, "Got more than %s flattened records",
Integer.MAX_VALUE);
for (int i = 0; i < numRecords; i++) {
_docIdMapping.add(_nextDocId);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we add all the records to _docIdMapping.

On line 138, the call to Utf8.encodedLength may crash the function before _postingListMap is also increased. This is the source of the discrepancy between the json index and the mapped docId. Fixing the estimator ensures the issue doesn't happen.

@raghavyadav01 raghavyadav01 requested a review from xiangfu0 June 16, 2026 19:39
@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.53%. Comparing base (e44daae) to head (a50a6c9).
⚠️ Report is 24 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18738      +/-   ##
============================================
- Coverage     64.54%   64.53%   -0.01%     
  Complexity     1305     1305              
============================================
  Files          3380     3380              
  Lines        209642   209645       +3     
  Branches      32776    32776              
============================================
- Hits         135304   135300       -4     
- Misses        63480    63486       +6     
- Partials      10858    10859       +1     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.53% <100.00%> (-0.01%) ⬇️
temurin 64.53% <100.00%> (-0.01%) ⬇️
unittests 64.53% <100.00%> (-0.01%) ⬇️
unittests1 56.98% <0.00%> (-0.02%) ⬇️
unittests2 37.02% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MutableJsonIndexImpl: unpaired UTF-16 surrogate in JSON value corrupts doc ID mapping

2 participants