Fix MutableJsonIndexImpl corrupting doc ID mapping on invalid UTF-16 strings by spapin · Pull Request #18738 · apache/pinot

spapin · 2026-06-11T14:49:44Z

Why

MutableJsonIndexImpl#add extends _docIdMapping for a document's flattened records, then builds the posting lists and increments _nextFlattenedDocId. While building those posting lists it maintains a running _bytesSize (a memory heuristic used to cap the mutable index) via Guava's Utf8.encodedLength(...).

Utf8.encodedLength throws IllegalArgumentException when a string contains an unpaired UTF-16 surrogate. Because it's called inside computeIfAbsent — after _docIdMapping has already been extended but before _nextFlattenedDocId is incremented — a single malformed value aborts the build loop and leaves _docIdMapping and _nextFlattenedDocId permanently out of sync. _nextDocId still advances in the finally block, so every subsequent document's json_match result is silently shifted by one flattened doc id.

This behavior lasts until the segment is flushed. The non-mutable json indices do not estimate size, so its ids are in sync.

Fix

Route the size accounting through an estimateLength helper that falls back to String.getBytes(UTF_8).length (which never throws — malformed input is replaced) when Utf8.encodedLength throws. _bytesSize is only a heuristic, so an approximate length for a malformed string is acceptable; aborting indexing is not.

Testing

Added testAddRecordWithUnpairedSurrogateDoesNotShiftDocIds: indexes three records where the middle value is an unpaired surrogate (\uD800) and asserts name='first' → doc 0 and name='third' → doc 2. Fails before this change (doc ids shifted), passes after.

Fixes #18737

… surrogates Utf8.encodedLength throws on unpaired surrogates, aborting the posting-list build loop after _docIdMapping has already been extended. This leaves the two counters permanently out of sync, silently shifting all subsequent documents' json_match results by one. _bytesSize is only a memory heuristic, so a size calculation must never abort indexing. Wraps the calls in estimateLength(), which falls back to String.getBytes(UTF_8) — which never throws — for malformed strings. Fixes apache#18737

spapin · 2026-06-15T15:34:34Z

    Preconditions.checkState(_nextFlattenedDocId + numRecords >= 0, "Got more than %s flattened records",
        Integer.MAX_VALUE);
    for (int i = 0; i < numRecords; i++) {
      _docIdMapping.add(_nextDocId);


Here, we add all the records to _docIdMapping.

On line 138, the call to Utf8.encodedLength may crash the function before _postingListMap is also increased. This is the source of the discrepancy between the json index and the mapped docId. Fixing the estimator ensures the issue doesn't happen.

codecov-commenter · 2026-06-16T20:32:04Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.53%. Comparing base (e44daae) to head (a50a6c9).
⚠️ Report is 24 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18738      +/-   ##
============================================
- Coverage     64.54%   64.53%   -0.01%     
  Complexity     1305     1305              
============================================
  Files          3380     3380              
  Lines        209642   209645       +3     
  Branches      32776    32776              
============================================
- Hits         135304   135300       -4     
- Misses        63480    63486       +6     
- Partials      10858    10859       +1

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.53% <100.00%> (-0.01%)`	⬇️
temurin	`64.53% <100.00%> (-0.01%)`	⬇️
unittests	`64.53% <100.00%> (-0.01%)`	⬇️
unittests1	`56.98% <0.00%> (-0.02%)`	⬇️
unittests2	`37.02% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

spapin commented Jun 15, 2026

View reviewed changes

raghavyadav01 requested a review from xiangfu0 June 16, 2026 19:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix MutableJsonIndexImpl corrupting doc ID mapping on invalid UTF-16 strings#18738

Fix MutableJsonIndexImpl corrupting doc ID mapping on invalid UTF-16 strings#18738
spapin wants to merge 1 commit into
apache:masterfrom
spapin:spapin/fix-json-mutable-index-corruption

spapin commented Jun 11, 2026

Uh oh!

spapin Jun 15, 2026

Uh oh!

codecov-commenter commented Jun 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

spapin commented Jun 11, 2026

Why

Fix

Testing

Uh oh!

spapin Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov-commenter commented Jun 16, 2026 •

edited

Loading