Adds intermediate dataType to schema and use it for ingestion aggregation by noob-se7en · Pull Request #16868 · apache/pinot

noob-se7en · 2025-09-22T18:40:42Z

Problem
Related to #16317. TL;DR: When an ingestion aggregation or transformation reads a source column that is not present in the schema, data type conversion can throw exceptions, because Pinot has no type information for a column the schema does not declare.
Example: for the ingestion aggregation sum(price), if the price column is not part of the schema, Pinot assumes it is a Number, but it can be a String in the source.
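To make the failure mode concrete, here is a minimal sketch, assuming the aggregator casts the raw value to Number when it has no type information. The class, enum, and method names are hypothetical and are not Pinot's actual code.

```java
// Hypothetical sketch of the failure mode; this is not Pinot's aggregator code.
final class SourceTypeSumSketch {
  // Hypothetical stand-in for a declared source column type.
  enum SourceType { LONG, DOUBLE, STRING }

  // Treats every raw value as a Number, which throws ClassCastException for a String.
  static double addAssumingNumber(double current, Object rawValue) {
    return current + ((Number) rawValue).doubleValue();
  }

  // Picks the conversion from the declared source type instead of guessing.
  static double addWithSourceType(double current, Object rawValue, SourceType sourceType) {
    if (sourceType == SourceType.STRING) {
      return current + Double.parseDouble((String) rawValue);
    }
    return current + ((Number) rawValue).doubleValue();
  }

  public static void main(String[] args) {
    System.out.println(addWithSourceType(1.0, "2.5", SourceType.STRING));
    try {
      addAssumingNumber(1.0, "2.5");
    } catch (ClassCastException e) {
      System.out.println("String value cannot be cast to Number");
    }
  }
}
```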

PR
Add a new intermediate field spec to the schema, as shown below, and use it in ingestion aggregation.

  "intermediateFieldSpecs": [
    {
      "name": "price",
      "dataType": "STRING"
    }
  ],
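One way such a spec could be consumed is sketched below, assuming the lookup prefers a regular schema column and falls back to the intermediate spec. The class name, the string-typed data types, and the lookup order are all assumptions for illustration; the PR does not spell out this logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the lookup order and all names here are assumptions.
final class IntermediateTypeLookupSketch {
  // Column name -> data type for regular schema columns.
  private final Map<String, String> schemaColumnTypes = new HashMap<>();
  // Column name -> data type declared under intermediateFieldSpecs.
  private final Map<String, String> intermediateColumnTypes = new HashMap<>();

  void addSchemaColumn(String name, String dataType) {
    schemaColumnTypes.put(name, dataType);
  }

  void addIntermediateColumn(String name, String dataType) {
    intermediateColumnTypes.put(name, dataType);
  }

  // Returns the declared type of a source column, or null when nothing declares it.
  String resolveSourceType(String column) {
    String schemaType = schemaColumnTypes.get(column);
    return schemaType != null ? schemaType : intermediateColumnTypes.get(column);
  }

  public static void main(String[] args) {
    IntermediateTypeLookupSketch lookup = new IntermediateTypeLookupSketch();
    lookup.addSchemaColumn("totalPrice", "DOUBLE");
    lookup.addIntermediateColumn("price", "STRING");
    System.out.println(lookup.resolveSourceType("price"));
  }
}
```

A null result leaves the caller free to keep whatever default it used before the spec existed.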

Pending
Adding more tests. Opening this PR to get early reviews.

codecov-commenter · 2025-09-23T08:19:30Z

Codecov Report

❌ Patch coverage is 52.66667% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.23%. Comparing base (4ef8e3e) to head (c926962).
⚠️ Report is 332 commits behind head on master.

Files with missing lines	Patch %	Lines
...t/local/aggregator/MinMaxRangeValueAggregator.java	19.04%	12 Missing and 5 partials ⚠️
...t/segment/local/aggregator/AvgValueAggregator.java	15.78%	11 Missing and 5 partials ⚠️
...rc/main/java/org/apache/pinot/spi/data/Schema.java	18.75%	12 Missing and 1 partial ⚠️
...local/indexsegment/mutable/MutableSegmentImpl.java	75.00%	9 Missing and 3 partials ⚠️
...g/apache/pinot/spi/data/IntermediateFieldSpec.java	0.00%	3 Missing ⚠️
...t/segment/local/aggregator/MaxValueAggregator.java	75.00%	1 Missing and 1 partial ⚠️
...t/segment/local/aggregator/MinValueAggregator.java	75.00%	1 Missing and 1 partial ⚠️
.../local/aggregator/SumPrecisionValueAggregator.java	80.00%	1 Missing and 1 partial ⚠️
...t/segment/local/aggregator/SumValueAggregator.java	75.00%	1 Missing and 1 partial ⚠️
...segment/local/aggregator/ValueAggregatorUtils.java	66.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #16868      +/-   ##
============================================
- Coverage     63.25%   63.23%   -0.02%     
  Complexity     1499     1499              
============================================
  Files          3174     3176       +2     
  Lines        190323   190430     +107     
  Branches      29080    29096      +16     
============================================
+ Hits         120381   120422      +41     
- Misses        60606    60654      +48     
- Partials       9336     9354      +18

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.18% <52.66%> (-0.03%)`	⬇️
java-21	`63.21% <52.66%> (-0.01%)`	⬇️
temurin	`63.23% <52.66%> (-0.02%)`	⬇️
unittests	`63.23% <52.66%> (-0.02%)`	⬇️
unittests1	`55.58% <25.33%> (-0.05%)`	⬇️
unittests2	`34.10% <50.66%> (+0.03%)`	⬆️


noob-se7en · 2025-10-16T08:18:50Z

@Jackie-Jiang added an intermediate field spec to the schema, for example:

  "intermediateFieldSpecs": [
    {
      "name": "random",
      "dataType": "STRING"
    }
  ],

9aman · 2025-11-03T06:35:32Z

@noob-se7en

Will it impact segment reload (due to the schema change), etc.?
- Its impact on existing segments: given that these transformations happen at ingestion time, were we failing the segment build in such scenarios (referring to the issues mentioned above)?
- Its impact on pauseless ingestion, i.e. scenarios of continued ingestion without a segment build: will we rely on DR here?
How are we handling transformations in such scenarios? Is the expectation that the column being transformed is part of the schema?

9aman · 2025-11-03T07:37:08Z

@noob-se7en

Will it impact segment reload (due to the schema change), etc.?

Its impact on existing segments: given that these transformations happen at ingestion time, were we failing the segment build in such scenarios (referring to the issues mentioned above)?

Its impact on pauseless ingestion, i.e. scenarios of continued ingestion without a segment build: will we rely on DR here?

How are we handling transformations in such scenarios? Is the expectation that the column being transformed is part of the schema?

I guess that for transformations the ingestion itself will throw exceptions at the row level, and we won't wait until the segment build?

noob-se7en · 2025-11-03T16:52:44Z

@noob-se7en

Will it impact segment reload (due to the schema change), etc.?

Its impact on existing segments: given that these transformations happen at ingestion time, were we failing the segment build in such scenarios (referring to the issues mentioned above)?

Its impact on pauseless ingestion, i.e. scenarios of continued ingestion without a segment build: will we rely on DR here?

How are we handling transformations in such scenarios? Is the expectation that the column being transformed is part of the schema?

I don't fully understand the questions. The code changes are only in MutableSegmentImpl.
It should not impact segment reload, right?

This PR is only meant to support realtime ingestion aggregation (which happens during indexing of mutable segments).

Jackie-Jiang

Well done.

Given that the field type name cannot be changed in the future, do you see "intermediate" as a common field type name in other DBs?

Jackie-Jiang · 2025-11-24T23:23:36Z

@@ -49,11 +49,28 @@ public interface ValueAggregator<R, A> {
  A getInitialAggregatedValue(@Nullable R rawValue);


Seems we can deprecate this method as long as A applyRawValue(A value, R rawValue);

Jackie-Jiang · 2025-11-24T23:24:19Z

+   * Returns the initial aggregated value with the optional source data type provided for correct raw value handling.
+   * Default implementation delegates to {@link #getInitialAggregatedValue(Object)} for backward compatibility.
+   */
+  default A getInitialAggregatedValue(@Nullable R rawValue, @Nullable DataType sourceDataType) {


Star-tree builder can also be switched to use the new set of methods
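The backward-compatible overload in the hunk above can be sketched as follows. This is a simplified, hypothetical mirror of the interface: the type names carry a Sketch prefix to make clear they are not the real Pinot types, and the two aggregators are illustrative only.

```java
// Hypothetical stand-in for the source data type enum.
enum SketchDataType { LONG, DOUBLE, STRING }

// Simplified, hypothetical mirror of the interface shown in the hunk.
interface SketchValueAggregator<R, A> {
  A getInitialAggregatedValue(R rawValue);

  // New overload: delegates to the old method, so existing implementations keep compiling.
  default A getInitialAggregatedValue(R rawValue, SketchDataType sourceDataType) {
    return getInitialAggregatedValue(rawValue);
  }
}

// Existing aggregator: implements only the old method and inherits the default overload.
final class LegacySumAggregator implements SketchValueAggregator<Number, Double> {
  @Override
  public Double getInitialAggregatedValue(Number rawValue) {
    return rawValue == null ? 0.0 : rawValue.doubleValue();
  }
}

// Type-aware aggregator: overrides the new overload to honor the declared source type.
final class TypeAwareSumAggregator implements SketchValueAggregator<Object, Double> {
  @Override
  public Double getInitialAggregatedValue(Object rawValue) {
    return getInitialAggregatedValue(rawValue, null);
  }

  @Override
  public Double getInitialAggregatedValue(Object rawValue, SketchDataType sourceDataType) {
    if (rawValue == null) {
      return 0.0;
    }
    if (sourceDataType == SketchDataType.STRING) {
      return Double.parseDouble((String) rawValue);
    }
    // Without a declared STRING type the value is still assumed to be a Number.
    return ((Number) rawValue).doubleValue();
  }
}
```

A caller that knows the source type, such as the star-tree builder mentioned above, could always call the two-argument form; aggregators that have not been updated would simply ignore the extra argument.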


Jackie-Jiang · 2026-06-20T00:14:12Z

Taking a different approach in #18816, where users can add an optional data type conversion for any source field.

noob-se7en added 2 commits September 23, 2025 00:05

Adds source dataType for aggregation

75a9c0c

fix tests

89f6a40

noob-se7en added 2 commits September 23, 2025 15:45

Adds test

695a635

Adds fieldSpec to schema

3b0715b

noob-se7en added 4 commits October 27, 2025 19:44

updates aggregators

ad4a996

updates aggregators

0bd549c

fixes lint

2be512b

fixes lint

346c9ea

noob-se7en changed the title ~~Adds source dataType for aggregation~~ Adds intermediate dataType to schema and use it in ingestion aggregation Oct 30, 2025

noob-se7en marked this pull request as ready for review October 30, 2025 06:18

noob-se7en changed the title ~~Adds intermediate dataType to schema and use it in ingestion aggregation~~ Adds intermediate dataType to schema and use it for ingestion aggregation Oct 30, 2025

noob-se7en added 2 commits November 17, 2025 14:35

Fixes bug

d765a2d

Merge branch 'master' of github.com:apache/pinot into fix_aggregation

7e42a55

Jackie-Jiang added enhancement Improvement to existing functionality ingestion Related to data ingestion pipeline labels Nov 19, 2025

Jackie-Jiang reviewed Nov 24, 2025


noob-se7en and others added 3 commits December 28, 2025 14:37

merge master

a22c5cf

Merge branch 'master' of github.com:apache/pinot into fix_aggregation

15dcf7f

Merge branch 'fix_aggregation' of github.com:Harnoor7/pinot into fix_aggregation

c926962

xiangfu0 added the schema Related to table schema definitions or changes label Mar 20, 2026

noob-se7en wants to merge 13 commits into apache:master from noob-se7en:fix_aggregation

5 participants
