Support GROUP BY GROUPING SETS / ROLLUP / CUBE in the multi-stage query engine by xiangfu0 · Pull Request #18817 · apache/pinot

xiangfu0 · 2026-06-20T08:46:20Z

Summary

Adds native execution of GROUP BY GROUPING SETS (...) / ROLLUP(...) / CUBE(...) and the
GROUPING() / GROUPING_ID() functions to the multi-stage query engine (MSE), building on the
single-stage support in #18662.

Two execution paths, both producing identical results:

1. Aggregate directly over a table scan: the whole aggregate is pushed down to the single-stage (leaf) engine, which expands each row across the grouping sets and appends the synthetic $groupingId discriminator. This reuses the single-stage engine wholesale (the Doris/StarRocks backend-expansion model).
2. Aggregate over any other input (e.g. after a JOIN): a new RepeatOperator in the multi-stage runtime performs the same per-set row expansion (NULLing rolled-up columns, appending $groupingId), then an ordinary GROUP BY over [union keys…, $groupingId] runs, with no grouping-set-specific aggregation logic.

Plan split (PinotAggregateExchangeNodeInsertRule): LEAF → EXCHANGE(hash union+$groupingId) → FINAL(group by union+$groupingId) → PROJECT. The PROJECT computes GROUPING() / GROUPING_ID() from
$groupingId (bit extraction), mirroring the single-stage post-aggregation handler, and drops $groupingId.
GROUPING / GROUPING_ID are registered in PinotOperatorTable, and explicit
GROUP BY GROUPING SETS ((a, b), …) tuple syntax is accepted by the validator.
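
To make the $groupingId discriminator concrete, here is a minimal sketch of the values a ROLLUP or CUBE over n union columns would produce, assuming the convention stated elsewhere in this PR (bit i is set iff union column i is rolled up). The class and method names are hypothetical and the enumeration order is an assumption; this is not the Pinot or Calcite implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Pinot code base.
final class GroupingSetIdsSketch {
  private static void checkColumnCount(int n) {
    // The PR caps grouping columns at 31 because $groupingId is a 32-bit int.
    if (n < 0 || n > 31) {
      throw new IllegalArgumentException("expected 0..31 grouping columns, got " + n);
    }
  }

  /** ROLLUP(c0..c{n-1}) keeps prefixes of length n, n-1, ..., 0 and rolls up the rest. */
  static List<Integer> rollupGroupingIds(int n) {
    checkColumnCount(n);
    List<Integer> ids = new ArrayList<>();
    for (int kept = n; kept >= 0; kept--) {
      int id = 0;
      for (int col = kept; col < n; col++) {
        id |= 1 << col; // column `col` is rolled up in this grouping set
      }
      ids.add(id);
    }
    return ids;
  }

  /** CUBE(c0..c{n-1}) rolls up every subset of columns, so the ids are 0 .. 2^n - 1. */
  static List<Integer> cubeGroupingIds(int n) {
    checkColumnCount(n);
    List<Integer> ids = new ArrayList<>();
    for (long id = 0; id < (1L << n); id++) {
      ids.add((int) id);
    }
    return ids;
  }

  public static void main(String[] args) {
    System.out.println(rollupGroupingIds(2)); // sets (a, b), (a), ()
    System.out.println(cubeGroupingIds(2));   // sets (a, b), (b), (a), ()
  }
}
```

For ROLLUP(a, b) the three sets map to ids 0, 2 and 3 under this convention: nothing rolled up, b rolled up, and both rolled up.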

Tests

GroupingSetsQueriesTest — grouping-set aggregation, GROUPING() / GROUPING_ID() (SELECT and HAVING),
filtered aggregation, composite/nested sets, explicit GROUPING SETS syntax, and grouping sets over a JOIN —
run against both engines (the order-independent cases via the useBothQueryEngines data provider), plus the
single-stage genuine-vs-rolled-up-NULL discrimination cases. All green.

Known limitations (rejected with a clear error → never wrong results)

- The v2 physical planner (usePhysicalOptimizer=true, opt-in, default off) still rejects grouping sets; the default planner is fully supported. (Follow-up.)
- More than 31 distinct grouping columns are rejected (the $groupingId bitmask is a 32-bit int); this is the same cap as the single-stage engine.
- Leaf group-trim (ORDER BY … LIMIT pushdown) is not pushed for grouping-set queries; results are still correct (the broker applies the final ORDER BY + LIMIT).

…ion until supported

Foundation for GROUP BY GROUPING SETS / ROLLUP / CUBE in the multi-stage engine.

- AggregateNode now carries per-set membership bitmasks over its union group keys (plan.proto field 8 + serde), mirroring the single-stage PinotQuery.groupingSetMasks so the per-set row expansion can later be pushed down to the single-stage (leaf) engine.
- Until the leaf pushdown + final-stage merge land, both AggregateNode converters reject ROLLUP/CUBE/GROUPING SETS with a clear error instead of silently collapsing to a plain GROUP BY (which dropped the subtotal/grand-total rows). These queries run on the single-stage engine, which supports them natively.

…ion to the single-stage leaf

Implements multi-stage execution of grouping-set aggregates by pushing the per-set row expansion down to the single-stage (leaf) engine (the same backend-expansion model used by Doris/StarRocks), reusing the single-stage GROUPING SETS support (apache#18662) wholesale.

Plan split (PinotAggregateExchangeNodeInsertRule): a grouping-set aggregate becomes LEAF -> EXCHANGE -> FINAL -> PROJECT.

- LEAF carries the grouping sets and emits the synthetic $groupingId discriminator column after the union group keys (PinotLogicalAggregate.deriveRowType).
- EXCHANGE hash-partitions by the union keys AND $groupingId so each (set, key) co-locates.
- FINAL is a plain SIMPLE aggregate grouping by [union keys..., $groupingId], so rows from different grouping sets stay distinct with no grouping-set-specific merge logic.
- PROJECT drops $groupingId to restore the original aggregate row type.

Leaf pushdown (ServerPlanRequestVisitor): sets PinotQuery.groupingSetMasks so the single-stage engine does the expansion + $groupingId, identical to the single-stage path. RelToPlanNodeConverter encodes Calcite groupSets into the AggregateNode masks (carried via the plan.proto field added earlier). The v2 physical planner still rejects grouping sets.

Verified by MSE-vs-single-stage parity integration tests for ROLLUP, CUBE, mixed plain+ROLLUP, single-column ROLLUP, and ROLLUP+HAVING (GroupingSetsQueriesTest#testMultiStage*).

Known limitations (follow-ups): GROUPING() / GROUPING_ID(), explicit GROUP BY GROUPING SETS ((a,b),...) tuple syntax (parsed as a ROW by the multi-stage validator), the v2 physical planner, leaf group trim, and >31 grouping columns are rejected with a clear error so the query runs on the single-stage engine instead.
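
The FINAL step described above can be pictured as an ordinary keyed merge. The sketch below is a hypothetical illustration (the class name, row layout and SUM-only merge are assumptions, not Pinot code): partial rows are merged in a map keyed by the union keys plus $groupingId, which is what keeps a genuine NULL key apart from a rolled-up NULL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Pinot code base.
final class FinalMergeSketch {
  /**
   * Merges partial SUMs. Each partial row is assumed to be laid out as
   * [unionKey0, ..., unionKey{numKeys-1}, groupingId, partialSum].
   */
  static Map<List<Object>, Long> mergePartials(List<Object[]> partials, int numKeys) {
    Map<List<Object>, Long> merged = new LinkedHashMap<>();
    for (Object[] row : partials) {
      // Group key = union keys followed by $groupingId (null keys are allowed).
      List<Object> key = new ArrayList<>(Arrays.asList(row).subList(0, numKeys + 1));
      long partialSum = (Long) row[numKeys + 1];
      merged.merge(key, partialSum, Long::sum);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Object[]> partials = new ArrayList<>();
    partials.add(new Object[] {"US", null, 0, 5L}); // genuine NULL in column 1
    partials.add(new Object[] {"US", null, 2, 7L}); // column 1 rolled up (bit 1 set)
    partials.add(new Object[] {"US", null, 2, 3L}); // same set and key, from another leaf
    System.out.println(mergePartials(partials, 2));
  }
}
```

Both kinds of row carry NULL in the second key, but they land in different groups because their $groupingId values differ, so no grouping-set-specific merge logic is needed.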

…ROLLUP / CUBE

Computes GROUPING() / GROUPING_ID() in the multi-stage engine from the $groupingId discriminator, mirroring the single-stage post-aggregation handler.

- Register SqlStdOperatorTable.GROUPING and GROUPING_ID in PinotOperatorTable so the validator resolves them.
- In the grouping-set plan split, GROUPING() / GROUPING_ID() are not pushed to the LEAF/FINAL aggregations (they are not real aggregations). They are split out and computed in the final PROJECT as bit expressions over the $groupingId column: GROUPING(col) = bitAnd(bitShiftRightUnsigned($groupingId, k), 1), where k is the column's index in the union group keys (matching the single-stage bit convention where a set bit means the column is rolled up). Multi-argument calls are packed with the first argument as the most significant bit. Real aggregate results are referenced from the FINAL output; $groupingId is dropped.

Verified by MSE-vs-single-stage parity tests for GROUPING()/GROUPING_ID() in SELECT and in HAVING (GroupingSetsQueriesTest#testMultiStageGroupingFunctionsParity, testMultiStageGroupingInHavingParity), with all existing ROLLUP/CUBE parity tests still green.
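
The bit extraction described in this commit can be sketched in plain Java. The class and method names below are hypothetical; only the formula (unsigned right shift by k, AND with 1, first argument as most significant bit) comes from the commit message.

```java
// Hypothetical illustration only; not part of the Pinot code base.
final class GroupingFunctionsSketch {
  /** GROUPING(col): 1 if the union column at index k is rolled up in this row, else 0. */
  static int grouping(int groupingId, int k) {
    return (groupingId >>> k) & 1;
  }

  /** GROUPING_ID(c0, c1, ...): packs GROUPING(ci), first argument as the most significant bit. */
  static int groupingIdOf(int groupingId, int... unionIndexes) {
    int packed = 0;
    for (int k : unionIndexes) {
      packed = (packed << 1) | grouping(groupingId, k);
    }
    return packed;
  }

  public static void main(String[] args) {
    // Union keys (a, b) have indexes (0, 1); a row from grouping set (a) has b rolled up.
    int groupingId = 0b10;
    System.out.println(grouping(groupingId, 0));        // GROUPING(a)
    System.out.println(grouping(groupingId, 1));        // GROUPING(b)
    System.out.println(groupingIdOf(groupingId, 0, 1)); // GROUPING_ID(a, b)
  }
}
```

Because the arguments of GROUPING_ID need not follow the union-key order, the packed result can differ from the raw $groupingId: in the example, GROUPING_ID(a, b) is 1 while $groupingId is 2.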

… syntax

The multi-stage validator rejected explicit GROUP BY GROUPING SETS ((a, b), (a), ()) because the parenthesized grouping-set tuples are parsed as ROW expressions, and RowExpressionValidationVisitor only allowed ROW operands under VALUES / INSERT / ARRAY constructors. Allow them under the GROUPING_SETS / ROLLUP / CUBE constructs as well (the tuples are grouping sets, not row constructors). ROLLUP / CUBE already worked because their operands are bare column references.

With this, explicit GROUPING SETS syntax flows through to the leaf-pushdown execution path that already handles grouping sets, matching the single-stage engine.

Broaden the multi-stage parity tests: explicit GROUPING SETS syntax, a composite grouping-set level ((a, b)), and filtered aggregation combined with grouping sets.

Now that the multi-stage engine supports GROUP BY GROUPING SETS / ROLLUP / CUBE, run the order-independent aggregation and GROUPING() / GROUPING_ID() tests on both engines via the useBothQueryEngines data provider (verifying correctness directly on each engine, replacing the v1-vs-v2 parity tests, which are now redundant). Kept single-stage only, with comments, the tests that are inherently engine-specific: null-handling-disabled queries (the engines differ on reading genuine NULLs), multi-value grouping columns (rejected as an intermediate-stage key in the multi-stage engine), the single-stage ORDER BY comparator paths, and the compile-time rejection cases.

…t rejected

The increment that added multi-stage GROUPING SETS / ROLLUP / CUBE execution made the old QueryCompilationTest#testGroupingSetsRejectedInMultiStage assertion stale (it expected a rejection error). Replace it with testGroupingSetsSupportedInMultiStage, asserting that these queries (including GROUPING() / GROUPING_ID() and explicit GROUPING SETS tuple syntax) now plan successfully.

codecov-commenter · 2026-06-20T11:00:03Z

Codecov Report

❌ Patch coverage is 61.38996% with 100 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.75%. Comparing base (0bceb35) to head (4187531).
⚠️ Report is 7 commits behind head on master.

Files with missing lines	Patch %	Lines
...r/physical/v2/opt/rules/AggregatePushdownRule.java	0.00%	61 Missing and 1 partial ⚠️
...inot/query/runtime/operator/AggregateOperator.java	11.76%	14 Missing and 1 partial ⚠️
...e/pinot/query/runtime/operator/RepeatOperator.java	84.00%	7 Missing and 1 partial ⚠️
...el/rules/PinotAggregateExchangeNodeInsertRule.java	93.50%	1 Missing and 4 partials ⚠️
...y/planner/physical/v2/nodes/PhysicalAggregate.java	40.00%	1 Missing and 2 partials ⚠️
...not/query/runtime/operator/MultiStageOperator.java	25.00%	3 Missing ⚠️
...he/pinot/query/planner/plannode/AggregateNode.java	75.00%	0 Missing and 2 partials ⚠️
.../runtime/plan/server/ServerPlanRequestVisitor.java	0.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18817      +/-   ##
============================================
+ Coverage     64.74%   64.75%   +0.01%     
  Complexity     1319     1319              
============================================
  Files          3390     3393       +3     
  Lines        210693   211173     +480     
  Branches      33070    33155      +85     
============================================
+ Hits         136414   136750     +336     
- Misses        63287    63396     +109     
- Partials      10992    11027      +35

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.75% <61.38%> (+0.01%)`	⬆️
temurin	`64.75% <61.38%> (+0.01%)`	⬆️
unittests	`64.75% <61.38%> (+0.01%)`	⬆️
unittests1	`56.95% <61.38%> (+0.03%)`	⬆️
unittests2	`37.15% <6.94%> (-0.04%)`	⬇️

xiangfu0

Found one high-signal compatibility issue; see inline comment.

xiangfu0 · 2026-06-20T12:11:20Z

+    for (int i = 0; i < groupCount; i++) {
+      builder.add(fields.get(i));
+    }
+    builder.add(GroupingSets.GROUPING_ID_COLUMN, typeFactory.createSqlType(SqlTypeName.INTEGER));


This changes the leaf-stage wire row layout, but there is no mixed-version gate for the new broker/server contract. Older servers will ignore AggregateNode.groupingSets, build a plain grouped leaf PinotQuery, and return the pre-$groupingId shape during a rolling upgrade. That breaks Pinot's mixed-version broker/server guarantee for this feature. Can we gate this plan shape on worker capability, or fall back to single-stage / explicit failure until all servers are upgraded?

…a a RepeatOperator

Grouping sets were only executable when the aggregate sat directly on a table scan (pushed down to the single-stage leaf, which expands rows per set). Over a JOIN (or any intermediate-stage input) the aggregate runs in the multi-stage runtime, which could not expand rows.

Add RepeatOperator: for each input row and grouping set it emits one row with the rolled-up columns set to NULL and the synthetic $groupingId discriminator appended (bit i set iff union column i is rolled up, matching the single-stage convention). This is the multi-stage equivalent of the single-stage per-set expansion / Doris's RepeatNode.

AggregateOperator wraps the input of a grouping-set aggregate in a RepeatOperator and runs an ordinary GROUP BY over the union columns plus $groupingId, so no grouping-set-specific aggregation logic is needed and the existing FINAL merge + GROUPING()/GROUPING_ID() projection handle the rest. The efficient single-stage leaf pushdown is preserved for aggregates directly over a scan (that path does not build AggregateOperator).

Verified end-to-end: testMultiStageRollupOverJoin and testMultiStageGroupingOverJoin (self-join + ROLLUP / GROUPING) alongside the existing scan-input parity tests.
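
The per-set row expansion described in this commit can be sketched as follows. This is a hypothetical, in-memory illustration rather than the actual RepeatOperator: the class name, the Object[] row layout (union keys first) and placing $groupingId as the last column are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the actual RepeatOperator.
final class RepeatSketch {
  /**
   * Emits one output row per (input row, grouping set).
   * rolledUpMasks[s] has bit i set iff union column i is rolled up in set s;
   * each input row is assumed to start with the numKeys union key columns.
   */
  static List<Object[]> expand(List<Object[]> rows, int numKeys, int[] rolledUpMasks) {
    List<Object[]> out = new ArrayList<>();
    for (Object[] row : rows) {
      for (int mask : rolledUpMasks) {
        Object[] copy = Arrays.copyOf(row, row.length + 1); // extra slot for $groupingId
        for (int i = 0; i < numKeys; i++) {
          if (((mask >>> i) & 1) == 1) {
            copy[i] = null; // NULL out the rolled-up key column
          }
        }
        copy[row.length] = mask; // append the $groupingId discriminator
        out.add(copy);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Object[]> rows = new ArrayList<>();
    rows.add(new Object[] {"US", "CA", 10L});
    // ROLLUP(country, state): sets (country, state), (country), ()
    for (Object[] r : expand(rows, 2, new int[] {0, 2, 3})) {
      System.out.println(Arrays.toString(r));
    }
  }
}
```

An ordinary GROUP BY over the first numKeys columns plus the appended discriminator then produces the detail, subtotal and grand-total rows in one pass.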

… physical planner

The v2 physical optimizer (usePhysicalOptimizer=true) runs its own AggregatePushdownRule instead of the default planner's PinotAggregateExchangeNodeInsertRule, so grouping-set aggregates were previously rejected on the v2 path. Mirror the single-stage split here:

- PhysicalAggregate.deriveRowType() appends the synthetic $groupingId INT discriminator after the union group-by columns for LEAF non-SIMPLE aggregates.
- AggregatePushdownRule splits a grouping-set aggregate into LEAF (carries the grouping sets) -> EXCHANGE (repartitions by the union keys plus $groupingId) -> FINAL (SIMPLE aggregate grouping by [union keys, $groupingId]) -> PROJECT (drops $groupingId). The per-set row expansion itself happens at runtime in the planner-agnostic RepeatOperator.
- PRelToPlanNodeConverter no longer rejects grouping sets; the masks flow to the runtime.

Guards mirror the single-stage rule: reject more than MAX_GROUPING_SET_COLUMNS (31) distinct grouping columns (the $groupingId bitmask is 32-bit), and reject GROUPING() / GROUPING_ID() over grouping sets on the v2 path for now (run those on the default planner).

Tested via GroupingSetsQueriesTest: testV2PhysicalPlannerRollup verifies the genuine vs rolled-up NULL discrimination through the v2 plan, and testV2PhysicalPlannerRejectsGroupingFunction confirms the v2 path is actually engaged (the default planner accepts GROUPING()).

xiangfu0

Found one high-signal issue in the new v2 grouping-sets path; see inline comment.

xiangfu0 · 2026-06-21T12:15:36Z

      hintOptions = Map.of();
    }
    boolean isInputExchange = call._currentNode.unwrap().getInput(0) instanceof Exchange;
+    if (aggRel.getGroupType() != Aggregate.Group.SIMPLE) {


This new GROUPING SETS branch now returns addPartialAggregateForGroupingSets(...) before it checks withinGroupCollation, but the logical planner explicitly avoids leaf/final splitting when a WITHIN GROUP ordering is present. With usePhysicalOptimizer=true, a query like LISTAGG(...) WITHIN GROUP (...) under ROLLUP/CUBE can now be partially aggregated without preserving the required order, which is a silent wrong-result risk instead of the previous planner fallback. Can we keep the same withinGroupCollation escape hatch here and skip partial aggregation when ordered aggregates are involved?

Parameterize the v2 physical-planner ROLLUP test with the useBothQueryEngines data provider. On the multi-stage engine, usePhysicalOptimizer engages the v2 split; the single-stage engine ignores the flag and runs the same ROLLUP, so the test now also cross-checks that both engines return identical rows.

…vider

Replace the hardcoded setUseMultiStageQueryEngine(true/false) calls on single-engine tests with the useV1QueryEngine / useV2QueryEngine data providers, so engine selection is declarative and uniform across the suite (matching the useBothQueryEngines tests). Single-stage-specific tests (null-handling-off, multi-value keys, single-stage ORDER BY, compile-time rejections) use useV1QueryEngine; multi-stage-specific tests (grouping sets over a JOIN, the v2 physical planner cases) use useV2QueryEngine. No behavior change.

Code review + cleanup, no behavior change:

- Centralize the $groupingId LEAF row-type layout in GroupingSets.appendGroupingIdColumn and call it from both PinotLogicalAggregate and PhysicalAggregate, so the default and v2 planners can no longer drift on the column layout.
- Add RepeatOperatorTest covering ROLLUP, CUBE, and a non-contiguous explicit grouping set (verifies the rolled-up columns are NULLed and the $groupingId discriminator is correct).
- Note the rolling-upgrade behavior of the new plan.proto groupingSets field (upgrade servers before brokers; an old server runs a plain GROUP BY and drops the subtotal rows).
- Drop the runRows test helper in favor of the inherited postQuery; remove vendor product references from comments; use /// markdown doc comments consistently.

xiangfu0 added 6 commits June 19, 2026 19:07

xiangfu0 marked this pull request as ready for review June 20, 2026 10:08

xiangfu0 force-pushed the mse-grouping-sets branch from 513b62b to 616c781 Compare June 21, 2026 02:39

xiangfu0 added 3 commits June 21, 2026 13:05

xiangfu0 wants to merge 11 commits into apache:master from xiangfu0:mse-grouping-sets
