[multistage] Support forcing colocated set operations to avoid data shuffle using query hint #18804
Open
yashmayya wants to merge 1 commit into
ea0f413 to 73180d9
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## master #18804 +/- ##
============================================
+ Coverage 64.78% 64.80% +0.01%
Complexity 1309 1309
============================================
Files 3381 3386 +5
Lines 209967 210166 +199
Branches 32891 32923 +32
============================================
+ Hits 136020 136188 +168
- Misses 62979 63014 +35
+ Partials 10968 10964 -4
Adds a new query hint `setOpOptions(is_colocated_by_set_op_keys='...')` that forces (or disables) colocated, pre-partitioned exchanges for set operations (`UNION` / `UNION ALL` / `INTERSECT` / `EXCEPT`) in order to avoid a data shuffle when the inputs are already partitioned compatibly. This is the set-operation equivalent of the existing `joinOptions(is_colocated_by_join_keys='...')` join hint and the `windowOptions(is_partitioned_by_window_keys='...')` window hint (#17395).

By default the planner inserts a hash exchange (on the full output row) below every input of a set operation. When the inputs are already co-partitioned, that shuffle is unnecessary; this hint lets the user assert colocation so the planner emits a direct (1-to-1, no-shuffle) exchange instead. This also registers the `setOpOptions` hint strategy (`HintPredicates.SETOP`), which was previously not registered at all.

Like the equivalent join / window hints, this is opt-in and trusts the user's assertion. Because a set operation matches rows on the entire output row, forcing `is_colocated_by_set_op_keys='true'` is only correct when every input is partitioned the same way (same partition function and count) on one or more of the projected columns, so that rows that are equal across all projected columns land on the same worker. Forcing it on data that is not actually colocated will produce incorrect results for `INTERSECT`, `EXCEPT` and distinct `UNION` (`UNION ALL` only concatenates, so it is always safe). The hint is honored by the V1 query planner; the V2 physical optimizer determines colocation on its own and ignores it.

Hint placement. Unlike a join/window node, a set operation is an ancestor of its branch `SELECT`s, so a hint on the leading `SELECT` does not naturally attach to it. The hint is therefore resolved from either the set operation itself or its first branch, supporting two placements:

1. Inline, on the first branch: `SELECT /*+ setOpOptions(is_colocated_by_set_op_keys='true') */ col FROM a UNION ALL SELECT col FROM b`
2. On an outer `SELECT` wrapping the set operation: `SELECT /*+ setOpOptions(is_colocated_by_set_op_keys='true') */ * FROM (SELECT col FROM a UNION ALL SELECT col FROM b)`

Two limitations worth calling out:

- Distinct `UNION` is rewritten to an aggregate over `UNION ALL` before the exchange rule runs, so the inline hint does not apply to it (use `UNION ALL`, or the outer-wrap form).
- For chained `INTERSECT` / `EXCEPT`, the inline hint only colocates the innermost level (a safe degradation: the outer levels shuffle); the outer-wrap form covers all levels.

Tests added:

- `QueryCompilationTest` cases asserting that the hint forces / disables a pre-partitioned exchange across `UNION ALL` / `INTERSECT` / `EXCEPT`, covering both the inline and outer-wrap placements, the no-hint baseline, auto-detection, the `='false'` override, and first-input-wins precedence when branches carry conflicting values.
- `ExplainPhysicalPlans.json` cases showing the hint turn a full shuffle into a `[PARTITIONED]` (direct, 1-to-1) exchange.
- `QueryHints.json` cases on physically-partitioned tables: `INTERSECT` / `EXCEPT` / `UNION ALL` with `='true'`, the `='false'` override, a multi-column set op colocated on a subset (the partition column) of the projected columns, and a mismatched-partition-count case where the planner cannot form a direct exchange and safely falls back to a shuffle.

Follow-up: user-facing documentation for the new hint will be added to the `pinot-docs` repo, mirroring the existing `is_colocated_by_join_keys` entry (scope, the V1-only note, and the partitioning precondition under which `'true'` is safe).
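
For reviewers who want an intuition for the partitioning precondition described above, here is a minimal Python sketch. It is not Pinot code. It simulates workers that evaluate a set operation locally, with no shuffle. The helper names (`partition`, `local_intersect`, `local_union_all`), the modulo partition function, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def partition(rows, key_index, num_partitions, offset=0):
    # Hypothetical partition function: (key + offset) mod num_partitions.
    # A non-zero offset stands in for "a different partition function".
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[(row[key_index] + offset) % num_partitions].append(row)
    return parts

def local_intersect(left_parts, right_parts):
    # INTERSECT evaluated worker by worker (direct 1-to-1 exchange, no shuffle).
    out = set()
    for left, right in zip(left_parts, right_parts):
        out |= set(left) & set(right)
    return out

def local_union_all(left_parts, right_parts):
    # UNION ALL only concatenates, so worker placement does not matter.
    out = []
    for left, right in zip(left_parts, right_parts):
        out.extend(left)
        out.extend(right)
    return out

# Two-column rows; only column 0 is used as the partition column,
# i.e. a subset of the projected columns.
table_a = [(1, "x"), (2, "y"), (3, "z"), (4, "w")]
table_b = [(2, "y"), (3, "z"), (5, "v")]
expected = set(table_a) & set(table_b)

# Same partition function and count on both inputs: equal rows meet on the same worker.
colocated = local_intersect(partition(table_a, 0, 2), partition(table_b, 0, 2))

# Same count but a different partition function on the right input:
# equal rows land on different workers and the local INTERSECT misses them.
mismatched = local_intersect(partition(table_a, 0, 2), partition(table_b, 0, 2, offset=1))

# UNION ALL yields the same multiset of rows even with mismatched partitioning.
union_all_mismatched = local_union_all(
    partition(table_a, 0, 2), partition(table_b, 0, 2, offset=1)
)

print("expected:  ", sorted(expected))
print("colocated: ", sorted(colocated))
print("mismatched:", sorted(mismatched))
```

In this toy setup the co-partitioned run matches the global `INTERSECT`, while the mismatched run silently drops matching rows. That is the failure mode the hint cannot guard against when the user's colocation assertion is wrong.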