Rewrite SUM(expr + scalar) --> SUM(expr) + scalar*COUNT(expr) #20665
Draft
alamb wants to merge 5 commits into apache:main from
Contributor
I was curious about the overflow behavior, which I think is right, but I noticed one difference (at least when going through SQL):
statement ok
CREATE TABLE IF NOT EXISTS tbl (val INTEGER UNSIGNED);
statement ok
INSERT INTO tbl VALUES (4294967295);
statement ok
INSERT INTO tbl VALUES (4294967295);
query II
SELECT SUM(val + 1), SUM(val + 2) FROM tbl;
----
8589934592 8589934594
query TT
EXPLAIN SELECT SUM(val + 1), SUM(val + 2) FROM tbl;
----
logical_plan
01)Aggregate: groupBy=[[]], aggr=[[sum(__common_expr_1 AS tbl.val + Int64(1)), sum(__common_expr_1 AS tbl.val + Int64(2))]]
02)--Projection: CAST(tbl.val AS Int64) AS __common_expr_1
03)----TableScan: tbl projection=[val]
physical_plan
01)AggregateExec: mode=Single, gby=[], aggr=[sum(tbl.val + Int64(1)), sum(tbl.val + Int64(2))]
02)--ProjectionExec: expr=[CAST(val@0 AS Int64) as __common_expr_1]
03)----DataSourceExec: partitions=1, partition_sizes=[2]
query RR
SELECT SUM(val) + 1 * COUNT(val), SUM(val) + 2 * COUNT(val) FROM tbl;
----
8589934592 8589934594
query TT
EXPLAIN SELECT SUM(val) + 1 * COUNT(val), SUM(val) + 2 * COUNT(val) FROM tbl;
----
logical_plan
01)Projection: __common_expr_1 + CAST(count(tbl.val) AS Decimal128(20, 0)) AS sum(tbl.val) + Int64(1) * count(tbl.val), __common_expr_1 AS sum(tbl.val) + CAST(Int64(2) * count(tbl.val) AS Decimal128(20, 0))
02)--Projection: CAST(sum(tbl.val) AS Decimal128(20, 0)) AS __common_expr_1, count(tbl.val)
03)----Aggregate: groupBy=[[]], aggr=[[sum(CAST(tbl.val AS UInt64)), count(tbl.val)]]
04)------TableScan: tbl projection=[val]
physical_plan
01)ProjectionExec: expr=[__common_expr_1@0 + CAST(count(tbl.val)@1 AS Decimal128(20, 0)) as sum(tbl.val) + Int64(1) * count(tbl.val), __common_expr_1@0 + CAST(2 * count(tbl.val)@1 AS Decimal128(20, 0)) as sum(tbl.val) + Int64(2) * count(tbl.val)]
02)--ProjectionExec: expr=[CAST(sum(tbl.val)@0 AS Decimal128(20, 0)) as __common_expr_1, count(tbl.val)@1 as count(tbl.val)]
03)----AggregateExec: mode=Single, gby=[], aggr=[sum(tbl.val), count(tbl.val)]
04)------DataSourceExec: partitions=1, partition_sizes=[2]
statement ok
DROP TABLE IF EXISTS tbl;
The "rewritten" form returns floats for some reason? Not sure what that is about.
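For reference, the arithmetic in the test above can be checked with a small plain-Rust sketch. It is not DataFusion code, and the function names `sum_of_shifted` and `shifted_sum` are hypothetical. It assumes the values are widened before any addition, as the `CAST(tbl.val AS Int64)` in the first plan suggests, and only checks the numeric values, not the result types that differ between the two queries.

```rust
/// Original form: widen each UInt32 value to i64 (mirroring the CAST in the
/// first plan), add the scalar to every row, then sum.
fn sum_of_shifted(vals: &[u32], scalar: i64) -> i64 {
    vals.iter().map(|&v| i64::from(v) + scalar).sum()
}

/// Rewritten form: sum the raw values in a wider type (u64 here, an
/// assumption for this sketch), then add scalar * count once.
fn shifted_sum(vals: &[u32], scalar: u64) -> u64 {
    let sum: u64 = vals.iter().map(|&v| u64::from(v)).sum();
    sum + scalar * vals.len() as u64
}

fn main() {
    // The two rows inserted by the test: u32::MAX twice.
    let vals = [4294967295u32, 4294967295u32];

    // Adding 1 in the column's own 32-bit type would overflow.
    assert_eq!(vals[0].checked_add(1), None);

    println!("{} {}", sum_of_shifted(&vals, 1), sum_of_shifted(&vals, 2));
    println!("{} {}", shifted_sum(&vals, 1), shifted_sum(&vals, 2));
}
```

Both forms agree on the values reported by the test; the sketch says nothing about why the rewritten SQL is typed differently.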
Contributor
Author
Thank you -- I will investigate
Draft until:
Which issue does this PR close?
SUM(..)clauses #15524
Rationale for this change
I want DataFusion to be the fastest Parquet engine on ClickBench. One of the queries where DataFusion is significantly slower is Query 29, which has a very strange pattern of many aggregate functions that are offset by a constant:
datafusion/benchmarks/queries/clickbench/queries/q29.sql
Line 4 in 0ca9d65
This is not a pattern I have ever seen in a real query, but it seems like the engine currently at the top of the ClickBench leaderboard has a special case for this pattern. See
SUM(..)clauses #15524
Thus I reluctantly conclude that we should have one too.
What changes are included in this PR?
SUM(expr + scalar) --> SUM(expr) + scalar*COUNT(expr)
This is implemented as an AggregateUDF::simplify rule, as discussed on #20180 (comment) and suggested by @UBarney.
Note there are quite a few other ideas to potentially make this more general on #15524, but I am going with the simple thing of making it work for the use case we have in hand (ClickBench).
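As a rough illustration of why the rewrite uses COUNT(expr) rather than COUNT(*), here is a minimal plain-Rust sketch of the identity over a nullable column. It is not the code in this PR: the function names `sum_expr_plus_scalar` and `rewritten` are hypothetical, SQL NULL is modeled as `None`, and overflow is ignored.

```rust
/// SUM(expr + scalar): NULL + scalar is NULL, and SUM skips NULLs,
/// so only non-null rows contribute. Returns None (NULL) if no row does.
fn sum_expr_plus_scalar(rows: &[Option<i64>], scalar: i64) -> Option<i64> {
    let mut acc: Option<i64> = None;
    for v in rows.iter().flatten() {
        acc = Some(acc.unwrap_or(0) + v + scalar);
    }
    acc
}

/// SUM(expr) + scalar * COUNT(expr): COUNT(expr) counts only non-null rows,
/// which is exactly how many times the scalar was added in the original form.
fn rewritten(rows: &[Option<i64>], scalar: i64) -> Option<i64> {
    let non_null: Vec<i64> = rows.iter().flatten().copied().collect();
    if non_null.is_empty() {
        // SUM(expr) is NULL here, and NULL + anything is NULL.
        return None;
    }
    let sum: i64 = non_null.iter().sum();
    let count = non_null.len() as i64;
    Some(sum + scalar * count)
}

fn main() {
    let rows = [Some(1), None, Some(5)];
    // Both forms add the scalar twice, once per non-null row.
    assert_eq!(sum_expr_plus_scalar(&rows, 3), rewritten(&rows, 3));
    // With only NULL input, both forms stay NULL.
    assert_eq!(sum_expr_plus_scalar(&[None, None], 3), rewritten(&[None, None], 3));
}
```

Multiplying by the total row count instead would add the scalar once for the NULL row as well, so the two forms would disagree whenever the column contains NULLs.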
Are these changes tested?
Yes, new tests are added
Are there any user-facing changes?
Faster performance