Skip to content

Rewrite SUM(expr + scalar) --> SUM(expr) + scalar*COUNT(expr) #20665

Draft
alamb wants to merge 5 commits intoapache:mainfrom
alamb:alamb/clickbenchmaxxing
Draft

Rewrite SUM(expr + scalar) --> SUM(expr) + scalar*COUNT(expr) #20665
alamb wants to merge 5 commits intoapache:mainfrom
alamb:alamb/clickbenchmaxxing

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Mar 3, 2026

Draft until:

Which issue does this PR close?

Rationale for this change

I want DataFusion to be the fastest paruqet engine on ClickBench. One of the queries where DataFusion is significantly slower is Query 29 which has a very strange pattern of many aggregate functions that are offset by a constant:

SELECT SUM("ResolutionWidth"), SUM("ResolutionWidth" + 1), SUM("ResolutionWidth" + 2), SUM("ResolutionWidth" + 3), SUM("ResolutionWidth" + 4), SUM("ResolutionWidth" + 5), SUM("ResolutionWidth" + 6), SUM("ResolutionWidth" + 7), SUM("ResolutionWidth" + 8), SUM("ResolutionWidth" + 9), SUM("ResolutionWidth" + 10), SUM("ResolutionWidth" + 11), SUM("ResolutionWidth" + 12), SUM("ResolutionWidth" + 13), SUM("ResolutionWidth" + 14), SUM("ResolutionWidth" + 15), SUM("ResolutionWidth" + 16), SUM("ResolutionWidth" + 17), SUM("ResolutionWidth" + 18), SUM("ResolutionWidth" + 19), SUM("ResolutionWidth" + 20), SUM("ResolutionWidth" + 21), SUM("ResolutionWidth" + 22), SUM("ResolutionWidth" + 23), SUM("ResolutionWidth" + 24), SUM("ResolutionWidth" + 25), SUM("ResolutionWidth" + 26), SUM("ResolutionWidth" + 27), SUM("ResolutionWidth" + 28), SUM("ResolutionWidth" + 29), SUM("ResolutionWidth" + 30), SUM("ResolutionWidth" + 31), SUM("ResolutionWidth" + 32), SUM("ResolutionWidth" + 33), SUM("ResolutionWidth" + 34), SUM("ResolutionWidth" + 35), SUM("ResolutionWidth" + 36), SUM("ResolutionWidth" + 37), SUM("ResolutionWidth" + 38), SUM("ResolutionWidth" + 39), SUM("ResolutionWidth" + 40), SUM("ResolutionWidth" + 41), SUM("ResolutionWidth" + 42), SUM("ResolutionWidth" + 43), SUM("ResolutionWidth" + 44), SUM("ResolutionWidth" + 45), SUM("ResolutionWidth" + 46), SUM("ResolutionWidth" + 47), SUM("ResolutionWidth" + 48), SUM("ResolutionWidth" + 49), SUM("ResolutionWidth" + 50), SUM("ResolutionWidth" + 51), SUM("ResolutionWidth" + 52), SUM("ResolutionWidth" + 53), SUM("ResolutionWidth" + 54), SUM("ResolutionWidth" + 55), SUM("ResolutionWidth" + 56), SUM("ResolutionWidth" + 57), SUM("ResolutionWidth" + 58), SUM("ResolutionWidth" + 59), SUM("ResolutionWidth" + 60), SUM("ResolutionWidth" + 61), SUM("ResolutionWidth" + 62), SUM("ResolutionWidth" + 63), SUM("ResolutionWidth" + 64), SUM("ResolutionWidth" + 65), SUM("ResolutionWidth" + 66), SUM("ResolutionWidth" + 67), SUM("ResolutionWidth" + 68), SUM("ResolutionWidth" + 69), SUM("ResolutionWidth" + 70), SUM("ResolutionWidth" + 71), SUM("ResolutionWidth" + 72), SUM("ResolutionWidth" + 73), SUM("ResolutionWidth" + 74), SUM("ResolutionWidth" + 75), SUM("ResolutionWidth" + 76), SUM("ResolutionWidth" + 77), SUM("ResolutionWidth" + 78), SUM("ResolutionWidth" + 79), SUM("ResolutionWidth" + 80), SUM("ResolutionWidth" + 81), SUM("ResolutionWidth" + 82), SUM("ResolutionWidth" + 83), SUM("ResolutionWidth" + 84), SUM("ResolutionWidth" + 85), SUM("ResolutionWidth" + 86), SUM("ResolutionWidth" + 87), SUM("ResolutionWidth" + 88), SUM("ResolutionWidth" + 89) FROM hits;

This is not a pattern I have ever seen in a real query, but it seems like the engine currently at the top of the ClickBench leaderboard has a special case for this pattern. See

Thus I reluctantly conclude that we should have one too.

What changes are included in this PR?

  1. Add a rewrite SUM(expr + scalar) --> SUM(expr) + scalar*COUNT(expr)
  2. Tests for same

This is implemented as a AggregateUDF::simplify rule as discussed on #20180 (comment) and suggested by @UBarney

Note there are quite a few other ideas to potentially make this more general on #15524 but I am going with the simple thing of making it work for the usecase we have in hand (ClickBench)

Are these changes tested?

Yes, new tests are added

Are there any user-facing changes?

Faster performance

@alamb alamb added the performance Make DataFusion faster label Mar 3, 2026
@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 3, 2026
@AdamGS
Copy link
Contributor

AdamGS commented Mar 3, 2026

I was curious about overflow behavior which I think is right, but I have noticed one difference (at least when going through SQL):

statement ok
CREATE TABLE IF NOT EXISTS tbl (val INTEGER UNSIGNED);

statement ok
INSERT INTO tbl VALUES (4294967295);

statement ok
INSERT INTO tbl VALUES (4294967295);

query II
SELECT SUM(val + 1), SUM(val + 2) FROM tbl;
----
8589934592 8589934594

query TT
EXPLAIN SELECT SUM(val + 1), SUM(val + 2)  FROM tbl;
----
logical_plan
01)Aggregate: groupBy=[[]], aggr=[[sum(__common_expr_1 AS tbl.val + Int64(1)), sum(__common_expr_1 AS tbl.val + Int64(2))]]
02)--Projection: CAST(tbl.val AS Int64) AS __common_expr_1
03)----TableScan: tbl projection=[val]
physical_plan
01)AggregateExec: mode=Single, gby=[], aggr=[sum(tbl.val + Int64(1)), sum(tbl.val + Int64(2))]
02)--ProjectionExec: expr=[CAST(val@0 AS Int64) as __common_expr_1]
03)----DataSourceExec: partitions=1, partition_sizes=[2]

query RR
SELECT SUM(val) + 1 * COUNT(val), SUM(val) + 2 * COUNT(val)  FROM tbl;
----
8589934592 8589934594

query TT
EXPLAIN SELECT SUM(val) + 1 * COUNT(val), SUM(val) + 2 * COUNT(val)  FROM tbl;
----
logical_plan
01)Projection: __common_expr_1 + CAST(count(tbl.val) AS Decimal128(20, 0)) AS sum(tbl.val) + Int64(1) * count(tbl.val), __common_expr_1 AS sum(tbl.val) + CAST(Int64(2) * count(tbl.val) AS Decimal128(20, 0))
02)--Projection: CAST(sum(tbl.val) AS Decimal128(20, 0)) AS __common_expr_1, count(tbl.val)
03)----Aggregate: groupBy=[[]], aggr=[[sum(CAST(tbl.val AS UInt64)), count(tbl.val)]]
04)------TableScan: tbl projection=[val]
physical_plan
01)ProjectionExec: expr=[__common_expr_1@0 + CAST(count(tbl.val)@1 AS Decimal128(20, 0)) as sum(tbl.val) + Int64(1) * count(tbl.val), __common_expr_1@0 + CAST(2 * count(tbl.val)@1 AS Decimal128(20, 0)) as sum(tbl.val) + Int64(2) * count(tbl.val)]
02)--ProjectionExec: expr=[CAST(sum(tbl.val)@0 AS Decimal128(20, 0)) as __common_expr_1, count(tbl.val)@1 as count(tbl.val)]
03)----AggregateExec: mode=Single, gby=[], aggr=[sum(tbl.val), count(tbl.val)]
04)------DataSourceExec: partitions=1, partition_sizes=[2]

statement ok
DROP TABLE IF EXISTS tbl;

The "rewritten" form returns floats for some reason? not sure what that is about

@alamb
Copy link
Contributor Author

alamb commented Mar 3, 2026

The "rewritten" form retruns floats for some reason? not sure what that is about

Thank you -- I will investigate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation logical-expr Logical plan and expressions performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make Clickbench Q29 5x faster for datafusion by extracting SUM(..) clauses

2 participants