Skip to content

Conversation

@rluvaton
Copy link
Member

@rluvaton rluvaton commented Dec 28, 2025

Which issue does this PR close?

N/A

Rationale for this change

grouping on large amount of data is very slow

What changes are included in this PR?

used my fork of hashbrown with prefetching and added prefetching when map is large

Are these changes tested?

No

Are there any user-facing changes?

No


Related hashbrown issue for allowing prefetching:

My hashbrown changes:

my hashbrown changes are not very safe and probably have bugs but just to make it quick and dirty to see the benefit

Command:

The optimizer can optimize the group by away in this case but it is just a cheap way to have a table and do an multi aggregate on it

datafusion-cli --command "select value from range(0,100000000) group by value, value + 1;"

When the hashmap is smaller, the prefetching is unneeded as it should already fit in the cache

with the following config:

The configs are meant for making it almost single threaded so we can push the single thread to its limit

set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;

-- Adding this
set datafusion.execution.agg_prefetch_elements=0; -- 0 is disabled 
set datafusion.execution.agg_prefetch_locality=3; -- no affect if agg_prefetch_elements is 0 
set datafusion.execution.agg_prefetch_read=false; -- no affect if agg_prefetch_elements is 0

Without prefetch: 16.002 seconds
With prefetch: 10.176 seconds

Metric Before After Improvement
LLC-load-misses 91.9M 25.7M 72% reduction
LLC-loads 202M 101.7M 50% reduction
Cycles 62.4B 39.8B 36% reduction
IPC 0.49 0.80 63% improvement

Env

Machine: c5.metal

$ ./neofetch
             `-/oydNNdyo:.`                ec2-user@
      `.:+shmMMMMMMMMMMMMMMmhs+:.`         ------------------------------------------------------
    -+hNNMMMMMMMMMMMMMMMMMMMMMMNNho-       OS: Amazon Linux 2023.9.20251208 x86_64
.``      -/+shmNNMMMMMMNNmhs+/-      ``.   Host: Amazon EC2
dNmhs+:.       `.:/oo/:.`       .:+shmNd   Kernel: 6.1.158-180.294.amzn2023.x86_64
dMMMMMMMNdhs+:..        ..:+shdNMMMMMMMd   Uptime: 51 mins
dMMMMMMMMMMMMMMNds    odNMMMMMMMMMMMMMMd   Packages: 800 (rpm)
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd   Shell: bash 5.2.15
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd   Terminal: /dev/pts/1
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd   CPU: Intel Xeon Platinum 8275CL (96) @ 3.900GHz
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd   Memory: 2514MiB / 193043MiB
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
dMMMMMMMMMMMMMMMMh    yMMMMMMMMMMMMMMMMd
.:+ydNMMMMMMMMMMMh    yMMMMMMMMMMMNdy+:.
     `.:+shNMMMMMh    yMMMMMNhs+:``
            `-+shy    shs+:`

$ ./cpufetch --verbose
Name:                Intel Xeon Platinum 8275CL
Microarchitecture:   Cascade Lake
Technology:          14nm
Max Frequency:       3.900 GHz
Sockets:             2
Cores:               24 cores (48 threads)
Cores (Total):       48 cores (96 threads)
AVX:                 AVX,AVX2,AVX512
FMA:                 FMA3
L1i Size:            32KB (1.5MB Total)
L1d Size:            32KB (1.5MB Total)
L2 Size:             1MB (48MB Total)
L3 Size:             35.75MB (71.5MB Total)
Peak Performance:    11.98 TFLOP/s
Results

Without prefetch

perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses datafusion-cli --command "
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
set datafusion.execution.agg_prefetch_elements=0;
set datafusion.execution.agg_prefetch_locality=3;
set datafusion.execution.agg_prefetch_read=false;
select value from range(0,100000000) group by value, value + 1;
"
DataFusion CLI v51.0.0
0 row(s) fetched.
Elapsed 0.001 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

+-------+
| value |
+-------+
| 0     |
| 1     |
| 2     |
| 3     |
| 4     |
| 5     |
| 6     |
| 7     |
| 8     |
| 9     |
| 10    |
| 11    |
| 12    |
| 13    |
| 14    |
| 15    |
| 16    |
| 17    |
| 18    |
| 19    |
| 20    |
| 21    |
| 22    |
| 23    |
| 24    |
| 25    |
| 26    |
| 27    |
| 28    |
| 29    |
| 30    |
| 31    |
| 32    |
| 33    |
| 34    |
| 35    |
| 36    |
| 37    |
| 38    |
| 39    |
| .     |
| .     |
| .     |
+-------+
100000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 16.002 seconds.


 Performance counter stats for 'datafusion-cli --command
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
set datafusion.execution.agg_prefetch_elements=0;
set datafusion.execution.agg_prefetch_locality=3;
set datafusion.execution.agg_prefetch_read=false;
select value from range(0,100000000) group by value, value + 1;
':

       62399336707      cycles                                                               (62.64%)
       30518679380      instructions                     #    0.49  insn per cycle           (75.11%)
         589924410      cache-references                                                     (75.11%)
         327996655      cache-misses                     #   55.600 % of all cache refs      (75.11%)
        6616784604      L1-dcache-loads                                                      (75.11%)
         880862425      L1-dcache-load-misses            #   13.31% of all L1-dcache accesses  (75.11%)
         201998979      LLC-loads                                                            (49.81%)
          91945353      LLC-load-misses                  #   45.52% of all LL-cache accesses  (50.00%)

      16.036173799 seconds time elapsed

      14.416051000 seconds user
       1.664161000 seconds sys

With prefetch

$ perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses datafusion-cli --command "
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
set datafusion.execution.agg_prefetch_elements=1;
set datafusion.execution.agg_prefetch_locality=3;
set datafusion.execution.agg_prefetch_read=false;
select value from range(0,100000000) group by value, value + 1;
"
DataFusion CLI v51.0.0
0 row(s) fetched.
Elapsed 0.001 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

0 row(s) fetched.
Elapsed 0.000 seconds.

+-------+
| value |
+-------+
| 0     |
| 1     |
| 2     |
| 3     |
| 4     |
| 5     |
| 6     |
| 7     |
| 8     |
| 9     |
| 10    |
| 11    |
| 12    |
| 13    |
| 14    |
| 15    |
| 16    |
| 17    |
| 18    |
| 19    |
| 20    |
| 21    |
| 22    |
| 23    |
| 24    |
| 25    |
| 26    |
| 27    |
| 28    |
| 29    |
| 30    |
| 31    |
| 32    |
| 33    |
| 34    |
| 35    |
| 36    |
| 37    |
| 38    |
| 39    |
| .     |
| .     |
| .     |
+-------+
100000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 10.176 seconds.


 Performance counter stats for 'datafusion-cli --command
set datafusion.execution.coalesce_batches=false;
set datafusion.optimizer.repartition_aggregations=false;
set datafusion.optimizer.enable_round_robin_repartition=false;
set datafusion.execution.agg_prefetch_elements=1;
set datafusion.execution.agg_prefetch_locality=3;
set datafusion.execution.agg_prefetch_read=false;
select value from range(0,100000000) group by value, value + 1;
':

       39771220883      cycles                                                               (62.54%)
       31973887904      instructions                     #    0.80  insn per cycle           (75.06%)
         633535114      cache-references                                                     (75.18%)
         374760871      cache-misses                     #   59.154 % of all cache refs      (75.18%)
        6598313098      L1-dcache-loads                                                      (75.18%)
         920200907      L1-dcache-load-misses            #   13.95% of all L1-dcache accesses  (75.18%)
         101659922      LLC-loads                                                            (49.66%)
          25681158      LLC-load-misses                  #   25.26% of all LL-cache accesses  (49.84%)

      10.204965327 seconds time elapsed

       8.432703000 seconds user
       1.806306000 seconds sys

@rluvaton rluvaton marked this pull request as draft December 28, 2025 11:54
@github-actions github-actions bot added common Related to common crate physical-plan Changes to the physical-plan crate labels Dec 28, 2025
@rluvaton rluvaton added the performance Make DataFusion faster label Dec 28, 2025
@rluvaton
Copy link
Member Author

@alamb, @Dandandan you might be interested in this (not ready for review, just the general idea)

Comment on lines +653 to +655
pub agg_prefetch_elements: usize, default = 1
pub agg_prefetch_locality: usize, default = 3
pub agg_prefetch_read: bool, default = false
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added config to allow for easy tinkering while prototyping

Comment on lines +550 to +562
match self.agg_prefetch_elements {
0 => self.collect_vectorized_process_context_with_prefetch::<0>(batch_hashes, groups),
1 => self.collect_vectorized_process_context_with_prefetch::<1>(batch_hashes, groups),
2 => self.collect_vectorized_process_context_with_prefetch::<2>(batch_hashes, groups),
3 => self.collect_vectorized_process_context_with_prefetch::<3>(batch_hashes, groups),
4 => self.collect_vectorized_process_context_with_prefetch::<4>(batch_hashes, groups),
5 => self.collect_vectorized_process_context_with_prefetch::<5>(batch_hashes, groups),
6 => self.collect_vectorized_process_context_with_prefetch::<6>(batch_hashes, groups),
7 => self.collect_vectorized_process_context_with_prefetch::<7>(batch_hashes, groups),
8 => self.collect_vectorized_process_context_with_prefetch::<8>(batch_hashes, groups),
_ => self.collect_vectorized_process_context_with_prefetch::<8>(batch_hashes, groups),
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added all of these to make sure when I test specific config I'm not affected by the condition, will not have this later

}
}

#[inline(always)]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline always to make sure my change for extracting to a function is not affecting the prev case

Comment on lines +613 to +619
for i in 1..=PREFETCH {
if READ {
self.map.prefetch_read::<LOCALITY>(batch_hashes[row + i]);
} else {
self.map.prefetch_write::<LOCALITY>(batch_hashes[row + i]);
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefetch the same item multiple times but I should prefetch before the loop and then only prefetch the row + PREFETCH

@rluvaton
Copy link
Member Author

run benchmarks

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing add-prefetching (b2c9125) to 6ce2374 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@rluvaton
Copy link
Member Author

Not sure how much the existing benchmarks will show improvement as they are not testing a lot of unique values

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

Comparing HEAD and add-prefetching
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ add-prefetching ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2597.16 ms │      2596.57 ms │ no change │
│ QQuery 1     │  1050.22 ms │      1079.37 ms │ no change │
│ QQuery 2     │  2064.00 ms │      2054.74 ms │ no change │
│ QQuery 3     │  1164.05 ms │      1153.58 ms │ no change │
│ QQuery 4     │  2245.53 ms │      2233.96 ms │ no change │
│ QQuery 5     │ 28134.78 ms │     28343.58 ms │ no change │
│ QQuery 6     │  3910.45 ms │      3842.06 ms │ no change │
│ QQuery 7     │  3673.88 ms │      3548.20 ms │ no change │
└──────────────┴─────────────┴─────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 44840.07ms │
│ Total Time (add-prefetching)   │ 44852.06ms │
│ Average Time (HEAD)            │  5605.01ms │
│ Average Time (add-prefetching) │  5606.51ms │
│ Queries Faster                 │          0 │
│ Queries Slower                 │          0 │
│ Queries with No Change         │          8 │
│ Queries with Failure           │          0 │
└────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ add-prefetching ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.05 ms │         2.42 ms │  1.18x slower │
│ QQuery 1     │    51.44 ms │        50.51 ms │     no change │
│ QQuery 2     │   133.98 ms │       135.72 ms │     no change │
│ QQuery 3     │   152.62 ms │       148.08 ms │     no change │
│ QQuery 4     │  1067.74 ms │      1083.09 ms │     no change │
│ QQuery 5     │  1448.88 ms │      1555.15 ms │  1.07x slower │
│ QQuery 6     │     2.04 ms │         2.07 ms │     no change │
│ QQuery 7     │    53.63 ms │        60.52 ms │  1.13x slower │
│ QQuery 8     │  1436.39 ms │      1341.37 ms │ +1.07x faster │
│ QQuery 9     │  1839.98 ms │      1804.70 ms │     no change │
│ QQuery 10    │   343.10 ms │       343.72 ms │     no change │
│ QQuery 11    │   396.00 ms │       395.75 ms │     no change │
│ QQuery 12    │  1339.51 ms │      1318.11 ms │     no change │
│ QQuery 13    │  1969.60 ms │      1948.27 ms │     no change │
│ QQuery 14    │  1241.02 ms │      1187.87 ms │     no change │
│ QQuery 15    │  1215.91 ms │      1197.05 ms │     no change │
│ QQuery 16    │  2530.44 ms │      2419.33 ms │     no change │
│ QQuery 17    │  2493.16 ms │      2383.77 ms │     no change │
│ QQuery 18    │  5085.46 ms │      4446.03 ms │ +1.14x faster │
│ QQuery 19    │   120.72 ms │       121.58 ms │     no change │
│ QQuery 20    │  1922.37 ms │      1876.68 ms │     no change │
│ QQuery 21    │  2217.08 ms │      2184.14 ms │     no change │
│ QQuery 22    │  3772.43 ms │      3736.04 ms │     no change │
│ QQuery 23    │ 12214.15 ms │     12101.57 ms │     no change │
│ QQuery 24    │   214.88 ms │       212.32 ms │     no change │
│ QQuery 25    │   467.60 ms │       445.38 ms │     no change │
│ QQuery 26    │   215.69 ms │       204.74 ms │ +1.05x faster │
│ QQuery 27    │  2723.32 ms │      2683.31 ms │     no change │
│ QQuery 28    │ 21687.73 ms │     21800.58 ms │     no change │
│ QQuery 29    │   950.17 ms │       964.19 ms │     no change │
│ QQuery 30    │  1324.81 ms │      1291.80 ms │     no change │
│ QQuery 31    │  1346.07 ms │      1340.11 ms │     no change │
│ QQuery 32    │  5013.15 ms │      4309.69 ms │ +1.16x faster │
│ QQuery 33    │  5852.38 ms │      5541.85 ms │ +1.06x faster │
│ QQuery 34    │  5982.69 ms │      5631.92 ms │ +1.06x faster │
│ QQuery 35    │  1897.79 ms │      1803.74 ms │     no change │
│ QQuery 36    │    67.22 ms │        64.60 ms │     no change │
│ QQuery 37    │    45.20 ms │        44.23 ms │     no change │
│ QQuery 38    │    66.21 ms │        66.11 ms │     no change │
│ QQuery 39    │   102.91 ms │       102.70 ms │     no change │
│ QQuery 40    │    26.75 ms │        26.33 ms │     no change │
│ QQuery 41    │    23.86 ms │        23.56 ms │     no change │
│ QQuery 42    │    20.13 ms │        19.24 ms │     no change │
└──────────────┴─────────────┴─────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary              ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 91078.26ms │
│ Total Time (add-prefetching)   │ 88419.97ms │
│ Average Time (HEAD)            │  2118.10ms │
│ Average Time (add-prefetching) │  2056.28ms │
│ Queries Faster                 │          6 │
│ Queries Slower                 │          3 │
│ Queries with No Change         │         34 │
│ Queries with Failure           │          0 │
└────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ add-prefetching ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 119.60 ms │       112.35 ms │ +1.06x faster │
│ QQuery 2     │  28.30 ms │        28.64 ms │     no change │
│ QQuery 3     │  37.32 ms │        36.11 ms │     no change │
│ QQuery 4     │  28.46 ms │        28.60 ms │     no change │
│ QQuery 5     │  85.59 ms │        87.68 ms │     no change │
│ QQuery 6     │  19.77 ms │        19.80 ms │     no change │
│ QQuery 7     │ 230.49 ms │       228.34 ms │     no change │
│ QQuery 8     │  38.10 ms │        35.78 ms │ +1.06x faster │
│ QQuery 9     │ 109.67 ms │       105.37 ms │     no change │
│ QQuery 10    │  61.56 ms │        62.73 ms │     no change │
│ QQuery 11    │  18.42 ms │        18.51 ms │     no change │
│ QQuery 12    │  50.05 ms │        51.47 ms │     no change │
│ QQuery 13    │  47.45 ms │        47.16 ms │     no change │
│ QQuery 14    │  13.64 ms │        13.52 ms │     no change │
│ QQuery 15    │  24.51 ms │        24.17 ms │     no change │
│ QQuery 16    │  24.21 ms │        24.98 ms │     no change │
│ QQuery 17    │ 150.45 ms │       146.03 ms │     no change │
│ QQuery 18    │ 276.08 ms │       278.31 ms │     no change │
│ QQuery 19    │  37.83 ms │        37.81 ms │     no change │
│ QQuery 20    │  49.01 ms │        51.00 ms │     no change │
│ QQuery 21    │ 309.27 ms │       308.25 ms │     no change │
│ QQuery 22    │  17.65 ms │        16.99 ms │     no change │
└──────────────┴───────────┴─────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary              ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)              │ 1777.43ms │
│ Total Time (add-prefetching)   │ 1763.63ms │
│ Average Time (HEAD)            │   80.79ms │
│ Average Time (add-prefetching) │   80.16ms │
│ Queries Faster                 │         2 │
│ Queries Slower                 │         0 │
│ Queries with No Change         │        20 │
│ Queries with Failure           │         0 │
└────────────────────────────────┴───────────┘

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate performance Make DataFusion faster physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants