perf: improve approx_distinct performance 100x when there are fewer distinct values with many groups by haohuaijin · Pull Request #22768 · apache/datafusion

haohuaijin · 2026-06-05T02:12:36Z

Which issue does this PR close?

Closes improve performance for apporx_distinct when each group do no have many distinct value #22767

Rationale for this change

approx_distinct is very slow with GROUP BY on high-cardinality keys.

On one file from the dataset described in #22767 (~3.9M rows, ~512K groups):

SELECT client_ip, approx_distinct(trace_id) AS cnt
FROM '*.parquet'
GROUP BY client_ip
ORDER BY cnt DESC LIMIT 10;

DataFusion: ~32.6s
DuckDB (approx_count_distinct): ~0.1s

The reason is that approx_distinct only implemented Accumulator, not GroupsAccumulator. So grouped queries fell back to GroupsAccumulatorAdapter, which allocates a full 16 KiB HyperLogLog per group (~8 GB for 512K groups) and re-slices the input per group on every batch — even though most groups only see a few distinct values.

What changes are included in this PR?

Add a dedicated GroupsAccumulator for approx_distinct that processes each batch in a single pass (no per-group slicing or dynamic dispatch).
Use an adaptive per-group sketch: keep a small list of hashes (sparse) and only switch to a dense 16 KiB HyperLogLog after 256 distinct values. This cuts memory and keeps the partial state small. The dense format stays compatible with the existing scalar accumulator.
Add count_from_hashes so small groups are estimated directly from their stored hashes, avoiding a 16 KiB alloc + scan per group at output time.
Hashing matches the existing per-type scalar accumulators, so results are unchanged. Boolean / small-int / Null keep using the old path.

Result on the query above: ~32.6s → ~0.12s (~270x, on par with DuckDB), with identical output.
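As a rough illustration of the adaptive per-group sketch described above, here is a minimal, self-contained sketch. It is not the code from this PR: `GroupSketch`, `update_register`, `sparse_distinct`, and the register bit layout are assumptions made for illustration, and only the 16 KiB dense size and the 256-value promotion threshold are taken from the description.

```rust
/// 2^14 one-byte registers = 16 KiB, the dense size quoted in the description.
const PRECISION: u32 = 14;
const NUM_REGISTERS: usize = 1 << PRECISION;
/// Promotion threshold quoted in the description.
const SPARSE_LIMIT: usize = 256;

/// Hypothetical per-group state (illustrative names, not the PR's types).
enum GroupSketch {
    /// Raw hashes, possibly with duplicates until the next compaction.
    Sparse(Vec<u64>),
    /// One register per bucket.
    Dense(Vec<u8>),
}

/// Textbook HyperLogLog register update; the real bit layout may differ.
fn update_register(registers: &mut [u8], hash: u64) {
    let index = (hash >> (64 - PRECISION)) as usize;
    let rest = hash << PRECISION;
    // Rank = position of the first set bit in the remaining bits, capped so
    // that an all-zero remainder stays in range.
    let rank = (rest.leading_zeros() + 1).min(64 - PRECISION + 1) as u8;
    registers[index] = registers[index].max(rank);
}

impl GroupSketch {
    fn new() -> Self {
        GroupSketch::Sparse(Vec::new())
    }

    fn add_hash(&mut self, hash: u64) {
        let promoted = match self {
            GroupSketch::Dense(registers) => {
                update_register(registers, hash);
                None
            }
            GroupSketch::Sparse(hashes) => {
                hashes.push(hash);
                // Compact lazily: only sort and dedup once the buffer overflows.
                if hashes.len() > SPARSE_LIMIT {
                    hashes.sort_unstable();
                    hashes.dedup();
                }
                // Still too many distinct hashes: replay them into dense registers.
                if hashes.len() > SPARSE_LIMIT {
                    let mut registers = vec![0u8; NUM_REGISTERS];
                    for &h in hashes.iter() {
                        update_register(&mut registers, h);
                    }
                    Some(registers)
                } else {
                    None
                }
            }
        };
        if let Some(registers) = promoted {
            *self = GroupSketch::Dense(registers);
        }
    }

    fn is_dense(&self) -> bool {
        matches!(self, GroupSketch::Dense(_))
    }

    /// Number of distinct stored hashes, or `None` once the group is dense.
    fn sparse_distinct(&self) -> Option<usize> {
        match self {
            GroupSketch::Sparse(hashes) => {
                let mut sorted = hashes.clone();
                sorted.sort_unstable();
                sorted.dedup();
                Some(sorted.len())
            }
            GroupSketch::Dense(_) => None,
        }
    }
}
```

A group that only ever sees a handful of hashes stays a short `Vec<u64>`, so the 16 KiB allocation is paid only by groups that actually cross the threshold.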

Are these changes tested?

Yes.

New unit tests for the per-group sketch (sparse/dense, promotion, serialize/merge round-trip, merging groups, empty groups), checked against a dense-fold reference.
New aggregate.slt cases: grouped approx_distinct over Utf8, Utf8View, and Int32 (small groups are exact), null-only groups (= 0), and a sparse→dense case (2000 distinct/group, within HyperLogLog error).
Existing aggregate.slt and aggregate_skip_partial.slt still pass; clippy and fmt are clean.

Are there any user-facing changes?

No API or result changes — only a large speedup for approx_distinct with GROUP BY on high-cardinality keys.

haohuaijin · 2026-06-05T02:15:50Z

~~I'm currently trying to submit a parquet file or benchmark to reproduce the result~~
added benchmark in 9660fc0

benchmark result

╰─$ critcmp main new
group                                            main                                       22768
-----                                            ----                                       ---
approx_distinct_grouped/Int64 50000 groups       101.11   1723.0±22.12ms        ? ?/sec     1.00     17.0±0.25ms        ? ?/sec
approx_distinct_grouped/Utf8 50000 groups         96.08   1744.4±38.15ms        ? ?/sec     1.00     18.2±0.74ms        ? ?/sec
approx_distinct_grouped/Utf8View 50000 groups    101.45   1724.4±17.53ms        ? ?/sec     1.00     17.0±0.12ms        ? ?/sec

kosiew

@haohuaijin
Thanks for the optimization here. I think there is one nullable filter case that needs to be fixed before this can land. I also left a few smaller suggestions around consistency and malformed state handling.

kosiew · 2026-06-05T07:01:41Z

+                delta += groups[group_indices[row]].add_hash(hash);
+            }),
+            Some(filter) => H::for_each_hash(values[0].as_ref(), |row, hash| {
+                if filter.value(row) {


I think this fast path needs to treat a NULL aggregate filter the same as false.

The generic adapter handled this through Arrow filter, but this path only checks filter.value(row). For a nullable boolean filter, the value bit can still be true on a null row, which would incorrectly add that row to approx_distinct.

Could we gate on validity too, for example filter.is_valid(row) && filter.value(row)? It would also be great to add a grouped approx_distinct(...) FILTER (WHERE nullable_bool) regression test with a null filter row.
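To make the concern concrete, here is a small sketch that does not depend on Arrow. `NullableBools` is a hypothetical stand-in for a nullable boolean array that stores value bits and validity bits separately; the two helper functions are likewise only illustrative.

```rust
/// Hypothetical stand-in for a nullable boolean array: the value bit and the
/// validity bit of each row are stored independently.
struct NullableBools {
    values: Vec<bool>,
    validity: Vec<bool>,
}

impl NullableBools {
    fn value(&self, row: usize) -> bool {
        self.values[row]
    }

    fn is_valid(&self, row: usize) -> bool {
        self.validity[row]
    }

    fn len(&self) -> usize {
        self.values.len()
    }
}

/// Checks only the value bit, so a NULL row whose value bit happens to be set
/// is selected by mistake.
fn rows_value_only(filter: &NullableBools) -> Vec<usize> {
    (0..filter.len()).filter(|&row| filter.value(row)).collect()
}

/// Treats NULL the same as false by gating on validity first.
fn rows_valid_and_true(filter: &NullableBools) -> Vec<usize> {
    (0..filter.len())
        .filter(|&row| filter.is_valid(row) && filter.value(row))
        .collect()
}
```

With values `[true, true, false]` and validity `[true, false, true]`, the value-only check selects rows 0 and 1, while the gated check selects only row 0.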


kosiew · 2026-06-05T07:01:42Z

+        let states = downcast_value!(values[0], BinaryArray);
+        let mut delta: isize = 0;
+        for (row, &group_index) in group_indices.iter().enumerate() {
+            if let Some(filter) = opt_filter


Same nullable filter concern here if merge_batch continues to accept opt_filter.

Could we skip rows when filter.is_null(row) || !filter.value(row)? If final-stage filters are not expected here, another option would be to assert opt_filter.is_none(), similar to some other aggregate implementations.


kosiew · 2026-06-05T07:01:42Z

+/// Returns true for the data types backed by the HyperLogLog
+/// [`HllGroupsAccumulator`]. The fixed-domain types (booleans / small ints) and
+/// `Null` fall back to the per-group [`Accumulator`] path.
+fn is_hll_groups_type(data_type: &DataType) -> bool {


is_hll_groups_type looks a little broader than create_groups_accumulator. For example, it allows Time32(_) and Time64(_), while creation only accepts specific valid units.

Could we make this predicate exactly match the creation logic, or derive both from a shared helper? That would avoid groups_accumulator_supported() returning true for a type that creation later rejects.
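One possible shape for the shared helper, sketched with hypothetical stand-in types rather than the real Arrow `DataType` or the PR's accumulator types. The only fact relied on is that Arrow's `Time32` is valid with second or millisecond units and `Time64` with microsecond or nanosecond units; `HllKind` and `hll_kind` are invented for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Hypothetical, heavily reduced stand-in for Arrow's `DataType`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType {
    Boolean,
    Int64,
    Utf8,
    Time32(TimeUnit),
    Time64(TimeUnit),
}

/// Hypothetical tag for the accumulator that would be built.
#[derive(Debug, PartialEq)]
enum HllKind {
    Int64,
    Utf8,
    Time32Second,
    Time32Millisecond,
    Time64Microsecond,
    Time64Nanosecond,
}

/// Single source of truth: both the predicate and the constructor call this.
fn hll_kind(data_type: &DataType) -> Option<HllKind> {
    match data_type {
        DataType::Int64 => Some(HllKind::Int64),
        DataType::Utf8 => Some(HllKind::Utf8),
        DataType::Time32(TimeUnit::Second) => Some(HllKind::Time32Second),
        DataType::Time32(TimeUnit::Millisecond) => Some(HllKind::Time32Millisecond),
        DataType::Time64(TimeUnit::Microsecond) => Some(HllKind::Time64Microsecond),
        DataType::Time64(TimeUnit::Nanosecond) => Some(HllKind::Time64Nanosecond),
        // Booleans and invalid unit combinations use the fallback path.
        _ => None,
    }
}

fn is_hll_groups_type(data_type: &DataType) -> bool {
    hll_kind(data_type).is_some()
}

fn create_groups_accumulator(data_type: &DataType) -> Result<HllKind, String> {
    hll_kind(data_type).ok_or_else(|| format!("unsupported type: {:?}", data_type))
}
```

Because the predicate is defined in terms of the same match as creation, the two cannot drift apart when a type is added or removed.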

kosiew · 2026-06-05T07:01:42Z

+        } else {
+            // capacity is unchanged by sort/dedup
+            0
+        }


merge_serialized should probably reject sparse states whose length is not a multiple of 8 in release builds too.

Right now this is only covered by debug_assert_eq!, and chunks_exact would silently drop trailing bytes if a malformed state reaches this boundary.
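A minimal sketch of what a release-mode check could look like. `decode_sparse_state` is a hypothetical helper, and the little-endian encoding and `String` error type are assumptions, not the PR's actual serialization format or error handling.

```rust
/// Decode a sparse state into its hashes, rejecting malformed input instead of
/// relying on `debug_assert_eq!`.
///
/// Assumption: each hash is stored as 8 little-endian bytes.
fn decode_sparse_state(bytes: &[u8]) -> Result<Vec<u64>, String> {
    const WIDTH: usize = std::mem::size_of::<u64>();
    if bytes.len() % WIDTH != 0 {
        // Without this check, `chunks_exact` would silently drop the tail.
        return Err(format!(
            "sparse state length {} is not a multiple of {}",
            bytes.len(),
            WIDTH
        ));
    }
    Ok(bytes
        .chunks_exact(WIDTH)
        .map(|chunk| {
            let mut buf = [0u8; WIDTH];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect())
}
```

A 17-byte input returns an error here, whereas `chunks_exact(8)` on its own would yield two hashes and ignore the last byte.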

2010YOUY01

Thank you. I think this design achieves a good balance between performance and simplicity.

My only concern is whether groups with null values are handled correctly (see comment); otherwise LGTM.

2010YOUY01 · 2026-06-05T07:48:00Z

+            let other: HyperLogLog<u8> = bytes.try_into()?;
+            Ok(self.merge_dense(&other))
+        } else {
+            debug_assert_eq!(bytes.len() % size_of::<u64>(), 0);


I suggest also asserting here that the serialized size does not exceed the sparse limit.

2010YOUY01 · 2026-06-05T07:49:15Z

+/// fallback for high-cardinality `GROUP BY`s: it processes the whole input in a
+/// single vectorized pass (no per-group `take`/slice and no dynamic dispatch),
+/// and the sparse representation avoids allocating a 16 KiB sketch for every
+/// group when most groups only see a few distinct values.


Suggested change

-/// group when most groups only see a few distinct values.
+/// group when most groups only see a few distinct values.
+///
+///
+/// # Example
+///
+/// For `SELECT k, approx_distinct(v) FROM t GROUP BY k`, each group owns one
+/// independent sketch:
+///
+/// ```text
+/// group   state
+/// a       Sparse([h1, h2, h3, h2])
+/// b       Dense(HLL registers)
+/// ...
+/// ```
+///
+/// Group `a` has fewer than [`SPARSE_LIMIT`] distinct hashes, so it stays in
+/// the sparse representation. Before emitting state or estimating the count, the
+/// hash list is sorted and deduplicated to `[h1, h2, h3]`, then those hashes are
+/// interpreted exactly as if they had been added to a dense [`HyperLogLog`].
+///
+/// Group `b` has crossed the sparse limit, so its hashes have already been
+/// replayed into a dense sketch. New values for `b` update the dense registers
+/// directly, and serialized state is the raw [`NUM_REGISTERS`]-byte register
+/// array.

2010YOUY01 · 2026-06-05T07:52:48Z

+                delta += groups[group_indices[row]].add_hash(hash);
+            }),
+            Some(filter) => H::for_each_hash(values[0].as_ref(), |row, hash| {
+                if filter.value(row) {


Nulls are checked inside for_each_hash now.

But I think we should first fold (bitwise or) the nulls array and opt_filter, then process row by row. That would make for_each_hash simpler, and also faster.

I vaguely remember there is an existing utility function or pattern for this in other GroupsAccumulator implementations.
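A rough sketch of the fold-first idea using plain `u64` bitmap words instead of Arrow buffers. Both function names are hypothetical. Note that the masks below are validity bitmaps (a set bit means "valid"), so the combination is a bitwise AND; with null bitmaps (a set bit means "null") the equivalent fold would be a bitwise OR.

```rust
/// Combine the value validity, the filter validity, and the filter values into
/// one selection mask, 64 rows per word.
fn selection_mask(
    value_validity: &[u64],
    filter_validity: &[u64],
    filter_values: &[u64],
) -> Vec<u64> {
    value_validity
        .iter()
        .zip(filter_validity)
        .zip(filter_values)
        .map(|((v, fv), f)| *v & *fv & *f)
        .collect()
}

/// Visit every selected row once, skipping unselected rows a word at a time.
fn for_each_selected(mask: &[u64], len: usize, mut visit: impl FnMut(usize)) {
    for (word_index, &word) in mask.iter().enumerate() {
        let mut bits = word;
        while bits != 0 {
            let row = word_index * 64 + bits.trailing_zeros() as usize;
            if row < len {
                visit(row);
            }
            // Clear the lowest set bit.
            bits &= bits - 1;
        }
    }
}
```

With the mask built once per batch, the per-row closure no longer needs to consult either the null buffer or the filter.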

2010YOUY01 · 2026-06-05T08:00:07Z

+        let states = downcast_value!(values[0], BinaryArray);
+        let mut delta: isize = 0;
+        for (row, &group_index) in group_indices.iter().enumerate() {
+            if let Some(filter) = opt_filter


I am also a bit confused by this API: what are the semantics of opt_filter? In update_filter it's quite obvious: it refers to the filter in the original query, like avg(x) filter x>0. Here, can we assume it is always None, and that groups with a Null state are encoded in the values null mask?

I can't find this documented on the trait; it seems to depend on the implementation for now. I suggest we figure out the existing practice and update the doc (in a follow-up PR, for sure).

2010YOUY01 · 2026-06-05T08:03:22Z

+        for g in groups.iter_mut() {
+            freed += g.heap_bytes();
+            g.serialize(&mut scratch);
+            builder.append_value(&scratch);


I'm wondering how Null is handled here.
For example, with select k, approx_distinct(v) ..., if the only row for group key a is NULL, should we return NULL as the count for group 'a'?

improve approx_distinct for small value

6030334

github-actions Bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Jun 5, 2026

haohuaijin changed the title ~~improve approx_distinct for small value~~ perf: improve approx_distinct for small value Jun 5, 2026

haohuaijin changed the title ~~perf: improve approx_distinct for small value~~ perf: improve approx_distinct performance when there are fewer distinct values Jun 5, 2026

hengfeiyang approved these changes Jun 5, 2026

add benchmark

9660fc0

haohuaijin changed the title ~~perf: improve approx_distinct performance when there are fewer distinct values~~ perf: improve approx_distinct performance 100x when there are fewer distinct values Jun 5, 2026

haohuaijin changed the title ~~perf: improve approx_distinct performance 100x when there are fewer distinct values~~ perf: improve approx_distinct performance 100x when there are fewer distinct values with many groups Jun 5, 2026

haohuaijin added 2 commits June 5, 2026 11:25

update

5a22033

update test case

0d66853

kosiew requested changes Jun 5, 2026

2010YOUY01 reviewed Jun 5, 2026

