Improve performance of approx_distinct when each group does not have many distinct values #22767

@haohuaijin

Description

Is your feature request related to a problem or challenge?

When I ran a test with the SQL below, the performance was not what I expected:

SELECT
  client_ip,
  approx_distinct(trace_id) AS cnt
FROM
  "*.parquet"
GROUP BY
  client_ip
ORDER BY
  cnt DESC
LIMIT 10

I have 100M rows and 0.5M unique client_ip values.
datafusion-cli needs 900s to get the result, but DuckDB needs only 3s.
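
This issue does not identify a cause or propose a fix. As an illustration only of the general idea in the title, here is a minimal Python sketch of one common technique for groups with few distinct values: keep an exact set of hashes while the group is small, and allocate fixed-size HyperLogLog-style registers only once the group grows past a threshold. The class name `HybridDistinctCounter`, the threshold of 64, and the precision of 12 are all hypothetical choices for the sketch. This is not DataFusion's or DuckDB's implementation.

```python
import hashlib
import math


class HybridDistinctCounter:
    """Hypothetical sketch: exact set for small groups, HLL-style registers for large ones."""

    def __init__(self, threshold=64, precision=12):
        self.threshold = threshold    # assumed cutoff, not taken from any real engine
        self.precision = precision    # 2**precision registers once upgraded
        self.small = set()            # exact 64-bit hashes while the group is small
        self.registers = None         # allocated lazily, only for large groups

    @staticmethod
    def _hash64(value):
        # Deterministic 64-bit hash of the value's string form.
        digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def _add_to_registers(self, h):
        bits = 64 - self.precision
        idx = h >> bits                       # top `precision` bits pick the register
        rest = h & ((1 << bits) - 1)          # remaining bits determine the rank
        rank = bits - rest.bit_length() + 1   # leading zeros in `rest`, plus one
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def add(self, value):
        h = self._hash64(value)
        if self.registers is None:
            self.small.add(h)
            if len(self.small) > self.threshold:
                # Upgrade: replay the stored hashes into freshly allocated registers.
                self.registers = bytearray(1 << self.precision)
                for stored in self.small:
                    self._add_to_registers(stored)
                self.small = None
        else:
            self._add_to_registers(h)

    def estimate(self):
        if self.registers is None:
            return len(self.small)            # exact while the group is small
        m = len(self.registers)
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return round(m * math.log(m / zeros))
        return round(raw)
```

With this layout, a group that sees only a handful of distinct `trace_id` values never allocates the register array, so per-group state stays proportional to the number of distinct values rather than to the sketch size. Whether this matches the actual bottleneck in the query above is an assumption that would need profiling to confirm.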

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Labels

enhancement (New feature or request)