[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

@alamb

Is your feature request related to a problem or challenge?

I am mostly writing this up to record what I think is ongoing work by @jayzhan211, @Rachelint, @korowa, and myself.

TL;DR: we are working on (and getting pretty close to) having DataFusion be the fastest single-node engine for querying parquet files in ClickBench.

Background:

https://benchmark.clickhouse.com/ shows the results of ClickBench

The benchmark itself is described here: https://github.com/ClickHouse/ClickBench. I am not personally interested in proprietary file formats that require special loading, which is why this issue focuses on the parquet results.

Here is the current leaderboard for partitioned parquet reflecting DataFusion 40.0.0:

(Screenshot of the ClickBench partitioned parquet leaderboard, captured 2024-10-08 at 4:45:16 PM)

Describe the solution you'd like

I would like DataFusion to be the fastest single-node engine for querying parquet files in ClickBench.

Describe alternatives you've considered

No response

Additional context

This is also inspired by @ozankabak's call to action on #11442.

The scripts to run ClickBench with DataFusion are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

Last update is here: ClickHouse/ClickBench#210
