Conversation

@conormccarter (Contributor) commented Nov 7, 2025

Resolves: #24

  1. Add Databricks benchmark script
  2. Add results for most Databricks SQL warehouse sizes
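For orientation, here is a minimal sketch of how such a benchmark script might assemble its connection settings from environment variables. The variable names are inferred from the `.env` example and questions later in this thread (`DATABRICKS_SCHEMA`, `DATABRICKS_PARQUET_LOCATION`, the hostname/HTTP-path/token settings) and may not match `benchmark.py` exactly.

```python
import os

# Hypothetical env-var names, inferred from the review discussion;
# the real benchmark.py may use different ones.
def connection_config(env=None):
    env = os.environ if env is None else env
    return {
        "server_hostname": env["DATABRICKS_HOST"],
        "http_path": env["DATABRICKS_HTTP_PATH"],
        "access_token": env["DATABRICKS_TOKEN"],
        # Defaults match the catalog/schema names seen later in the thread.
        "catalog": env.get("DATABRICKS_CATALOG", "clickbench_catalog"),
        "schema": env.get("DATABRICKS_SCHEMA", "clickbench_schema"),
    }
```

A dict like this could then be passed to `databricks.sql.connect(**connection_config())`.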

@rschu1ze: (this comment was marked as resolved)

@conormccarter: (this comment was marked as resolved)

@conormccarter conormccarter reopened this Nov 13, 2025
@rschu1ze (Member) left a comment:
I got a permission error when I try to push to this repository:

remote: Permission to prequel-co/ClickBench.git denied to rschu1ze.
fatal: unable to access 'https://github.com/prequel-co/ClickBench.git/': The requested URL returned error: 403

... therefore leaving some comments for now.

DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
@rschu1ze (Member):
Some questions here: I set my Databricks hostname, the Databricks HTTP path, the instance type (2X-Small for the free test version), and the token. I didn't touch the CATALOG and SCHEMA variables.

When I ran benchmark.sh, I got this:

Connecting to Databricks; loading the data into clickbench_catalog.clickbench_schema
[WARN] pyarrow is not installed by default since databricks-sql-connector 4.0.0,any arrow specific api (e.g. fetchmany_arrow) and cloud fetch will be disabled.If you need these features, please run pip install pyarrow or pip install databricks-sql-connector[pyarrow] to install
Creating table and loading data from s3://some/path/hits.parquet...
Traceback (most recent call last):
  File "/data/ClickBench/databricks/./benchmark.py", line 357, in <module>
    load_data(run_metadata)
  File "/data/ClickBench/databricks/./benchmark.py", line 289, in load_data
    cursor.execute(load_query)
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/telemetry/latency_logger.py", line 175, in wrapper
    result = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/client.py", line 1260, in execute
    self.active_result_set = self.backend.execute_command(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1058, in execute_command
    execute_response, has_more_rows = self._handle_execute_response(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1265, in _handle_execute_response
    final_operation_state = self._wait_until_command_done(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 957, in _wait_until_command_done
    self._check_command_not_in_error_or_closed_state(op_handle, poll_resp)
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 635, in _check_command_not_in_error_or_closed_state
    raise ServerOperationError(
databricks.sql.exc.ServerOperationError: [UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY] Unsupported data source type for direct query on files: parquet SQLSTATE: 0A000; line 109 pos 13
Attempt to close session raised a local exception: sys.meta_path is None, Python is likely shutting down

(Line 289 ran the INSERT statement; the prior CREATE TABLE was successful.)

Do you have an idea what went wrong? Do I need to set any other variables?

Oh, I should have mentioned as well that I set DATABRICKS_PARQUET_LOCATION to https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/hits_compatible/hits.parquet. Is this correct? If yes, I think we can hard-code it as well.

@conormccarter (Contributor Author):

Should work okay if you use the S3 URI (starting with "s3://"). Just updated the example to use that placeholder. Optionally, I could just remove it as a .env variable if that public s3 location is going to stick around.
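To make the scheme requirement explicit, here is a hedged sketch of a guard the loader could apply before issuing the INSERT. The supported-scheme set, function name, and SQL shape are assumptions for illustration, not the actual `benchmark.py` code.

```python
from urllib.parse import urlparse

# Cloud storage schemes that plausibly support direct parquet reads;
# an https:// URL is what triggered the error above. This set is an
# assumption, not taken from Databricks documentation verbatim.
SUPPORTED_SCHEMES = {"s3", "s3a", "abfss", "gs", "dbfs"}

def build_load_query(table, parquet_location):
    """Build the INSERT ... SELECT used to load the parquet file (sketch)."""
    scheme = urlparse(parquet_location).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(
            f"direct parquet reads need a cloud storage URI, got scheme {scheme!r}"
        )
    return f"INSERT INTO {table} SELECT * FROM parquet.`{parquet_location}`"
```

With a check like this, the https:// location would have failed fast with a readable message instead of the server-side UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY error.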

"load_time": 36.227,
"data_size": 10219802927,
"result": [
[0.552, 0.314, 0.28],
@rschu1ze (Member):

My measurements:

[0.919,0.757,0.332],
[2.078,0.656,0.569],
[1.632,0.674,0.63],
[1.782,0.511,0.62],
[1.675,1.413,1.584],
[3.137,2.293,1.994],
[1.614,0.592,0.568],
[0.594,0.523,0.554],
[2.971,2.133,2.339],
[3.37,3.202,3.037],
[1.528,0.928,0.866],
[1.544,1.015,0.896],
[2.298,2.244,2.054],
[3.137,3.016,3.537],
[3.351,2.586,2.31],
[2.097,2.443,2.071],
[4.276,4.665,4.938],
[3.443,3.76,3.596],
[10.056,7.642,7.589],
[0.392,0.391,0.351],
[3.818,2.363,2.293],
[2.632,2.647,2.53],
[5.895,4.124,3.953],
[16.109,7.771,6.72],
[1.442,1.338,1.303],
[0.972,0.783,0.84],
[1.339,1.178,1.369],
[2.585,2.507,2.714],
[20.725,22.23,21.664],
[3.131,2.943,2.743],
[2.143,1.894,2.076],
[2.215,1.987,1.913],
[8.006,5.456,6.704],
[12.008,11.169,9.873],
[10.754,9.814,9.637],
[2.15,2.29,2.175],
[0.869,0.945,0.697],
[0.458,0.462,0.45],
[0.685,0.503,0.549],
[1.137,0.982,1.005],
[0.51,0.6,0.43],
[0.45,0.524,0.447],
[0.55,0.414,0.374]

My runtimes are ca. 40% slower; perhaps Databricks had a bad day.
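One way to put a number on that kind of gap (a sketch, not part of the benchmark harness): take the best of the three runs per query in each result set and look at the median ratio. `median_slowdown` and the sample rows below are hypothetical.

```python
def median_slowdown(slower, faster):
    """Median over queries of best-of-runs(slower) / best-of-runs(faster)."""
    ratios = sorted(min(a) / min(b) for a, b in zip(slower, faster))
    n = len(ratios)
    mid = n // 2
    # Average the two middle ratios when the query count is even.
    return ratios[mid] if n % 2 else (ratios[mid - 1] + ratios[mid]) / 2
```

Feeding it the two full 43-query arrays from this thread would show how much of the difference is systematic versus a few outlier queries.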

@conormccarter (Contributor Author):

Interesting! I just ran again and got results very similar to my first run (see below). This is running in AWS us-east-2. Where is yours running? (You can find the region in the Workspaces tab.) I would like to try to reproduce your results.

[0.578,0.663,0.281],
[1.371,0.402,0.385],
[0.595,0.398,0.365],
[0.624,0.343,0.343],
[0.742,0.571,0.602],
[0.912,0.628,0.608],
[0.728,0.433,0.454],
[0.375,0.34,0.34],
[4.072,0.753,2.814],
[1.243,1.075,1.042],
[0.7,0.477,0.461],
[0.663,0.491,0.505],
[0.766,0.627,0.616],
[0.655,0.688,0.693],
[3.817,0.757,0.67],
[4.008,0.71,0.685],
[4.693,1.025,0.936],
[3.239,0.502,2.633],
[1.922,1.055,4.794],
[0.297,0.241,0.271],
[3.578,0.476,0.453],
[3.311,0.515,0.499],
[3.541,0.745,0.779],
[4.259,0.853,0.891],
[0.418,0.335,0.317],
[0.284,0.32,0.304],
[0.337,0.315,0.298],
[0.603,0.516,0.517],
[3.268,3.237,3.136],
[4.063,0.504,0.47],
[3.119,0.619,0.575],
[0.986,0.489,0.653],
[1.112,1.148,0.988],
[1.441,2.189,3.584],
[2.318,4.584,1.628],
[3.989,0.558,0.543],
[0.639,0.588,0.573],
[0.377,0.375,0.402],
[0.552,0.393,0.394],
[1.06,0.906,0.845],
[0.526,0.346,0.329],
[0.371,0.339,0.35],
[0.38,1.354,0.342]

@rschu1ze (Member):

It's us-east-2 for me as well, but I'm located in Germany, so we'll need to add some latency for the packets to travel over the big pond and back.

@rschu1ze rschu1ze merged commit 3b07d4f into ClickHouse:main Nov 19, 2025
@conormccarter conormccarter deleted the add-databricks branch November 19, 2025 18:35
Linked issue: Help wanted: Databricks