Skip to content

Conversation

@jayshrivastava
Copy link
Collaborator

@jayshrivastava jayshrivastava commented Nov 19, 2025

This change introduces a new property_based.rs test utility which lets us evaluate
correctness using properties. These are useful for evaluating correctness when
we do not know the expected output of a test (ex. if we were to fuzz the database
with randomized data or randomzed queries, then we can only verify the output using
properties). The two oracles are

  • ResultSetOracle: Compares the result set between single node and distributed datafusion
  • OrderingOracle: Uses plan properties to figure out the expected ordering and asserts it

This change does not introduce a fuzz test, but it introduces a TPC-DS test. This test
randomly generates data using the duckdb CLI and runs 99 queries on a distributed cluster.
The query outputs are validated against single-node datafusion using test utils in
metamorphic.rs. This test also randomizes the test cluster parameters - there's no harm
in doing so.

Next steps:

  • Add fuzzing
    • Now that we have property-based testing utils, we can properly fuzz the project
      using SQLancer
    • SQLancer produces INSERT and SELECT statements which we could point at a datafusion
      distributed cluster and verify against single node datafusion
    • Although it doesn't support nested select statements, 70% of the queries were valid
      datafusion queries, meaning these are good test cases for us
  • Add metrics oracle to validate output_rows metric / metrics propagation

@jayshrivastava jayshrivastava changed the title Js/tpcds fuzz Add fuzzing infrastructure for distributed DataFusion Nov 19, 2025
@jayshrivastava jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from 4684bbe to 281c72e Compare November 19, 2025 22:43
@jayshrivastava jayshrivastava marked this pull request as ready for review November 19, 2025 22:43
@jayshrivastava jayshrivastava changed the title Add fuzzing infrastructure for distributed DataFusion add tpc-ds tests and property-based testing utilities Nov 20, 2025
@jayshrivastava jayshrivastava force-pushed the js/tpcds-fuzz branch 10 times, most recently from 9bd3bfc to ce974fc Compare November 20, 2025 19:07
@jayshrivastava jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from d5d6f42 to 14e923a Compare November 20, 2025 20:29
@jayshrivastava jayshrivastava force-pushed the js/tpcds-fuzz branch 2 times, most recently from 9f7e288 to aaab00b Compare November 20, 2025 21:07
@jayshrivastava
Copy link
Collaborator Author

@gabotechs This is ready for another review 🙇🏽

Comment on lines 60 to 67
- name: Upload test artifacts on failure
if: failure() || steps.test.outcome == 'failure'
uses: actions/upload-artifact@v4
with:
name: tpcds-test-artifacts-${{ github.run_id }}
path: testdata/tpcds/data/**
retention-days: 7
if-no-files-found: ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 why would we want to upload the data on failure as an artifact?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I previously thought that data gen was random, but it's deterministic based on the scale factor. Removed.

Comment on lines 68 to 73
- name: Clean up test data
run: |
rm -rf testdata/tpcds/data/*
rm -f $HOME/.local/bin/duckdb
rm -rf /home/runner/.duckdb
df -h
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub runners are stateless, meaning that each step run is done in a newly spawned machine, so this step is unnecessary, as the machine running this step is destroyed after the run.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +51 to +56
run: |
curl https://install.duckdb.org | sh
mkdir -p $HOME/.local/bin
mv /home/runner/.duckdb/cli/latest/duckdb $HOME/.local/bin/
echo "$HOME/.local/bin" >> $GITHUB_PATH
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it not sufficient to just do:

curl https://install.duckdb.org | sh

I imagine that will already install the cli in the appropriate place accessible from $PATH

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the script, it does not add /home/runner/.duckdb/cli/latest/duckdb to the $PATH.

Cargo.toml Outdated
bytes = "1.10.1"

# integration_tests deps
base64 = { version = "0.22", optional = true }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this and rand_chacha be also added to dev-dependencies?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed both dependencies. Since we aren't randomizing anything, we don't need either. Removed the rand test util as well.

Comment on lines 29 to 53
pub async fn new(test_ctx: SessionContext, compare_ctx: SessionContext) -> Result<Self> {
let oracles: Vec<Box<dyn Oracle + Send + Sync>> =
vec![Box::new(ResultSetOracle {}), Box::new(OrderingOracle {})];

Ok(Validator {
test_ctx,
compare_ctx,
oracles,
})
}

/// Create a new Validator with ordering checks enabled.
pub async fn new_with_ordering(
test_ctx: SessionContext,
compare_ctx: SessionContext,
) -> Result<Self> {
let oracles: Vec<Box<dyn Oracle + Send + Sync>> =
vec![Box::new(ResultSetOracle {}), Box::new(OrderingOracle {})];

Ok(Validator {
test_ctx,
compare_ctx,
oracles,
})
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions are exactly the same, and the second one is unused, so I think it can be removed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed as a part of the refactor

Comment on lines 21 to 24
let num_workers = rng.gen_range(3..=8);
let files_per_task = rng.gen_range(2..=4);
let cardinality_task_count_factor = rng.gen_range(1.1..=3.0);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not convinced about this idea, if this has the bad luck of using num_workers=3 and cardinality_task_count_factor=3.0 then very little things will get distributed, making this test randomly succeed in situations where it should be failing.

I'm struggling to find value in randomizing this parameters, it seems like we are trying to brute force into finding issues that could perfectly be replicated with appropriate specific numbers, and I'm afraid the results of randomness are just going to be flaky tests instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order words, for the same reason that we are randomizing config values in this test, we should also be randomizing config values in every other test in this project. I don't think the TPC-DS tests have anything special on top of all other tests that makes it more suitable for parameters to be randomized.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I hardcoded this

const NUM_WORKERS: usize = 4;
const FILES_PER_TASK: usize = 2;
const CARDINALITY_TASK_COUNT_FACTOR: f64 = 2.0;

let me know if different numbers make more sense.

Comment on lines 90 to 100
eprintln!(
"Test summary - Success: {} Invalid: {} Failed: {} Valid %: {:.2}%",
successful,
invalid,
failed,
if successful + invalid > 0 {
(successful as f64 / (successful + invalid) as f64) * 100.0
} else {
0.0
}
);
Copy link
Collaborator

@gabotechs gabotechs Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than having our print statements for reporting test results, I would arrange tests here so that it's plain cargo test the one that performs the reporting.

This is how it's done upstream:

https://github.com/apache/datafusion/blob/7fa2a694bbc18608a46f85974e721fafa2503219/datafusion/core/tests/tpcds_planning.rs#L32-L32

And we also do the same in this project for the TPCH test.

having independent tests per query is also going to allow us to rerun just the query that fails vs running them all even if we only care about one.

Having one test per query is going to result in ~1000 LOC to make, so very good opportunity to learn about VIM macros.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I did not learn VIM macros 😅. I asked claude

Comment on lines 95 to 101
async fn validate(
&self,
_test_ctx: &SessionContext,
compare_ctx: &SessionContext,
query: &str,
test_result: &Result<Vec<RecordBatch>>,
) -> Result<()> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this can be simplified to just use a couple of functions?

Having this trait here is:

  • forcing you to do some juggling by ignoring certain arguments in certain implementations
  • running the query unnecessarily twice in single node, like in the OrderingOracle (run first in Validator::run and second in OrderingOracle::validate)
  • having an extra struct that glues all implementations together

I think that unless we start using a fuzzing framework that require us to implement some traits, we are perfectly fine with just simple pure functions, for example something like this:

be2c567

Trims 100 lines of code of extra traits and structs and it runs the same code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I copied your commit

jayshrivastava and others added 3 commits November 21, 2025 13:47
This change introduces a new `property_based.rs` test utility which lets us evaluate
correctness using properties. These are useful for evaluating correctness when
we do not know the expected output of a test (ex. if we were to fuzz the database
with randomized data or randomzed queries, then we can only verify the output using
properties). The two oracles are
- ResultSetOracle: Compares the result set between single node and distributed datafusion
- OrderingOracle: Uses plan properties to figure out the expected ordering and asserts it

This change does not introduce a fuzz test, but it introduces a TPC-DS test. This test
randomly generates data using the duckdb CLI and runs 99 queries on a distributed cluster.
The query outputs are validated against single-node datafusion using test utils in
`metamorphic.rs`. This test also randomizes the test cluster parameters - there's no harm
in doing so.

Next steps:
- Add fuzzing
  - Now that we have property-based testing utils, we can properly fuzz the project
    using SQLancer
  - SQLancer produces INSERT and SELECT statements which we could point at a datafusion
    distributed cluster and verify against single node datafusion
  - Although it doesn't support nested select statements, 70% of the queries were valid
    datafusion queries, meaning these are good test cases for us
- Add metrics oracle to validate output_rows metric / metrics propagation
This commit adds tpcds-randomized-test to ci. It relies on the duckdb cli for tpc-ds database generation. It also saves the artifacts if the test fails so we can reproduce issues.
@jayshrivastava jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from 862613f to 9dc1b85 Compare November 21, 2025 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants