Commit 4684bbe

Add fuzzing infrastructure for distributed DataFusion
This change introduces a new `fuzz` binary that enables fuzzing of distributed DataFusion against a localhost cluster. The fuzzing framework is designed to be extensible, allowing us to define custom workloads and oracles.

This change defines a TPC-DS workload which uses the duckdb data generator and 99 TPC-DS queries from their github repo. It also defines one oracle, the `SingleNodeOracle`, which validates the set of rows produced by distributed DataFusion against the set of rows produced by single-node DataFusion (if a query errors in both, this is not counted as a failure).

You can run this using the following command:
```
RUST_LOG=info cargo run --features integration --bin fuzz -- tpcds --force-regenerate
```

Fuzzing also uses randomized cluster configurations generated from a deterministic seed.

Next steps:
- Add ordering oracle to validate ORDER BY correctness
  - Idea: Inspect the ordering properties in the logical plan and assert this on the RecordBatches
- Observability
  - Log stats on queries that were invalid (i.e. failed to execute on single-node DataFusion) so we can measure the quality of queries in a workload
  - Add metrics oracle to validate the output_rows metric (ensuring metrics are working correctly)
- Set up nightly github actions workflow to run the fuzzer automatically
  - Ensure that the data is available to be downloaded so we can reproduce any failures locally
- Add SQLancer workload
  - SQLancer produces INSERT and SELECT statements which we could point at a distributed DataFusion cluster and verify using our oracles
  - Although it doesn't support nested select statements, I was able to get a ~30% error rate (meaning 70% of queries were valid DataFusion queries)
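For illustration, the comparison performed by the `SingleNodeOracle` could be sketched roughly as follows. This is a minimal sketch, not the code in this commit: `QueryResult`, `Verdict`, `row_counts` and `single_node_check` are hypothetical names, rows are modelled as plain strings rather than RecordBatches, and two behaviours are assumptions: rows are compared as a multiset (order ignored, duplicates counted), and an error on only one side is treated as a failure. The commit message only states that a query erroring on both sides is not a failure.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a query outcome: rendered rows, or an error message.
type QueryResult = Result<Vec<String>, String>;

#[derive(Debug, PartialEq)]
enum Verdict {
    /// Both engines returned the same rows (ignoring order).
    Pass,
    /// Both engines errored: the query is considered invalid, not a failure.
    SkippedInvalid,
    /// The engines disagreed; the message describes how.
    Fail(String),
}

/// Count occurrences of each row so the comparison ignores row order
/// but still notices duplicated or missing rows (assumption: multiset semantics).
fn row_counts(rows: &[String]) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for row in rows {
        *counts.entry(row.as_str()).or_insert(0) += 1;
    }
    counts
}

/// Compare a distributed result against the single-node result.
fn single_node_check(distributed: &QueryResult, single_node: &QueryResult) -> Verdict {
    match (distributed, single_node) {
        (Err(_), Err(_)) => Verdict::SkippedInvalid,
        // Assumption: an error on only one side counts as a failure.
        (Err(e), Ok(_)) => Verdict::Fail(format!("only distributed errored: {e}")),
        (Ok(_), Err(e)) => Verdict::Fail(format!("only single node errored: {e}")),
        (Ok(d), Ok(s)) => {
            if row_counts(d) == row_counts(s) {
                Verdict::Pass
            } else {
                Verdict::Fail(format!(
                    "row mismatch: distributed returned {} rows, single node returned {}",
                    d.len(),
                    s.len()
                ))
            }
        }
    }
}
```

Comparing row counts rather than ordered vectors keeps this check independent of ORDER BY, which is left to the separate ordering oracle listed under next steps.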
1 parent 731ad2a commit 4684bbe

109 files changed: +7007 -496 lines changed

.gitignore

Lines changed: 2 additions & 1 deletion

@@ -1,4 +1,5 @@
 /.idea
 /target
 /benchmarks/data/
-testdata/tpch/data/
+testdata/tpch/data/
+testdata/tpcds/data/
