Commit 4684bbe
Add fuzzing infrastructure for distributed DataFusion
This change introduces a new `fuzz` binary that enables fuzzing of distributed
DataFusion against a localhost cluster. The fuzzing framework is designed to
be extensible, allowing us to define custom workloads and oracles.
This change defines a TPC-DS workload that uses the DuckDB data generator and the 99
TPC-DS queries from the DuckDB GitHub repo. It also defines one oracle, the `SingleNodeOracle`,
which validates the set of rows produced by distributed DataFusion against the set of
rows produced by single-node DataFusion. A query that errors in both is not counted
as a failure. You can run the fuzzer with the following command:
```
RUST_LOG=info cargo run --features integration --bin fuzz -- tpcds --force-regenerate
```
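The comparison rule the `SingleNodeOracle` applies can be sketched roughly as below. This is a minimal illustration, not the code in this commit: the `Row` alias, the `Verdict` enum and the `check_single_node` function are hypothetical names, rows are modelled as plain string cells rather than Arrow `RecordBatch`es, and two points are assumptions that the commit message does not state: duplicates are compared as a multiset, and a query that errors on only one side is treated as a failure.

```rust
/// Hypothetical row type: each row rendered as a vector of string cells.
type Row = Vec<String>;

/// Outcome of one oracle check. These names are illustrative only.
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    /// The query failed on both engines, so it is not counted as a failure.
    Skipped,
    Fail(String),
}

/// Compare a distributed result against a single-node result.
fn check_single_node(
    distributed: Result<Vec<Row>, String>,
    single_node: Result<Vec<Row>, String>,
) -> Verdict {
    match (distributed, single_node) {
        // Errors on both sides: the query itself is invalid, not the engine.
        (Err(_), Err(_)) => Verdict::Skipped,
        // Assumption: an error on only one side is reported as a failure.
        (Err(e), Ok(_)) => Verdict::Fail(format!("only distributed failed: {e}")),
        (Ok(_), Err(e)) => Verdict::Fail(format!("only single node failed: {e}")),
        (Ok(mut d), Ok(mut s)) => {
            // Sort both sides so that row order does not affect the comparison.
            d.sort();
            s.sort();
            if d == s {
                Verdict::Pass
            } else {
                Verdict::Fail(format!("row mismatch: {} vs {} rows", d.len(), s.len()))
            }
        }
    }
}
```

Sorting before comparing keeps the check independent of row order, which is why ORDER BY correctness needs a separate oracle.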
The fuzzer also randomizes the cluster configuration, deriving it from a deterministic seed.
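As a rough sketch of how a seed can drive a randomized but repeatable cluster configuration: the `ClusterConfig` fields, their ranges and the `cluster_config_from_seed` function below are hypothetical and chosen only for illustration; the actual knobs live in the `fuzz` binary. A small SplitMix64 generator stands in for whatever random number generator the fuzzer really uses, so the example needs no external crates.

```rust
/// Minimal SplitMix64 generator so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in the inclusive range [lo, hi]; slight modulo bias is acceptable here.
    fn in_range(&mut self, lo: u64, hi: u64) -> u64 {
        lo + self.next_u64() % (hi - lo + 1)
    }
}

/// Hypothetical knobs; the real fuzzer's configuration fields may differ.
#[derive(Debug, Clone, PartialEq)]
struct ClusterConfig {
    num_workers: u64,
    partitions_per_task: u64,
}

/// Derive every random choice from the seed, so the same seed
/// always yields the same cluster configuration.
fn cluster_config_from_seed(seed: u64) -> ClusterConfig {
    let mut rng = SplitMix64(seed);
    ClusterConfig {
        num_workers: rng.in_range(1, 8),
        partitions_per_task: rng.in_range(1, 16),
    }
}
```

Because the configuration is a pure function of the seed, logging the seed alongside a failing query is enough to rebuild the same cluster shape later.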
Next steps:
- Add an ordering oracle to validate ORDER BY correctness
  - Idea: Inspect the ordering properties in the logical plan and assert them
    on the RecordBatches
- Observability
  - Log stats on queries that were invalid (i.e. failed to execute on single-node DataFusion)
    so we can measure the quality of queries in a workload
- Add a metrics oracle to validate the output_rows metric (ensuring metrics are
  working correctly)
- Set up a nightly GitHub Actions workflow to run the fuzzer automatically
  - Ensure that the data is available to be downloaded so we can reproduce any failures
    locally
- Add a SQLancer workload
  - SQLancer produces INSERT and SELECT statements, which we could point at a distributed
    DataFusion cluster and verify using our oracles
  - Although it doesn't support nested SELECT statements, I was able to get a ~30% error
    rate (meaning 70% of queries were valid DataFusion queries)

1 parent 731ad2a commit 4684bbe
File tree
109 files changed (+7007, -496 lines)
- src
  - bin
  - test_utils
- testdata/tpcds
  - queries