add tpc-ds tests and property-based testing utilities #231
base: main
Conversation
Force-pushed from dc94612 to 3260143
Force-pushed from 4684bbe to 281c72e
Force-pushed from 281c72e to 1a3260f
Force-pushed from 9bd3bfc to ce974fc
Force-pushed from d5d6f42 to 14e923a
Force-pushed from 9f7e288 to aaab00b
Force-pushed from aaab00b to 5776a5a
@gabotechs This is ready for another review 🙇🏽
.github/workflows/ci.yml
Outdated
- name: Upload test artifacts on failure
  if: failure() || steps.test.outcome == 'failure'
  uses: actions/upload-artifact@v4
  with:
    name: tpcds-test-artifacts-${{ github.run_id }}
    path: testdata/tpcds/data/**
    retention-days: 7
    if-no-files-found: ignore
🤔 why would we want to upload the data on failure as an artifact?
Ah I previously thought that data gen was random, but it's deterministic based on the scale factor. Removed.
.github/workflows/ci.yml
Outdated
- name: Clean up test data
  run: |
    rm -rf testdata/tpcds/data/*
    rm -f $HOME/.local/bin/duckdb
    rm -rf /home/runner/.duckdb
    df -h
GitHub-hosted runners are ephemeral: each job runs on a freshly spawned machine that is destroyed after the run, so this cleanup step is unnecessary.
Done
run: |
  curl https://install.duckdb.org | sh
  mkdir -p $HOME/.local/bin
  mv /home/runner/.duckdb/cli/latest/duckdb $HOME/.local/bin/
  echo "$HOME/.local/bin" >> $GITHUB_PATH
How is it not sufficient to just do:
curl https://install.duckdb.org | sh
I imagine that will already install the CLI in an appropriate place accessible from $PATH.
Looking at the script, it does not add /home/runner/.duckdb/cli/latest/duckdb to the $PATH.
Cargo.toml
Outdated
bytes = "1.10.1"

# integration_tests deps
base64 = { version = "0.22", optional = true }
Shouldn't this and rand_chacha also be added to dev-dependencies?
Removed both dependencies. Since we aren't randomizing anything, we don't need either. Removed the rand test util as well.
src/test_utils/property_based.rs
Outdated
pub async fn new(test_ctx: SessionContext, compare_ctx: SessionContext) -> Result<Self> {
    let oracles: Vec<Box<dyn Oracle + Send + Sync>> =
        vec![Box::new(ResultSetOracle {}), Box::new(OrderingOracle {})];

    Ok(Validator {
        test_ctx,
        compare_ctx,
        oracles,
    })
}

/// Create a new Validator with ordering checks enabled.
pub async fn new_with_ordering(
    test_ctx: SessionContext,
    compare_ctx: SessionContext,
) -> Result<Self> {
    let oracles: Vec<Box<dyn Oracle + Send + Sync>> =
        vec![Box::new(ResultSetOracle {}), Box::new(OrderingOracle {})];

    Ok(Validator {
        test_ctx,
        compare_ctx,
        oracles,
    })
}
These two functions are exactly the same, and the second one is unused, so I think it can be removed
Done. Removed as part of the refactor.
tests/tpc_ds_randomized.rs
Outdated
let num_workers = rng.gen_range(3..=8);
let files_per_task = rng.gen_range(2..=4);
let cardinality_task_count_factor = rng.gen_range(1.1..=3.0);
I'm still not convinced about this idea. If this has the bad luck of drawing num_workers=3 and cardinality_task_count_factor=3.0, then very little will get distributed, making this test randomly succeed in situations where it should be failing.
I'm struggling to find value in randomizing these parameters. It seems like we are trying to brute-force our way into finding issues that could perfectly well be reproduced with appropriate specific numbers, and I'm afraid the result of the randomness is just going to be flaky tests instead.
In other words, for the same reason that we are randomizing config values in this test, we should also be randomizing config values in every other test in this project. I don't think the TPC-DS tests have anything special on top of all the other tests that makes them more suitable for randomized parameters.
Done. I hardcoded these:
const NUM_WORKERS: usize = 4;
const FILES_PER_TASK: usize = 2;
const CARDINALITY_TASK_COUNT_FACTOR: f64 = 2.0;
let me know if different numbers make more sense.
tests/tpc_ds_randomized.rs
Outdated
eprintln!(
    "Test summary - Success: {} Invalid: {} Failed: {} Valid %: {:.2}%",
    successful,
    invalid,
    failed,
    if successful + invalid > 0 {
        (successful as f64 / (successful + invalid) as f64) * 100.0
    } else {
        0.0
    }
);
Rather than having our own print statements for reporting test results, I would arrange the tests here so that plain cargo test is what performs the reporting.
This is how it's done upstream:
And we also do the same in this project for the TPCH test.
Having independent tests per query is also going to allow us to rerun just the query that fails, instead of running them all even if we only care about one.
Having one test per query is going to be ~1000 LOC to write, so it's a very good opportunity to learn about Vim macros.
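As a rough sketch of what that per-query layout could look like (the macro and the run_tpcds_query helper below are hypothetical names for illustration, not the code that was actually committed), a small macro_rules! can stamp out one #[tokio::test] per query:

// Hypothetical helper that runs a single TPC-DS query on both contexts
// and compares the results; the real logic would live in the test utils.
async fn run_tpcds_query(query_no: usize) -> datafusion::error::Result<()> {
    let _ = query_no; // placeholder body for the sketch
    Ok(())
}

// One #[tokio::test] per query, so `cargo test` reports each query
// individually and a single failing query can be rerun in isolation.
macro_rules! tpcds_test {
    ($name:ident, $query_no:expr) => {
        #[tokio::test]
        async fn $name() -> datafusion::error::Result<()> {
            run_tpcds_query($query_no).await
        }
    };
}

tpcds_test!(test_tpcds_q1, 1);
tpcds_test!(test_tpcds_q2, 2);
// ... one invocation per query through q99 ...

Whether the committed code uses a macro or 99 written-out functions, the effect is the same: each query shows up under its own test name in the cargo test report.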
Done. I did not learn Vim macros 😅. I asked Claude.
src/test_utils/property_based.rs
Outdated
async fn validate(
    &self,
    _test_ctx: &SessionContext,
    compare_ctx: &SessionContext,
    query: &str,
    test_result: &Result<Vec<RecordBatch>>,
) -> Result<()> {
Do you think this can be simplified to just use a couple of functions? Having this trait here is:
- forcing you to do some juggling by ignoring certain arguments in certain implementations
- running the query unnecessarily twice in single node, like in the OrderingOracle (run first in Validator::run and second in OrderingOracle::validate)
- having an extra struct that glues all implementations together
I think that unless we start using a fuzzing framework that requires us to implement some traits, we are perfectly fine with just simple pure functions, for example something like this:
It trims 100 lines of extra traits and structs, and it runs the same code.
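For illustration only (this is not the reviewer's linked snippet), a pure-function version of the result-set check could look roughly like the following, assuming the comparison is a textual diff of the pretty-printed batches:

use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// Run `query` on both contexts and assert the result sets match.
/// A real check may additionally need row sorting or float tolerance.
pub async fn assert_same_results(
    test_ctx: &SessionContext,
    compare_ctx: &SessionContext,
    query: &str,
) -> Result<()> {
    let test_batches = test_ctx.sql(query).await?.collect().await?;
    let expected_batches = compare_ctx.sql(query).await?.collect().await?;
    let test_output = pretty_format_batches(&test_batches)?.to_string();
    let expected_output = pretty_format_batches(&expected_batches)?.to_string();
    assert_eq!(
        expected_output, test_output,
        "result mismatch for query: {query}"
    );
    Ok(())
}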
Done. I copied your commit
Force-pushed from 84a780b to b96a5c2
This change introduces a new `property_based.rs` test utility which lets us evaluate
correctness using properties. These are useful for evaluating correctness when
we do not know the expected output of a test (e.g. if we were to fuzz the database
with randomized data or randomized queries, then we can only verify the output using
properties). The two oracles are:
- ResultSetOracle: Compares the result set between single-node and distributed datafusion
- OrderingOracle: Uses plan properties to figure out the expected ordering and asserts it (see the sketch after this list)
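A hedged sketch of the OrderingOracle idea, assuming a DataFusion version where ExecutionPlanProperties::output_ordering() is available (verifying the collected batches against that ordering is omitted here):

use std::sync::Arc;
use datafusion::error::Result;
use datafusion::physical_plan::{ExecutionPlan, ExecutionPlanProperties};
use datafusion::prelude::SessionContext;

/// Extract the ordering the physical plan promises for its output; an
/// ordering oracle would then check the collected batches against it.
pub async fn expected_output_ordering(ctx: &SessionContext, query: &str) -> Result<Option<String>> {
    let plan: Arc<dyn ExecutionPlan> = ctx.sql(query).await?.create_physical_plan().await?;
    Ok(plan.output_ordering().map(|ordering| format!("{ordering:?}")))
}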
This change does not introduce a fuzz test, but it introduces a TPC-DS test. This test
randomly generates data using the duckdb CLI and runs 99 queries on a distributed cluster.
The query outputs are validated against single-node datafusion using test utils in
`metamorphic.rs`. This test also randomizes the test cluster parameters - there's no harm
in doing so.
Next steps:
- Add fuzzing
  - Now that we have property-based testing utils, we can properly fuzz the project using SQLancer
  - SQLancer produces INSERT and SELECT statements which we could point at a datafusion distributed cluster and verify against single-node datafusion
  - Although it doesn't support nested select statements, 70% of the queries were valid datafusion queries, meaning these are good test cases for us
- Add metrics oracle to validate the output_rows metric / metrics propagation
This commit adds the tpcds-randomized-test job to CI. It relies on the duckdb CLI for TPC-DS database generation. It also saves the artifacts if the test fails so we can reproduce issues.
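As a rough illustration of that data-generation step (the scale factor, output path, and exact invocation are placeholders; DuckDB's tpcds extension provides dsdgen, and EXPORT DATABASE can write Parquet):

use std::process::Command;

/// Generate TPC-DS data with the duckdb CLI; the same scale factor
/// reproduces the same data, which keeps the tests deterministic.
fn generate_tpcds_data(sf: f64, out_dir: &str) -> std::io::Result<()> {
    let sql = format!(
        "INSTALL tpcds; LOAD tpcds; CALL dsdgen(sf = {sf}); \
         EXPORT DATABASE '{out_dir}' (FORMAT PARQUET);"
    );
    let status = Command::new("duckdb").arg("-c").arg(&sql).status()?;
    assert!(status.success(), "duckdb data generation failed");
    Ok(())
}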
Force-pushed from 862613f to 9dc1b85
Force-pushed from 9dc1b85 to 67ce84c