Update the benchmark to load datasets from multiple S3 buckets when None is specified#607
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #607 +/- ##
==========================================
+ Coverage 85.67% 85.91% +0.24%
==========================================
Files 40 40
Lines 3679 3749 +70
==========================================
+ Hits 3152 3221 +69
- Misses 527 528 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
95be93a to
9a91033
Compare
|
|
||
| def _get_dataset_bucket_mapping(modality, buckets, s3_client, skip_inaccessible=False): | ||
| """Map SDV demo dataset names to the bucket they should be loaded from.""" | ||
| dataset_buckets = {} |
There was a problem hiding this comment.
nitpick: can we call dataset_to_bucket
725c22c to
5555441
Compare
| """Main SDGym benchmarking module.""" | ||
|
|
||
| import functools | ||
| import gzip |
There was a problem hiding this comment.
Since the dataset is no longer included in the job_arg_list, we don't need to compress it. Also, it now contains one job by default, and from my tests, it's roughly 1KB.
| def _get_s3_client_from_result_writer(result_writer): | ||
| if isinstance(result_writer, S3ResultsWriter): | ||
| return result_writer.s3_client | ||
|
|
||
| return None |
There was a problem hiding this comment.
Just want to note that in the future, we should setup the s3 client independently from using the same credentials as writing the results.
b97fa0d to
0786528
Compare
4bd6e85 to
4907bb3
Compare
4907bb3 to
e4bfb1d
Compare
Resolve #604
CU-86b9zwpcu
This PR includes:
download_demo()to getsdv_dataset. This works for public datasets or for users with SDV-Enterprise installed. Otherwise, because on SDV there is some validation to allow access to the public bucket only, I reused the SDV logic here, so it works if the S3 client is created with valid credentials (from input keys or env variables).download_demo(), we no longer save the dataset locally._generate_job_arg_list: Now we're only passingdataset_infoinside it. Thedataset_infoincludes everything necessary to then download the data and metadata during execution.