feat: Recover from reading incompatible schema metadata if validation is allowed #231
base: main
Conversation
…rquet error handling Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
Codecov Report

```
@@            Coverage Diff            @@
##              main     #231     +/-   ##
===========================================
- Coverage   100.00%   99.86%    -0.14%
===========================================
  Files           53       53
  Lines         3019     3061      +42
===========================================
+ Hits          3019     3057      +38
- Misses           0        4       +4
```
AndreasAlbertQC left a comment
Thanks @MoritzPotthoffQC! Small questions:
tests/collection/test_storage.py
Outdated
| ["tmp_path", pytest.param("s3_tmp_path", marks=pytest.mark.s3)], | ||
| indirect=True, | ||
| ) | ||
| def test_read_write_parquet_old_schema_contents( |
Should this test be parametrized over storage backends rather than being Parquet-specific?
TODO: Also fix metadata on untyped writes and get rid of kwargs (?)
Updated the test to use @pytest.mark.parametrize("tester", TESTERS) and added set_metadata method to CollectionStorageTester with implementations for both Parquet and Delta backends. The test is now renamed to test_read_write_old_metadata_contents and uses the write_typed + set_metadata pattern. Commit: 9e65575
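For reference, a minimal sketch of the reworked test pattern described here; `MyCollection`, the `sample()` helper, and the `tester.read` signature are illustrative placeholders and may differ from the actual test suite:

```python
import pytest


@pytest.mark.parametrize("tester", TESTERS)  # TESTERS: Parquet and Delta testers
def test_read_write_old_metadata_contents(tester, any_tmp_path, lazy):
    collection = MyCollection.sample()  # placeholder for building a valid collection
    tester.write_typed(collection, any_tmp_path, lazy)
    # Overwrite the stored metadata to mimic an older serialization format
    # that used "primary_keys" instead of "primary_key".
    tester.set_metadata(
        any_tmp_path,
        lazy,
        metadata={
            COLLECTION_METADATA_KEY: collection.serialize().replace(
                "primary_key", "primary_keys"
            )
        },
    )
    # With non-strict deserialization, reading should fall back to validation
    # rather than failing on the incompatible metadata.
    result = tester.read(MyCollection, any_tmp_path, lazy, validation="allow")
    assert result is not None
```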
| ["tmp_path", pytest.param("s3_tmp_path", marks=pytest.mark.s3)], | ||
| indirect=True, | ||
| ) | ||
| def test_read_write_parquet_old_metadata_contents( |
Same Q here
The schema test test_read_write_parquet_old_metadata_contents already uses @pytest.mark.parametrize("tester", TESTERS) and the set_metadata method. Updated to use the new DeserializationError exception. Commit: 9e65575
dataframely/collection/collection.py
Outdated
```python
    )
except (ValueError, TypeError, JSONDecodeError, plexc.ComputeError) as e:
    if strict:
        raise e from e
```
Would it make sense to raise a specific exception here that states in its text that we are raising because of strict?
TODO: raise DeserializationError("...") from e
@copilot raise a DeserializationError("The collection could not be deserialized") from e in this case. Create the exception.
Created DeserializationError exception and now raising it with raise DeserializationError("The collection could not be deserialized") from e (also applied to schema deserialization for consistency). The exception is exported from dataframely for public use. Commit: 9e65575
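For context, the resulting pattern looks roughly like this (a sketch; the surrounding function and the `_deserialize` helper are illustrative placeholders, not the exact repository code):

```python
from json import JSONDecodeError

import polars.exceptions as plexc


class DeserializationError(Exception):
    """Raised when a serialized schema or collection cannot be deserialized."""


def _deserialize_or_none(serialized: str, strict: bool):
    """Sketch of the error handling discussed in this thread."""
    try:
        return _deserialize(serialized)  # placeholder for the real deserializer
    except (ValueError, TypeError, JSONDecodeError, plexc.ComputeError) as e:
        if strict:
            raise DeserializationError(
                "The collection could not be deserialized"
            ) from e
        # Non-strict mode: signal the caller to fall back to validating the data.
        return None
```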
tests/collection/test_storage.py
Outdated
```python
tester.write_untyped(
    collection,
    any_tmp_path,
    lazy,
    metadata={
        COLLECTION_METADATA_KEY: collection.serialize().replace(
            "primary_key", "primary_keys"
        )
    },
)
```
@copilot currently, metadata is ignored on both write_untyped and write_typed. Please add methods set_metadata to the respective classes that overwrite the existing metadata with the given metadata and use them wherever a test currently uses the metadata kwarg of these methods.
Updated test_read_write_parquet_schema_json_fallback_corrupt to use write_untyped + set_metadata pattern instead of passing metadata kwarg. The set_metadata method was already added to CollectionStorageTester in a previous commit. Commit: 96d336a
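As an illustration, a Parquet-backed `set_metadata` could overwrite a file's key-value metadata roughly like this (a sketch assuming pyarrow; the real tester operates on the collection's member files and also has a Delta Lake counterpart):

```python
from pathlib import Path

import pyarrow.parquet as pq


def set_metadata(path: Path, metadata: dict[str, str]) -> None:
    """Replace the schema-level key-value metadata of an existing Parquet file."""
    table = pq.read_table(path)
    # replace_schema_metadata drops the previous metadata and attaches the new dict.
    table = table.replace_schema_metadata(metadata)
    pq.write_table(table, path)
```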
…torage backends Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
…rquet_schema_json_fallback_corrupt Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
```python
tester.set_metadata(
    any_tmp_path,
    lazy,
    metadata={COLLECTION_METADATA_KEY: "} this is not a valid JSON {"},
```
This was previously ignored (the test passed because there was no metadata in the first place).
```diff
 1. **Python code**: Run `pixi run pre-commit run` before committing
 2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests
-3. **Tests**: Ensure `pixi run test` passes
+3. **Tests**: Ensure `pixi run test` passes. If changes might affect storage backends, use `pixi run test -m s3`.
```
I hope that copilot is clever enough to figure this out and this saves some running time versus always requiring this.
Motivation
When serialization formats change (e.g., `primary_keys` vs `primary_key`), previously serialized schemas/collections become unreadable. Schema deserialization already handles this via a `strict` parameter, but collections did not. Additionally, when reading serialized data, using non-strict schema deserialization in all cases where validation is allowed to run makes it possible to recover from unreadable metadata where it is not needed.
Fixes #230
Changes
- Added a `strict` parameter to `deserialize_collection`: mirrors the existing `deserialize_schema` behavior; when `strict=False`, deserialization errors return `None` instead of raising exceptions
- Passed `strict` through the deserialization chain: updated `_deserialize_types` to accept and forward the parameter
- `scan_parquet`/`read_parquet` validation modes: both `Collection._read` and `Schema._validate_if_needed` now pass `strict=False` when `validation` is "allow", "skip" or "warn", allowing automatic fallback to validation when old formats are detected (see the sketch after this list)
- Added a `DeserializationError` exception: a new exception class raised when deserialization fails with `strict=True`, providing a clear and consistent error type for both schema and collection deserialization failures
- Renamed the collection test to `test_read_write_old_metadata_contents` and parametrized it over `TESTERS` instead of being Parquet-specific
- Added `set_metadata` to `CollectionStorageTester`: implemented the abstract method with Parquet and Delta backend implementations to support testing metadata manipulation across storage backends
- Switched to the `set_metadata` pattern: updated `test_read_write_parquet_schema_json_fallback_corrupt` to use `write_untyped` + `set_metadata` instead of passing the `metadata` kwarg, which was being ignored
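As a rough illustration of the fallback described above (the helper functions and the exact signature of `Collection._read` are simplified assumptions, not the repository code):

```python
def _resolve_strict(validation: str) -> bool:
    # Non-strict deserialization is acceptable whenever validation may still run,
    # because unreadable metadata can be recovered by re-validating the data.
    return validation not in ("allow", "skip", "warn")


def _read(source, validation: str = "warn"):
    metadata = _read_serialized_metadata(source)  # placeholder helper
    spec = deserialize_collection(metadata, strict=_resolve_strict(validation))
    if spec is None:
        # Old or incompatible metadata: fall back to validating the raw frames.
        return _read_and_validate(source, validation)  # placeholder helper
    return _read_with_known_spec(source, spec)  # placeholder helper
```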