
Conversation

Contributor

Copilot AI commented Nov 21, 2025

Motivation

When serialization formats change (e.g., primary_keys vs primary_key), previously serialized schemas/collections become unreadable. Schema deserialization already handles this via a strict parameter, but collection deserialization did not. Additionally, when reading serialized data, using non-strict schema deserialization whenever validation is allowed to run makes it possible to recover from unreadable metadata in cases where the metadata is not strictly needed.

Fixes #230

Changes

  • Added strict parameter to deserialize_collection: Mirrors the existing deserialize_schema behavior; when strict=False, it returns None on deserialization errors instead of raising (see the sketch after this list)
  • Propagated strict through deserialization chain: Updated _deserialize_types to accept and forward the parameter
  • Connected to scan_parquet/read_parquet validation modes: Both Collection._read and Schema._validate_if_needed now pass strict=False when validation is "allow", "skip" or "warn", allowing automatic fallback to validation when old formats are detected
  • Added DeserializationError exception: Created a new exception class that is raised when deserialization fails with strict=True, providing a clear and consistent error type for both schema and collection deserialization failures
  • Parametrized tests over storage backends: Updated test_read_write_old_metadata_contents for collections to use TESTERS parametrization instead of being Parquet-specific
  • Added set_metadata to CollectionStorageTester: Implemented abstract method with Parquet and Delta backend implementations to support testing metadata manipulation across storage backends
  • Updated tests to use set_metadata pattern: Updated test_read_write_parquet_schema_json_fallback_corrupt to use write_untyped + set_metadata instead of passing the metadata kwarg, which was being ignored
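
To make the new behavior concrete, here is a minimal, self-contained sketch of the fallback logic described above. Only the concepts come from this PR (the strict flag, the DeserializationError exception, and the "allow"/"skip"/"warn" validation modes); every name, signature, and the JSON stand-in parser below are illustrative assumptions rather than dataframely's actual API.

```python
from __future__ import annotations

import json


class DeserializationError(Exception):
    """Raised when serialized schema/collection metadata cannot be parsed."""


def deserialize_collection(data: str, strict: bool = True) -> dict | None:
    # json.loads stands in for the real deserialization logic.
    try:
        return json.loads(data)
    except (ValueError, TypeError) as e:
        if strict:
            raise DeserializationError("The collection could not be deserialized") from e
        return None  # non-strict: signal "unreadable" so the caller can recover


def read_collection(serialized: str, validation: str) -> dict:
    # Non-strict deserialization is only safe when validation may still run.
    strict = validation not in ("allow", "skip", "warn")
    collection = deserialize_collection(serialized, strict=strict)
    if collection is None:
        # Fall back to validating the data against the current definitions.
        collection = {}
    return collection
```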


Copilot AI and others added 2 commits November 21, 2025 14:49
…rquet error handling

Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Catch errors when reading serialized schemas for collections" to "Add strict parameter to collection deserialization for backward compatibility" Nov 21, 2025
Copilot finished work on behalf of MoritzPotthoffQC November 21, 2025 14:54
@codecov

codecov bot commented Nov 21, 2025

Codecov Report

❌ Patch coverage is 94.11765% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.86%. Comparing base (1660d88) to head (f0f137e).

Files with missing lines Patch % Lines
dataframely/testing/storage.py 89.47% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main     #231      +/-   ##
===========================================
- Coverage   100.00%   99.86%   -0.14%     
===========================================
  Files           53       53              
  Lines         3019     3061      +42     
===========================================
+ Hits          3019     3057      +38     
- Misses           0        4       +4     


@MoritzPotthoffQC MoritzPotthoffQC changed the title from "Add strict parameter to collection deserialization for backward compatibility" to "feat: Recover from reading incompatible schema metadata if validation can be used" Nov 24, 2025
@github-actions github-actions bot added the "enhancement" (New feature or request) label Nov 24, 2025
@MoritzPotthoffQC MoritzPotthoffQC changed the title from "feat: Recover from reading incompatible schema metadata if validation can be used" to "feat: Recover from reading incompatible schema metadata if validation is allowed" Nov 24, 2025
@MoritzPotthoffQC MoritzPotthoffQC marked this pull request as ready for review November 24, 2025 14:04
Collaborator

@AndreasAlbertQC AndreasAlbertQC left a comment

Thanks @MoritzPotthoffQC! Small questions:

["tmp_path", pytest.param("s3_tmp_path", marks=pytest.mark.s3)],
indirect=True,
)
def test_read_write_parquet_old_schema_contents(
Collaborator

Should this test be parametrized over storage backends rather than being Parquet-specific?

Contributor

TODO: Also fix metadata on untyped writes and get rid of kwargs (?)

Contributor Author

Updated the test to use @pytest.mark.parametrize("tester", TESTERS) and added set_metadata method to CollectionStorageTester with implementations for both Parquet and Delta backends. The test is now renamed to test_read_write_old_metadata_contents and uses the write_typed + set_metadata pattern. Commit: 9e65575
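
For reference, a rough sketch of what a Parquet-backed set_metadata could look like, assuming the serialized collection metadata is stored as key-value metadata on the member parquet files; the actual implementation added in commit 9e65575 may differ.

```python
from pathlib import Path

import pyarrow.parquet as pq


def set_parquet_metadata(path: Path, metadata: dict[str, str]) -> None:
    """Overwrite the key-value metadata of every parquet file under `path`."""
    for file in sorted(path.rglob("*.parquet")):
        table = pq.read_table(file)
        # replace_schema_metadata discards the existing key-value metadata entirely.
        pq.write_table(table.replace_schema_metadata(metadata), file)
```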

["tmp_path", pytest.param("s3_tmp_path", marks=pytest.mark.s3)],
indirect=True,
)
def test_read_write_parquet_old_metadata_contents(
Collaborator

Same Q here

Contributor Author

The schema test test_read_write_parquet_old_metadata_contents already uses @pytest.mark.parametrize("tester", TESTERS) and the set_metadata method. Updated to use the new DeserializationError exception. Commit: 9e65575

)
except (ValueError, TypeError, JSONDecodeError, plexc.ComputeError) as e:
if strict:
raise e from e
Collaborator

Would it make sense to raise a specific exception here that states in its text that we are raising because of strict?

Contributor

TODO: raise DeserializationError("...") from e

Contributor

@copilot raise a DeserializationError("The collection could not be deserialized") from e in this case. Create the exception.

Contributor Author

Created DeserializationError exception and now raising it with raise DeserializationError("The collection could not be deserialized") from e (also applied to schema deserialization for consistency). The exception is exported from dataframely for public use. Commit: 9e65575
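
Since the exception is exported from dataframely as described above, callers can catch it explicitly when strict deserialization fails. A small usage sketch; the `dy` alias is conventional, and `read_strict` is a hypothetical placeholder for whatever strict read path raises the error.

```python
from __future__ import annotations

import dataframely as dy


def read_strict(serialized: str) -> dict:
    """Hypothetical stand-in for a strict deserialization call."""
    raise dy.DeserializationError("The collection could not be deserialized")


def load_or_rebuild(serialized: str) -> dict | None:
    try:
        return read_strict(serialized)
    except dy.DeserializationError:
        # Old or incompatible metadata: fall back to re-validating the raw data.
        return None
```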

Comment on lines 440 to 449
tester.write_untyped(
collection,
any_tmp_path,
lazy,
metadata={
COLLECTION_METADATA_KEY: collection.serialize().replace(
"primary_key", "primary_keys"
)
},
)
Contributor

@copilot currently, metadata is ignored on both write_untyped and write_typed. Please add methods set_metadata to the respective classes that overwrite the existing metadata with the given metadata and use them wherever a test currently uses the metadata kwarg of these methods.

Contributor Author

Updated test_read_write_parquet_schema_json_fallback_corrupt to use write_untyped + set_metadata pattern instead of passing metadata kwarg. The set_metadata method was already added to CollectionStorageTester in a previous commit. Commit: 96d336a

…torage backends

Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
…rquet_schema_json_fallback_corrupt

Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
Copilot finished work on behalf of MoritzPotthoffQC November 28, 2025 18:00
tester.set_metadata(
any_tmp_path,
lazy,
metadata={COLLECTION_METADATA_KEY: "} this is not a valid JSON {"},
Contributor

This was previously ignored (the test passed because there was no metadata in the first place).

1. **Python code**: Run `pixi run pre-commit run` before committing
2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests
3. **Tests**: Ensure `pixi run test` passes
3. **Tests**: Ensure `pixi run test` passes. If changes might affect storage backends, use `pixi run test -m s3`.
Contributor

I hope that copilot is clever enough to figure this out and this saves some running time versus always requiring this.


Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Catch errors when reading serialized schemas also for collections

3 participants