feat: Recover from reading incompatible schema metadata if validation is allowed #231
base: main
Conversation
…rquet error handling Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
Codecov Report

```
@@            Coverage Diff            @@
##              main     #231     +/-   ##
===========================================
- Coverage   100.00%   99.86%    -0.14%
===========================================
  Files           53       53
  Lines         3019     3061      +42
===========================================
+ Hits          3019     3057      +38
- Misses           0        4       +4
```
AndreasAlbertQC left a comment
Thanks @MoritzPotthoffQC! Small questions:
tests/collection/test_storage.py
Outdated
| ["tmp_path", pytest.param("s3_tmp_path", marks=pytest.mark.s3)], | ||
| indirect=True, | ||
| ) | ||
| def test_read_write_parquet_old_schema_contents( |
Should this test be parametrized over storage backends rather than being Parquet-specific?
TODO: Also fix metadata on untyped writes and get rid of kwargs (?)
Updated the test to use @pytest.mark.parametrize("tester", TESTERS) and added set_metadata method to CollectionStorageTester with implementations for both Parquet and Delta backends. The test is now renamed to test_read_write_old_metadata_contents and uses the write_typed + set_metadata pattern. Commit: 9e65575
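For reference, a minimal sketch of the reworked test pattern described here; `MyCollection`, the `sample()` helper, and the `tester.read` signature are illustrative placeholders and may differ from the actual test suite:

```python
import pytest


@pytest.mark.parametrize("tester", TESTERS)  # TESTERS: Parquet and Delta testers
def test_read_write_old_metadata_contents(tester, any_tmp_path, lazy):
    collection = MyCollection.sample()  # placeholder for building a valid collection
    tester.write_typed(collection, any_tmp_path, lazy)
    # Overwrite the stored metadata to mimic an older serialization format
    # that used "primary_keys" instead of "primary_key".
    tester.set_metadata(
        any_tmp_path,
        lazy,
        metadata={
            COLLECTION_METADATA_KEY: collection.serialize().replace(
                "primary_key", "primary_keys"
            )
        },
    )
    # With non-strict deserialization, reading should fall back to validation
    # rather than failing on the incompatible metadata.
    result = tester.read(MyCollection, any_tmp_path, lazy, validation="allow")
    assert result is not None
```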
| ["tmp_path", pytest.param("s3_tmp_path", marks=pytest.mark.s3)], | ||
| indirect=True, | ||
| ) | ||
| def test_read_write_parquet_old_metadata_contents( |
Same Q here
The schema test test_read_write_parquet_old_metadata_contents already uses @pytest.mark.parametrize("tester", TESTERS) and the set_metadata method. Updated to use the new DeserializationError exception. Commit: 9e65575
dataframely/collection/collection.py
Outdated
```python
    )
except (ValueError, TypeError, JSONDecodeError, plexc.ComputeError) as e:
    if strict:
        raise e from e
```
Would it make sense to raise a specific exception here that states in its text that we are raising because of strict?
TODO: raise DeserializationError("...") from e
@copilot raise a DeserializationError("The collection could not be deserialized") from e in this case. Create the exception.
Created DeserializationError exception and now raising it with raise DeserializationError("The collection could not be deserialized") from e (also applied to schema deserialization for consistency). The exception is exported from dataframely for public use. Commit: 9e65575
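For context, the resulting pattern looks roughly like this (a sketch; the surrounding function and the `_deserialize` helper are illustrative placeholders, not the exact repository code):

```python
from json import JSONDecodeError

import polars.exceptions as plexc


class DeserializationError(Exception):
    """Raised when a serialized schema or collection cannot be deserialized."""


def _deserialize_or_none(serialized: str, strict: bool):
    """Sketch of the error handling discussed in this thread."""
    try:
        return _deserialize(serialized)  # placeholder for the real deserializer
    except (ValueError, TypeError, JSONDecodeError, plexc.ComputeError) as e:
        if strict:
            raise DeserializationError(
                "The collection could not be deserialized"
            ) from e
        # Non-strict mode: signal the caller to fall back to validating the data.
        return None
```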
tests/collection/test_storage.py
Outdated
```python
tester.write_untyped(
    collection,
    any_tmp_path,
    lazy,
    metadata={
        COLLECTION_METADATA_KEY: collection.serialize().replace(
            "primary_key", "primary_keys"
        )
    },
)
```
@copilot currently, metadata is ignored on both write_untyped and write_typed. Please add methods set_metadata to the respective classes that overwrite the existing metadata with the given metadata and use them wherever a test currently uses the metadata kwarg of these methods.
Updated test_read_write_parquet_schema_json_fallback_corrupt to use write_untyped + set_metadata pattern instead of passing metadata kwarg. The set_metadata method was already added to CollectionStorageTester in a previous commit. Commit: 96d336a
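As an illustration, a Parquet-backed `set_metadata` could overwrite a file's key-value metadata roughly like this (a sketch assuming pyarrow; the real tester operates on the collection's member files and also has a Delta Lake counterpart):

```python
from pathlib import Path

import pyarrow.parquet as pq


def set_metadata(path: Path, metadata: dict[str, str]) -> None:
    """Replace the schema-level key-value metadata of an existing Parquet file."""
    table = pq.read_table(path)
    # replace_schema_metadata drops the previous metadata and attaches the new dict.
    table = table.replace_schema_metadata(metadata)
    pq.write_table(table, path)
```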
…torage backends Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
…rquet_schema_json_fallback_corrupt Co-authored-by: MoritzPotthoffQC <160181542+MoritzPotthoffQC@users.noreply.github.com>
```python
tester.set_metadata(
    any_tmp_path,
    lazy,
    metadata={COLLECTION_METADATA_KEY: "} this is not a valid JSON {"},
```
This was previously ignored (the test passed because there was no metadata in the first place).
```diff
 1. **Python code**: Run `pixi run pre-commit run` before committing
 2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests
-3. **Tests**: Ensure `pixi run test` passes
+3. **Tests**: Ensure `pixi run test` passes. If changes might affect storage backends, use `pixi run test -m s3`.
```
I hope that copilot is clever enough to figure this out and this saves some running time versus always requiring this.
Motivation
When serialization formats change (e.g., `primary_keys` vs `primary_key`), previously serialized schemas/collections become unreadable. Schema deserialization already handles this via a `strict` parameter, but collections did not. Additionally, when reading serialized data, using non-strict schema deserialization in all cases where validation is allowed to run makes it possible to recover from unreadable metadata where it is not needed.
Fixes #230
Changes
- Added a `strict` parameter to `deserialize_collection`: mirrors the existing `deserialize_schema` behavior; when `strict=False`, deserialization errors return `None` instead of raising exceptions
- Passed `strict` through the deserialization chain: updated `_deserialize_types` to accept and forward the parameter
- `scan_parquet`/`read_parquet` validation modes: both `Collection._read` and `Schema._validate_if_needed` now pass `strict=False` when `validation` is "allow", "skip" or "warn", allowing automatic fallback to validation when old formats are detected (see the sketch after this list)
- Added a `DeserializationError` exception: a new exception class raised when deserialization fails with `strict=True`, providing a clear and consistent error type for both schema and collection deserialization failures
- Renamed the collection test to `test_read_write_old_metadata_contents` and parametrized it over `TESTERS` instead of being Parquet-specific
- Added `set_metadata` to `CollectionStorageTester`: implemented the abstract method with Parquet and Delta backend implementations to support testing metadata manipulation across storage backends
- Switched to the `set_metadata` pattern: updated `test_read_write_parquet_schema_json_fallback_corrupt` to use `write_untyped` + `set_metadata` instead of passing the `metadata` kwarg, which was being ignored
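As a rough illustration of the fallback described above (the helper functions and the exact signature of `Collection._read` are simplified assumptions, not the repository code):

```python
def _resolve_strict(validation: str) -> bool:
    # Non-strict deserialization is acceptable whenever validation may still run,
    # because unreadable metadata can be recovered by re-validating the data.
    return validation not in ("allow", "skip", "warn")


def _read(source, validation: str = "warn"):
    metadata = _read_serialized_metadata(source)  # placeholder helper
    spec = deserialize_collection(metadata, strict=_resolve_strict(validation))
    if spec is None:
        # Old or incompatible metadata: fall back to validating the raw frames.
        return _read_and_validate(source, validation)  # placeholder helper
    return _read_with_known_spec(source, spec)  # placeholder helper
```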