refactor: align blob behavior that write via lance.blob.version, read via layout #5752

Xuanwo · 2026-01-19T15:14:29Z

This PR fixes a bug where users could encode blob v1 data even when blob v2 was enabled. Because our decoder only consulted the dataset's lance.blob.version configuration, it could pick the wrong layout for such data and fail to decode it.

In this PR, we changed the following:

During write operations, we determine which blob layout to use based on lance.blob.version (aka the user's BlobVersion).
During read operations, we decide which blob layout to decode based on its physical layout.
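
The write/read split described above can be sketched roughly as follows. This is a std-only illustration, not the code from this PR: `BlobVersion` is the name used in the description, while `BlobLayout`, `layout_for_write`, and `layout_for_read` are hypothetical names introduced here.

```rust
/// Blob version requested by the dataset configuration (`lance.blob.version`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlobVersion {
    V1,
    V2,
}

/// Hypothetical stand-in for the layout physically present in the encoded data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlobLayout {
    V1,
    V2,
}

/// Write path: the configured version alone decides which layout is produced.
pub fn layout_for_write(configured: BlobVersion) -> BlobLayout {
    match configured {
        BlobVersion::V1 => BlobLayout::V1,
        BlobVersion::V2 => BlobLayout::V2,
    }
}

/// Read path: the decoder follows what is on disk and ignores the configuration,
/// so data written under a different setting still decodes.
pub fn layout_for_read(on_disk: BlobLayout, _configured: BlobVersion) -> BlobLayout {
    on_disk
}
```

With this shape, a dataset configured for v2 that still contains v1-encoded data is read with the v1 decoder, because `layout_for_read` never looks at the configured version.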

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

github-actions · 2026-01-19T15:15:34Z

PR Review

Summary

This PR ensures that when blob v2 is enabled (file version >= 2.2), all blob-marked fields use the v2 encoder regardless of their input type (Binary/LargeBinary or Struct). The key addition is a normalize_blob_v2_input function that converts various input formats to the normalized v2 struct format.

Critical Issues (P0)

1. Potential panic from unchecked cast operations
In normalize_blob_v2_input, the code calls .as_binary::<i64>() and .as_string::<i32>() without checking if the actual array type matches. If the struct has a data field that is not LargeBinary or a uri field that is not Utf8, these will panic:

// blob.rs:455-457
let data_col = data_col.as_binary::<i64>();
let uri_col = uri_col.as_string::<i32>();

Consider using as_binary_opt and as_string_opt with proper error handling, similar to the pattern used in the Binary/LargeBinary branch.
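
A minimal, std-only sketch of the checked-downcast pattern suggested above. `LargeBinaryColumn`, `Utf8Column`, and `data_column` are hypothetical stand-ins, not arrow-rs types; in the real code the equivalent would be handling the `Option` returned by `as_binary_opt` / `as_string_opt` and returning an error on `None` instead of panicking.

```rust
use std::any::Any;

// Hypothetical stand-ins for Arrow array types (assumptions for illustration).
pub struct LargeBinaryColumn(pub Vec<Vec<u8>>);
pub struct Utf8Column(pub Vec<String>);

/// Checked downcast: a mismatched column type becomes an error, not a panic.
pub fn data_column(col: &dyn Any) -> Result<&LargeBinaryColumn, String> {
    col.downcast_ref::<LargeBinaryColumn>()
        .ok_or_else(|| "blob struct field `data` must be LargeBinary".to_string())
}
```

Passing a `Utf8Column` where the `data` field is expected yields `Err(..)`, which the caller can surface as an invalid-input error.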

Moderate Issues (P1)

2. Missing unit tests for normalize_blob_v2_input
The new normalize_blob_v2_input function handles several conversion paths (6-field struct passthrough, 2-field struct conversion, Binary, LargeBinary) but has no dedicated unit tests. Only the integration test in take.rs exercises one path. Add unit tests to cover:

Invalid struct with wrong number of fields
Struct with missing required fields
Binary vs LargeBinary input normalization
Null handling in various paths

3. Test assertion may be fragile
The test test_take_blob_v2_from_legacy_large_binary_on_v2_2 asserts exact field names and positions:

assert_eq!(struct_arr.fields()[0].name(), "kind");
assert_eq!(struct_arr.fields()[1].name(), "position");

Consider using column_by_name for field access instead of positional indexing, matching the pattern used in the encoder itself.

codecov · 2026-01-21T11:01:04Z

Codecov Report

❌ Patch coverage is 65.90563% with 224 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-encoding/src/encodings/logical/blob.rs	28.97%	150 Missing and 2 partials ⚠️
rust/lance-encoding/src/encoder.rs	65.00%	20 Missing and 1 partial ⚠️
rust/lance/src/dataset/blob.rs	87.50%	13 Missing and 2 partials ⚠️
rust/lance-file/src/writer.rs	60.00%	12 Missing ⚠️
rust/lance/src/dataset/write/insert.rs	85.29%	9 Missing and 1 partial ⚠️
...ust/lance-encoding/src/encodings/logical/struct.rs	0.00%	5 Missing and 1 partial ⚠️
rust/lance-arrow/src/lib.rs	28.57%	5 Missing ⚠️
rust/lance-encoding/src/testing.rs	82.35%	3 Missing ⚠️


westonpace

I think my initial question is: why would a user specify v1 or v2?

westonpace · 2026-01-22T14:55:27Z

python/python/lance/dataset.py

    data_storage_version: Optional[
        Literal["stable", "2.0", "2.1", "2.2", "next", "legacy", "0.1"]
    ] = None,
+    blob_version: Optional[Literal["v1", "v2"]] = None,


We should document this parameter. Make sure to clearly specify when a user would choose v1. Is it only if they need backwards compatibility with legacy software? Are there any good reasons a user would choose v1?

westonpace · 2026-01-22T14:56:18Z

rust/lance-arrow/src/lib.rs

 }

 fn project(struct_array: &StructArray, fields: &Fields) -> Result<StructArray> {
+    if struct_array.fields().len() != struct_array.columns().len() {


How do we get an Arrow array that is invalid like this?

westonpace · 2026-01-22T14:58:33Z

rust/lance-encoding/src/encodings/logical/blob.rs

    }
 }

+fn normalize_blob_v2_input(array: ArrayRef) -> Result<StructArray> {


Can you drop a short comment explaining what this function is doing?

fix: always use v2 encoder while blob v2 enabled

b903fb2

Xuanwo requested a review from westonpace January 19, 2026 15:14

Merge branch 'main' into luban/take-binary-panic

fa68a81

github-actions bot added the bug Something isn't working label Jan 19, 2026

Xuanwo added 2 commits January 20, 2026 20:33

Allow both v1 and v2 during reading

7103e71

Fix tests

7afd37d

Xuanwo changed the title ~~fix: always use v2 encoder while blob v2 enabled~~ refactor: align blob encode/decode that write via lance.blob.version, read via layout Jan 21, 2026

Xuanwo changed the title ~~refactor: align blob encode/decode that write via lance.blob.version, read via layout~~ refactor: align blob behavior that write via lance.blob.version, read via layout Jan 21, 2026

Xuanwo added 3 commits January 21, 2026 16:50

Fix python

234f2c2

Merge remote-tracking branch 'origin/main' into luban/take-binary-panic

ffcd455

Fix ci

e625966

github-actions bot added the python label Jan 21, 2026

Format code

31661a7

westonpace reviewed Jan 22, 2026
