feat: add batch_size_bytes option to file reader #6388
Draft
westonpace wants to merge 4 commits into lance-format:main
Conversation
Thread a new `batch_size_bytes: Option<u64>` option from `SchedulerDecoderConfig` through `create_decode_stream` into `StructuralBatchDecodeStream`. All existing call sites pass `None`, so there is no behavioral change. For legacy v2.0 files the option is ignored with a warning. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
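The threaded option might be shaped roughly as follows. This is a sketch: only the struct name and the `batch_size_bytes` field come from the commit message; the other field and the doc comments are illustrative assumptions.

```rust
/// Hypothetical shape of the config after this commit. Only the struct
/// name and `batch_size_bytes` field are from the PR description; the
/// `rows_per_batch` field here is an illustrative stand-in.
struct SchedulerDecoderConfig {
    rows_per_batch: u32,
    /// When `Some`, batches are sized to approximately this many bytes
    /// instead of a fixed row count. Legacy v2.0 readers ignore it and
    /// log a warning.
    batch_size_bytes: Option<u64>,
}

fn main() {
    // All existing call sites pass None, so behavior is unchanged.
    let cfg = SchedulerDecoderConfig {
        rows_per_batch: 1024,
        batch_size_bytes: None,
    };
    assert!(cfg.batch_size_bytes.is_none());
    println!("rows_per_batch={}", cfg.rows_per_batch);
}
```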
When `batch_size_bytes` is `Some`, compute the number of rows to drain per batch from an estimated bytes-per-row instead of using `rows_per_batch`. The estimate is computed once from the schema using `estimate_bytes_per_row()`, which is exact for fixed-width types and uses rough defaults for variable-width types. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
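The estimator itself is not shown in this thread; a self-contained sketch of the idea, using a stand-in type enum in place of an Arrow schema and assuming the 64-byte default for variable-width types, could look like:

```rust
// Hypothetical sketch of a schema-based bytes-per-row estimate. The
// real estimator walks an Arrow schema; a minimal type enum stands in
// here so the example is self-contained.
#[derive(Clone, Copy)]
enum FieldType {
    Int32,
    Int64,
    Float64,
    Utf8,   // variable-width
    Binary, // variable-width
}

/// Exact sizes for fixed-width types, a rough default for variable-width.
fn estimate_bytes_per_row(schema: &[FieldType]) -> u64 {
    schema
        .iter()
        .map(|f| match f {
            FieldType::Int32 => 4,
            FieldType::Int64 | FieldType::Float64 => 8,
            // Variable-width types have no exact answer from the schema
            // alone, so fall back to a rough per-value default.
            FieldType::Utf8 | FieldType::Binary => 64,
        })
        .sum()
}

fn main() {
    // 4 Int32 columns -> 16 bytes/row, so batch_size_bytes = 1600
    // drains 100-row batches (matching the fixed-width test below).
    let schema = [FieldType::Int32; 4];
    let est = estimate_bytes_per_row(&schema);
    println!("{}", est); // 16
    println!("{}", 1600 / est); // 100
}
```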
After each batch is decoded, measure the actual data bytes per row and feed it back so that the next `next_batch_task()` call uses the measured value instead of the schema-based estimate. This corrects for inaccurate initial estimates on variable-width data (strings, binary) where the schema default of 64 bytes may be far off. The measurement uses `batch_data_size()`, a new helper that computes the actual data contribution of a batch by walking column types and reading offsets for variable-width arrays. This avoids the over-counting from `get_array_memory_size()` which reports full shared page-buffer capacity rather than per-batch data. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
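The sizing step that consumes the fed-back measurement might reduce to something like the following. Function name and numbers are illustrative, not the PR's exact API; only the divide-budget-by-measured-bytes idea comes from the commit message.

```rust
/// Hypothetical sketch of the per-batch sizing step: divide the byte
/// budget by the current bytes-per-row figure (schema estimate at
/// first, measured value after feedback).
fn rows_for_next_batch(batch_size_bytes: u64, bytes_per_row: u64) -> u64 {
    // Always drain at least one row so the stream makes progress even
    // when a single row exceeds the byte budget.
    (batch_size_bytes / bytes_per_row.max(1)).max(1)
}

fn main() {
    // The schema estimate says 64 B/row, so the first batch drains
    // 1600 / 64 = 25 rows.
    let mut bytes_per_row: u64 = 64;
    println!("{}", rows_for_next_batch(1600, bytes_per_row)); // 25

    // Suppose the decoded batch measures 100 B/row (long strings):
    // after feedback the next batch shrinks toward the byte target.
    bytes_per_row = 100;
    println!("{}", rows_for_next_batch(1600, bytes_per_row)); // 16
}
```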
westonpace
commented
Apr 2, 2026
Comment on lines +1686 to +1696
```rust
/// Compute the actual data size (in bytes) of a record batch,
/// accounting only for the portion of buffers that belongs to the
/// batch's row range. Unlike `get_array_memory_size()`, this does
/// not over-count when arrays share a larger underlying page buffer.
fn batch_data_size(batch: &RecordBatch) -> u64 {
    batch
        .columns()
        .iter()
        .map(|c| array_data_size(c.as_ref()))
        .sum()
}
```
Member
Author
I don't like this. I'm going to make a prequel PR to address getting the size of decoded batches
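The `array_data_size` helper that `batch_data_size` calls is not shown in this hunk (and, per the comment above, may be replaced in a prequel PR). A self-contained sketch of the offset-based idea for a variable-width array, using illustrative types rather than Arrow's, might look like:

```rust
/// Hypothetical sketch of per-batch sizing for a variable-width array.
/// String data is an offsets buffer plus a shared values buffer; the
/// batch's data contribution is the offset span, not the full buffer
/// capacity. Types and field names here are illustrative.
struct StringSlice<'a> {
    offsets: &'a [i32], // len + 1 entries for this slice
    values: &'a [u8],   // shared values buffer (possibly much larger)
}

fn array_data_size(arr: &StringSlice) -> u64 {
    let n = arr.offsets.len() - 1;
    debug_assert!(arr.offsets[n] as usize <= arr.values.len());
    let offsets_bytes = (arr.offsets.len() * std::mem::size_of::<i32>()) as u64;
    // Only the byte range actually referenced by this slice's offsets.
    let data_bytes = (arr.offsets[n] - arr.offsets[0]) as u64;
    offsets_bytes + data_bytes
}

fn main() {
    // A 2-row slice referencing bytes 10..16 of a 1024-byte page buffer.
    let page = [0u8; 1024];
    let arr = StringSlice { offsets: &[10, 13, 16], values: &page };
    // 3 offsets * 4 bytes + 6 data bytes = 18, independent of the
    // 1024-byte capacity that get_array_memory_size() would report.
    println!("{}", array_data_size(&arr)); // 18
}
```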
Summary

- Add `batch_size_bytes: Option<u64>` to `SchedulerDecoderConfig` and thread it through the structural v2.1 decode path
- When set, batches are sized from a bytes-per-row estimate instead of the `batch_size` row count
- Only `StructuralBatchDecodeStream` is modified; the legacy v2.0 `BatchDecodeStream` is unchanged (logs a warning if the option is set)

Test plan

- `test_estimate_bytes_per_row`: unit test for the schema-based byte estimator
- `test_byte_sized_batches_fixed_width`: 1000 rows × 4 Int32 columns, `batch_size_bytes=1600` → 10 batches of exactly 100 rows, roundtrip verified
- `test_byte_sized_batches_none_unchanged`: `batch_size_bytes=None` still uses `rows_per_batch` (no behavioral change)
- `test_byte_sized_batches_feedback_convergence`: 100-byte strings with a 64-byte schema estimate; verifies the second/third batches converge to ~50 rows after feedback
- `cargo clippy -p lance-encoding --tests -p lance-file -- -D warnings` clean
- `cargo fmt --all -- --check` clean

Closes #6387
🤖 Generated with Claude Code