
feat: add batch_size_bytes option to file reader#6388

Draft
westonpace wants to merge 4 commits into lance-format:main from westonpace:feat/byte-sized-batches-api

Conversation

@westonpace
Member

Summary

  • Add batch_size_bytes: Option<u64> to SchedulerDecoderConfig and thread it through the structural v2.1 decode path
  • When set, compute rows-per-batch from byte estimates instead of the fixed batch_size row count
  • After each batch decodes, measure actual bytes-per-row and feed it back so subsequent batches converge toward the target byte size
  • Only the v2.1+ StructuralBatchDecodeStream is modified; legacy v2.0 BatchDecodeStream is unchanged (logs a warning if the option is set)
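The rows-per-batch computation with byte feedback described above can be sketched as follows. This is an illustrative standalone sketch, not the actual Lance implementation; the function name and the max-with-one clamp are assumptions for the example.

```rust
// Hypothetical sketch of the rows-per-batch feedback loop: given a target
// batch size in bytes and the current bytes-per-row estimate (schema-based
// at first, measured after each decoded batch), pick how many rows to drain.
fn rows_per_batch(batch_size_bytes: u64, bytes_per_row: f64) -> u64 {
    // Clamp to at least one row so a single oversized row still makes progress.
    ((batch_size_bytes as f64 / bytes_per_row).floor() as u64).max(1)
}

fn main() {
    // Schema estimate of 16 bytes/row with a 1600-byte target -> 100 rows.
    assert_eq!(rows_per_batch(1600, 16.0), 100);
    // After decoding, the measured value (say 32 bytes/row) is fed back,
    // so subsequent batches converge toward the byte target.
    assert_eq!(rows_per_batch(1600, 32.0), 50);
    // A row larger than the whole target still yields one row per batch.
    assert_eq!(rows_per_batch(1600, 4096.0), 1);
}
```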

Test plan

  • test_estimate_bytes_per_row — unit test for the schema-based byte estimator
  • test_byte_sized_batches_fixed_width — 1000 rows × 4 Int32 columns, batch_size_bytes=1600 → 10 batches of exactly 100 rows, roundtrip verified
  • test_byte_sized_batches_none_unchanged — batch_size_bytes=None still uses rows_per_batch (no behavioral change)
  • test_byte_sized_batches_feedback_convergence — 100-byte strings with 64-byte schema estimate; verifies second/third batches converge to ~50 rows after feedback
  • cargo clippy -p lance-encoding -p lance-file --tests -- -D warnings — clean
  • cargo fmt --all -- --check — clean

Closes #6387

🤖 Generated with Claude Code

westonpace and others added 4 commits April 2, 2026 06:00
Thread a new `batch_size_bytes: Option<u64>` option from
`SchedulerDecoderConfig` through `create_decode_stream` into
`StructuralBatchDecodeStream`. All existing call sites pass `None`,
so there is no behavioral change. For legacy v2.0 files the option
is ignored with a warning.

Part of lance-format#6387

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When `batch_size_bytes` is `Some`, compute the number of rows to
drain per batch from an estimated bytes-per-row instead of using
`rows_per_batch`. The estimate is computed once from the schema
using `estimate_bytes_per_row()`, which is exact for fixed-width
types and uses rough defaults for variable-width types.

Part of lance-format#6387

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
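A schema-based estimator of the kind this commit describes could look like the sketch below. The `ColType` enum and the 64-byte variable-width default are assumptions for illustration, not the actual `estimate_bytes_per_row()` in Lance.

```rust
// Illustrative schema-based bytes-per-row estimate: exact widths for
// fixed-width types, a rough default for variable-width types.
#[derive(Clone, Copy)]
enum ColType {
    Int32,
    Int64,
    Float64,
    Utf8, // variable-width: width unknown until data is decoded
}

// Assumed rough default for variable-width columns.
const VARIABLE_WIDTH_DEFAULT: u64 = 64;

fn estimate_bytes_per_row(schema: &[ColType]) -> u64 {
    schema
        .iter()
        .map(|t| match t {
            ColType::Int32 => 4,
            ColType::Int64 | ColType::Float64 => 8,
            ColType::Utf8 => VARIABLE_WIDTH_DEFAULT,
        })
        .sum()
}

fn main() {
    // Four Int32 columns -> 16 bytes/row, matching the fixed-width test:
    // batch_size_bytes=1600 / 16 = 100 rows per batch.
    assert_eq!(estimate_bytes_per_row(&[ColType::Int32; 4]), 16);
    // A string column falls back to the rough 64-byte default.
    assert_eq!(estimate_bytes_per_row(&[ColType::Utf8, ColType::Int64]), 72);
}
```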
After each batch is decoded, measure the actual data bytes per row
and feed it back so that the next `next_batch_task()` call uses the
measured value instead of the schema-based estimate. This corrects
for inaccurate initial estimates on variable-width data (strings,
binary) where the schema default of 64 bytes may be far off.

The measurement uses `batch_data_size()`, a new helper that computes
the actual data contribution of a batch by walking column types and
reading offsets for variable-width arrays. This avoids the
over-counting from `get_array_memory_size()` which reports full
shared page-buffer capacity rather than per-batch data.

Part of lance-format#6387

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the enhancement New feature or request label Apr 2, 2026
@westonpace westonpace marked this pull request as draft April 2, 2026 14:03
Comment on lines +1686 to +1696
/// Compute the actual data size (in bytes) of a record batch,
/// accounting only for the portion of buffers that belongs to the
/// batch's row range. Unlike `get_array_memory_size()`, this does
/// not over-count when arrays share a larger underlying page buffer.
fn batch_data_size(batch: &RecordBatch) -> u64 {
    batch
        .columns()
        .iter()
        .map(|c| array_data_size(c.as_ref()))
        .sum()
}
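For a variable-width column, the per-array measurement that `batch_data_size` relies on can be read off the offsets buffer rather than the shared value buffer's capacity. A minimal sketch of that idea, with a hypothetical function name and raw `i32` offsets standing in for an Arrow Utf8 array:

```rust
// Hypothetical per-slice data size for a Utf8-style array: the bytes owned
// by rows [slice_start, slice_start + slice_len) are the span between the
// bounding offsets, not the full shared page buffer.
fn utf8_slice_data_size(offsets: &[i32], slice_start: usize, slice_len: usize) -> u64 {
    // `offsets` has rows + 1 entries; adjacent entries bound each value.
    let start = offsets[slice_start] as u64;
    let end = offsets[slice_start + slice_len] as u64;
    // value bytes plus 4 bytes per offset entry covering the slice
    (end - start) + 4 * (slice_len as u64 + 1)
}

fn main() {
    // Offsets for four 100-byte strings sharing one page buffer.
    let offsets = [0, 100, 200, 300, 400];
    // A 2-row slice starting at row 1 owns 200 value bytes + 12 offset bytes.
    assert_eq!(utf8_slice_data_size(&offsets, 1, 2), 212);
    // The whole array: 400 value bytes + 20 offset bytes.
    assert_eq!(utf8_slice_data_size(&offsets, 0, 4), 420);
}
```

This is why the measured bytes-per-row stays accurate even when many small batches are decoded from one large page buffer.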
Member Author


I don't like this. I'm going to make a prequel PR to address getting the size of decoded batches

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 94.61538% with 14 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-encoding/src/decoder.rs | 95.25% | 9 Missing and 3 partials ⚠️ |
| rust/lance-file/src/reader.rs | 66.66% | 2 Missing ⚠️ |



Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Support byte-sized batch limits in file reader

1 participant