feat(datafusion): Add opt-in eager file scan planning with output partitioning by toutane · Pull Request #2671 · apache/iceberg-rust

toutane · 2026-06-18T16:24:37Z

Which issue does this PR close?

What changes are included in this PR?

This PR adds an opt-in eager scan planning path for the DataFusion integration.

When iceberg.enable_eager_scan_planning is enabled, IcebergTableProvider::scan() plans FileScanTasks during physical planning, distributes them into groups according to DataFusion's target_partitions setting, and reports the resulting output partition count as UnknownPartitioning(N). Each execute(partition) call then reads only the task group assigned to that output partition, using ArrowReaderBuilder.
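To illustrate the contract described above, here is a minimal std-only sketch. It is not the code in this PR: `Task`, `EagerPlan`, `output_partition_count`, and `tasks_for` are hypothetical names that only model "plan once, then each partition reads its own group".

```rust
/// Hypothetical stand-in for a planned file scan task; only the path is modeled.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    data_file_path: String,
}

/// Hypothetical holder for task groups that were planned once, up front.
struct EagerPlan {
    groups: Vec<Vec<Task>>,
}

impl EagerPlan {
    /// The N that a scan node would report as UnknownPartitioning(N).
    fn output_partition_count(&self) -> usize {
        self.groups.len()
    }

    /// The tasks an execute(partition) call would read.
    /// An out-of-range partition index is reported as an error.
    fn tasks_for(&self, partition: usize) -> Result<&[Task], String> {
        self.groups
            .get(partition)
            .map(|group| group.as_slice())
            .ok_or_else(|| {
                format!(
                    "partition {partition} out of range (have {})",
                    self.groups.len()
                )
            })
    }
}
```

The point of the sketch is that the partition count is fixed at planning time, so the engine can schedule N independent streams without any further metadata work at execution time.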

The default behavior is unchanged: eager planning is disabled by default, so scans keep the existing lazy single-partition planning path unless explicitly enabled.

This is a narrower split from #2298, focused on file-level eager planning and output partition count reporting. Some of the broader design discussions and review feedback happened in #2298; this PR keeps only the first scoped step and intentionally does not include hash partitioning, size-aware bin-packing, or row-group/sub-file planning.

Trade-offs:

Eager planning does catalog/metadata work during TableProvider::scan(), so it is kept opt-in.
Task grouping is round-robin and count-based, not size-aware. This is simple and deterministic, but can be imbalanced when file sizes vary.
Parallelism is at the FileScanTask level only. A table with one large file will not benefit from this change.
The scan reports UnknownPartitioning(N), not hash partitioning. This exposes the number of output partitions without claiming stronger partitioning semantics.
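The round-robin, count-based grouping and its imbalance can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `group_round_robin` is a hypothetical helper, and capping the group count at the number of tasks (with at least one group) is an assumption the description does not confirm.

```rust
/// Distribute tasks into groups round-robin by index.
/// Assumptions (not from the PR): at least one group is always produced,
/// and never more groups than there are tasks.
fn group_round_robin<T>(tasks: Vec<T>, target_partitions: usize) -> Vec<Vec<T>> {
    let n = target_partitions.max(1).min(tasks.len().max(1));
    let mut groups: Vec<Vec<T>> = (0..n).map(|_| Vec::new()).collect();
    for (i, task) in tasks.into_iter().enumerate() {
        // Task i goes to group i mod n, regardless of its size.
        groups[i % n].push(task);
    }
    groups
}
```

If the items are file sizes `[100, 1, 100, 1]` and there are two groups, both large files land in the first group and both small files in the second. Each group holds two tasks, but the byte counts are 200 versus 2, which is the imbalance the trade-off refers to.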

Follow-up work:

Add size-aware task bin-packing using file_size_in_bytes (Plan file scan task according scan file size. #128)
Add sub-file / row-group level parallelism as the longer-term direction (EPIC: Support parallel scan in iceberg-datafusion #1604)
Cache planned tasks for repeated scans with the same snapshot, projection, filter, and target partition count
Investigate stronger output partitioning declarations for safe identity-partitioned cases
Add benchmarks for multi-file tables and skewed file-size distributions
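For the size-aware follow-up, one possible direction is a greedy "largest first, least-loaded group" heuristic. The sketch below is only an assumption about how such packing could look: `group_by_size` is a hypothetical helper that works on bare byte sizes rather than on real FileScanTask values.

```rust
/// Assign file sizes to `target_partitions` groups: largest file first,
/// each one placed into the group with the smallest running byte total.
/// A `target_partitions` of 0 is treated as 1.
fn group_by_size(mut sizes: Vec<u64>, target_partitions: usize) -> Vec<Vec<u64>> {
    let n = target_partitions.max(1);
    // Sort descending so the largest files are placed first.
    sizes.sort_unstable_by(|a, b| b.cmp(a));
    let mut groups: Vec<Vec<u64>> = vec![Vec::new(); n];
    let mut totals = vec![0u64; n];
    for size in sizes {
        // Index of the least-loaded group; ties go to the lowest index.
        let idx = (0..n).min_by_key(|&i| totals[i]).unwrap();
        groups[idx].push(size);
        totals[idx] += size;
    }
    groups
}
```

With sizes `[100, 1, 100, 1]` and two groups, this yields one large and one small file per group (101 bytes each), where count-based round-robin would place both large files together.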

Are these changes tested?

Yes. Integration tests cover:

eager scan planning disabled by default
enabling eager scan planning through iceberg.enable_eager_scan_planning
exposing multiple scan output partitions when eager planning is enabled
preserving query results between single-partition and multi-partition scans

toutane added 2 commits June 18, 2026 17:43

feat(datafusion): add opt-in eager scan planning

ae94746

test(datafusion): cover lazy and eager scan planning

76dd438

toutane marked this pull request as ready for review June 19, 2026 08:13

toutane mentioned this pull request Jun 19, 2026

Enable parallel file-level scanning for IcebergTableScan Datafusion Integration #2220

Open

toutane wants to merge 2 commits into apache:main from toutane:datafusion-eager-file-scan-planning
