feat(datafusion): Add opt-in eager file scan planning with output partitioning#2671

Open
toutane wants to merge 2 commits into apache:main from toutane:datafusion-eager-file-scan-planning

@toutane toutane commented Jun 18, 2026

Contributor

Which issue does this PR close?

What changes are included in this PR?

This PR adds an opt-in eager scan planning path for the DataFusion integration.

When iceberg.enable_eager_scan_planning is enabled, IcebergTableProvider::scan() plans FileScanTasks during physical planning, distributes them into groups based on DataFusion's target_partitions setting, and reports the resulting output partition count as UnknownPartitioning(N). Each execute(partition) call then reads only the task group assigned to that output partition, using ArrowReaderBuilder.
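
The grouping step described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `FileTask` is a hypothetical stand-in for `FileScanTask`, `group_round_robin` and `tasks_for_partition` are made-up helper names, and capping the group count at the number of tasks is an assumption.

```rust
/// Hypothetical stand-in for iceberg's `FileScanTask`; only the path is modeled.
#[derive(Debug, Clone, PartialEq)]
pub struct FileTask {
    pub path: String,
}

/// Distribute tasks round-robin into at most `target_partitions` groups.
/// The number of groups returned plays the role of N in `UnknownPartitioning(N)`.
pub fn group_round_robin(tasks: Vec<FileTask>, target_partitions: usize) -> Vec<Vec<FileTask>> {
    // Assumption: cap the group count at the task count so no group is empty,
    // but always keep at least one group (empty only when there are no tasks).
    let n = target_partitions.max(1).min(tasks.len().max(1));
    let mut groups: Vec<Vec<FileTask>> = vec![Vec::new(); n];
    for (i, task) in tasks.into_iter().enumerate() {
        // Count-based assignment: task i goes to group i mod n.
        groups[i % n].push(task);
    }
    groups
}

/// What an `execute(partition)` call would read: only that partition's group.
/// Returns `None` for a partition index outside the reported count.
pub fn tasks_for_partition(groups: &[Vec<FileTask>], partition: usize) -> Option<&[FileTask]> {
    groups.get(partition).map(|g| g.as_slice())
}
```

With five tasks and `target_partitions = 2`, this sketch yields two groups of three and two tasks, and partition 1 reads only the second group.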

The default behavior is unchanged: eager planning is disabled by default, so scans keep the existing lazy single-partition planning path unless explicitly enabled.

This is a narrower split from #2298, focused on file-level eager planning and output partition count reporting. Some of the broader design discussions and review feedback happened in #2298; this PR keeps only the first scoped step and intentionally does not include hash partitioning, size-aware bin-packing, or row-group/sub-file planning.

Trade-offs:

  • Eager planning performs catalog and metadata work during TableProvider::scan(), at planning time rather than execution time, so it is kept opt-in.
  • Task grouping is round-robin and count-based, not size-aware. This is simple and deterministic, but can be imbalanced when file sizes vary
  • Parallelism is at the FileScanTask level only. A table with one large file will not benefit from this change
  • The scan reports UnknownPartitioning(N), not hash partitioning. This exposes the number of output partitions without claiming stronger partitioning semantics
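
To make the imbalance trade-off concrete, here is a small sketch that computes the bytes each group would receive under count-based round-robin assignment. The helper names `round_robin_bytes` and `skew` are hypothetical and the file sizes are made-up inputs; nothing here is taken from the PR's code.

```rust
/// Total bytes per group when files are assigned round-robin by index.
/// Unlike a real planner, this sketch keeps empty groups so the idle
/// partitions are visible in the result.
pub fn round_robin_bytes(file_sizes: &[u64], partitions: usize) -> Vec<u64> {
    let n = partitions.max(1);
    let mut totals = vec![0u64; n];
    for (i, size) in file_sizes.iter().enumerate() {
        totals[i % n] += size;
    }
    totals
}

/// Ratio of the heaviest group to the mean group; 1.0 means perfectly balanced.
pub fn skew(totals: &[u64]) -> f64 {
    let sum: u64 = totals.iter().sum();
    if sum == 0 || totals.is_empty() {
        return 1.0;
    }
    let max = *totals.iter().max().unwrap() as f64;
    max / (sum as f64 / totals.len() as f64)
}
```

Alternating large and small files (for example sizes 1000, 10, 1000, 10 over two partitions) puts both large files in the same group, and a single file spread over four partitions leaves three of them with no work.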

Follow-up work:

Are these changes tested?

Yes. Integration tests cover:

  • eager scan planning disabled by default
  • enabling eager scan planning through iceberg.enable_eager_scan_planning
  • exposing multiple scan output partitions when eager planning is enabled
  • preserving query results between single-partition and multi-partition scans
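
The last check, that results are preserved between single-partition and multi-partition scans, amounts to an order-insensitive comparison, since UnknownPartitioning(N) makes no promise about row order across partitions. A rough sketch of such a comparison is shown below; `same_rows` is a hypothetical helper and is not the PR's test code.

```rust
/// Returns true when the rows collected from all partitions are the same
/// multiset as the rows from the single-partition scan, ignoring order.
pub fn same_rows<T: Ord + Clone>(single: &[T], partitions: &[Vec<T>]) -> bool {
    let mut expected = single.to_vec();
    // Concatenate every partition's rows into one vector.
    let mut actual: Vec<T> = partitions.iter().flatten().cloned().collect();
    // Sort both sides so the comparison does not depend on partition order.
    expected.sort();
    actual.sort();
    expected == actual
}
```

Sorting both sides also catches dropped or duplicated rows, which a set-based comparison would miss for duplicates.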

Development

Successfully merging this pull request may close these issues.

Enable parallel file-level scanning for IcebergTableScan Datafusion Integration