
Fix repartition from dropping data when spilling #20672

Open
xanderbailey wants to merge 1 commit into apache:main from xanderbailey:xb/fix_repartition

Conversation


@xanderbailey xanderbailey commented Mar 3, 2026

Which issue does this PR close?

Rationale for this change

In non-preserve-order repartitioning mode, all input partition tasks share clones of the same SpillPoolWriter for each output partition. SpillPoolWriter used #[derive(Clone)] but its Drop implementation unconditionally set writer_dropped = true and finalized the current spill file. This meant that when the first input task finished and its clone was dropped, the SpillPoolReader would see writer_dropped = true on an empty queue and return EOF — silently discarding every batch subsequently written by the still-running input tasks.

This bug requires three conditions to trigger:

  1. Non-preserve-order repartitioning (so spill writers are cloned across input tasks)
  2. Memory pressure causing batches to spill to disk
  3. Input tasks finishing at different times (the common case with varying partition sizes)

What changes are included in this PR?

datafusion/physical-plan/src/spill/spill_pool.rs:

  • Added active_writer_count: usize to SpillPoolShared to track the number of live writer clones.
  • Replaced #[derive(Clone)] on SpillPoolWriter with a manual Clone impl that increments active_writer_count under the shared lock.
  • Updated Drop to decrement active_writer_count and only finalize the current file / set writer_dropped = true when the count reaches zero (i.e. the last clone is dropped). Non-last clones now return immediately from Drop.
  • Added regression test test_clone_drop_does_not_signal_eof_prematurely that reproduces the exact failure: writer1 writes and drops, the reader drains the queue, then writer2 (still alive) writes. Without the fix the reader returns premature EOF and the assertion fails; with the fix the reader waits and reads both batches.

Are these changes tested?

Yes. A new unit test (test_clone_drop_does_not_signal_eof_prematurely) directly reproduces the bug. It was verified to fail without the fix and pass with the fix.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-plan (Changes to the physical-plan crate) label Mar 3, 2026
.await
.expect("Reader timed out — should not hang");

assert!(
xanderbailey (Contributor, Author) commented on this line:

Without this fix we fail here.

hareshkh (Contributor) left a comment:

LGTM 🚀


Labels

physical-plan: Changes to the physical-plan crate

Development

Successfully merging this pull request may close these issues.

Repartition drops data when spilling

2 participants