
fix: add EmptySchemaShufflePartitioner and test from #3858 #3893

Open
mbutrovich wants to merge 5 commits into apache:main from mbutrovich:empty_schema_partitioner

Conversation

@mbutrovich
Contributor

@mbutrovich mbutrovich commented Apr 3, 2026

Which issue does this PR close?

Closes #3846.

Rationale for this change

Native shuffle above a native scan that projects no columns (e.g., COUNT(*)) produces RecordBatches with an empty schema but a valid row count. Native shuffle currently panics trying to interleave those batches, but we can fast-path this scenario with a special partitioner. It is similar to SinglePartitionShufflePartitioner, but instead of concatenating batches into a shuffle file for a single partition, it accumulates the total row count and writes a single IPC batch carrying that count, while making sure the index file still has the expected number of partitions.
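The fast path described above can be sketched as follows. This is a minimal stand-in, not the actual Comet implementation: the type name and the plain row-count argument are illustrative, since the real partitioner receives Arrow RecordBatches.

```rust
// Hypothetical sketch: for empty-schema batches the only state worth
// keeping is the running row count, since the batches carry no columns.
struct EmptySchemaShuffleSketch {
    num_rows: usize,
}

impl EmptySchemaShuffleSketch {
    fn new() -> Self {
        Self { num_rows: 0 }
    }

    // Each incoming empty-schema batch contributes only its row count.
    fn insert_batch(&mut self, batch_num_rows: usize) {
        self.num_rows += batch_num_rows;
    }

    // On shutdown, a single zero-column IPC batch with this many rows
    // would be written to partition 0 of the shuffle data file.
    fn total_rows(&self) -> usize {
        self.num_rows
    }
}
```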

What changes are included in this PR?

  • native/shuffle/src/partitioners/empty_schema.rs: new EmptySchemaShufflePartitioner that accumulates row count, writes a single zero-column IPC batch to partition 0, and fills the index with equal offsets for all other partitions
  • native/shuffle/src/partitioners/mod.rs: exports the new partitioner
  • native/shuffle/src/shuffle_writer.rs: branches on schema.fields().is_empty() before falling through to MultiPartitionShuffleRepartitioner; added Rust test verifying row count roundtrip and index structure
  • spark/.../CometNativeShuffleSuite.scala: integration test from PR #3858 (chore: fix native shuffle for batches with no columns and 0 row count) for repartition(10).count() with a native DataFusion scan
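The dispatch in shuffle_writer.rs can be sketched as below. This is a hedged illustration, not the actual code: the real branch inspects the Arrow schema via `schema.fields().is_empty()`, which is modeled here as a plain field count, and the enum names are placeholders.

```rust
// Illustrative stand-ins for the three partitioner strategies.
enum PartitionerKind {
    EmptySchema,
    SinglePartition,
    MultiPartition,
}

// Sketch of the selection logic: an empty schema takes the new fast path
// before the writer falls through to the general repartitioner.
fn choose_partitioner(num_schema_fields: usize, num_output_partitions: usize) -> PartitionerKind {
    if num_schema_fields == 0 {
        // COUNT(*)-style plans: batches have rows but no columns.
        PartitionerKind::EmptySchema
    } else if num_output_partitions == 1 {
        PartitionerKind::SinglePartition
    } else {
        PartitionerKind::MultiPartition
    }
}
```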

How are these changes tested?

New test from #3858 that reflects repro in #3846.

@mbutrovich mbutrovich changed the title add EmptySchemaShufflePartitioner and test from #3858 fix: add EmptySchemaShufflePartitioner and test from #3858 Apr 3, 2026
@mbutrovich mbutrovich marked this pull request as ready for review April 3, 2026 15:13
@mbutrovich
Contributor Author

Based on grepping logs when I still had it at INFO level, these Spark SQL tests cover this codepath in addition to the unit test we added to CometNativeShuffleSuite:

  1. postgreSQL/union.sql
  2. subquery/exists-subquery/exists-orderby-limit.sql

/// This handles shuffles for operations like COUNT(*) that produce empty-schema record batches
/// but contain a valid row count. Accumulates the total row count and writes a single
/// zero-column IPC batch to partition 0. All other partitions get empty entries in the index file.
pub(crate) struct EmptySchemaShufflePartitioner {
Contributor


Would it be useful to attach a data flow graph or something similar, so readers can figure out how data transforms across shuffle phases?

Contributor Author


I'm not sure what you have in mind for this one, because this partitioner targets a very narrow class of queries. I think there are other resources to read about general Spark shuffle behavior.

#[async_trait::async_trait]
impl ShufflePartitioner for EmptySchemaShufflePartitioner {
async fn insert_batch(&mut self, batch: RecordBatch) -> datafusion::common::Result<()> {
let start_time = Instant::now();
Contributor


I'm starting to wonder if we need to wrap timings in macros and make them optional 🤔

Contributor Author


Timers have cost, but in the grand scheme of Spark jobs that last hours or days, they're not the highest priority to optimize.
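The idea floated above could look something like this sketch. The `shuffle-timing` feature name is hypothetical, as is the helper function; the point is only that `cfg!` lets the compiler drop the timing branch entirely when the feature is off.

```rust
use std::time::Instant;

// Hypothetical macro: times the body and adds the elapsed nanoseconds to a
// metric, but only when the (made-up) `shuffle-timing` feature is enabled.
macro_rules! timed {
    ($metric:expr, $body:expr) => {{
        if cfg!(feature = "shuffle-timing") {
            let start = Instant::now();
            let result = $body;
            *$metric += start.elapsed().as_nanos() as u64;
            result
        } else {
            $body
        }
    }};
}

// Illustrative caller: the work runs either way; only the timing is optional.
fn sum_with_metric(values: &[u64], metric: &mut u64) -> u64 {
    timed!(metric, values.iter().sum::<u64>())
}
```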

.map_err(|e| DataFusionError::Execution(format!("shuffle write error: {e:?}")))?;
let mut index_writer = BufWriter::new(index_file);
index_writer.write_all(&0i64.to_le_bytes())?;
for _ in 0..self.num_output_partitions {
Contributor


self.num_output_partitions? Am I right that it should be just 1 partition?

Contributor Author

@mbutrovich mbutrovich Apr 3, 2026


The shuffle writer must write index entries for all target partitions, even if we're accumulating everything into a single batch in the first partition.
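The index layout this reply describes can be sketched as below. This is a reconstruction under the stated assumption that, with all data in partition 0, every later partition boundary is an empty range ending at the data length; the function name is illustrative.

```rust
// Hedged sketch: the index file holds num_output_partitions + 1 offsets,
// each a little-endian i64. Partition 0 spans [0, data_len); every other
// partition gets an empty entry whose boundary also equals data_len.
fn encode_index(num_output_partitions: usize, data_len: i64) -> Vec<u8> {
    let mut bytes = Vec::with_capacity((num_output_partitions + 1) * 8);
    bytes.extend_from_slice(&0i64.to_le_bytes()); // partition 0 starts at 0
    for _ in 0..num_output_partitions {
        bytes.extend_from_slice(&data_len.to_le_bytes());
    }
    bytes
}
```

For repartition(10), this yields 11 offsets (88 bytes), so readers of every partition see a valid, possibly empty, byte range.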



Development

Successfully merging this pull request may close these issues.

native_shuffle crashes for repartition + count
