chore: fix native shuffle for batches with no columns and 0 row count #3858
comphead wants to merge 9 commits into apache:main
Conversation
```scala
withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
  CometConf.COMET_EXEC_SHUFFLE_WITH_ROUND_ROBIN_PARTITIONING_ENABLED.key -> "true") {
```
Is the issue specific to this combination of scan and shuffle?
`interleave_record_batch` is used in other parts of the shuffle codebase, so those may also need updating?
It looks like `native_datafusion` is used here just to easily force native shuffle.
I am confused by the comment `For zero-column batches (e.g. COUNT queries)` when the test isn't using a count.
I was able to reproduce the crash with both `native_datafusion` and `native_iceberg_compat` in combination with native shuffle. The sample query for the repro and test case is:

```scala
spark.read.parquet("hdfs://location").repartition(50).count()
```

Perhaps the test can be slightly improved if it is confusing.
```scala
val count = testDF.count()
assert(count == 1000)
// Ensure test df evaluated by Comet
checkSparkAnswerAndOperator(testDF)
```
There is no usage of `count()` here. Is this intentional?

Another way could be something like:

```scala
val testDF = spark.read.parquet(dir.toString).repartition(10)
val countDF = testDF.selectExpr("count(*) as cnt")
val count = countDF.collect().head.getLong(0)
assert(count == 1000)
checkSparkAnswerAndOperator(countDF)
```
It is intentional, yes. `count()` returns just a `Long`, so I can't really inject in the middle to check the native plan. Instead I check that at least everything before the count is native, which works for this case.
Is there something smarter we could be doing in this scenario? Maybe that's a premature optimization, but it seems a bit silly to me if we could end up writing a bunch of empty IPC batches.
Can we get a more descriptive title and PR description? "thin batches" doesn't really convey what's happening. These are batches with no columns, right?
IMO we filter them out inside the shuffle writer but before IPC. This is a valid point though; perhaps we can move this check up earlier. Checking this.
You could detect
```rust
.map_err(|e| DataFusionError::Execution(format!("shuffle write error: {e:?}")))?;
let mut output_data = BufWriter::with_capacity(self.write_buffer_size, output_data);

// Distribute rows evenly: each partition gets total/N, first (total%N) get one extra
```
How does Spark handle this?
```rust
for (i, offset) in offsets[..num_output_partitions].iter_mut().enumerate() {
    *offset = output_data.stream_position()?;
    let row_count = base + if i < remainder { 1 } else { 0 };
```
This seems like a complicated way to handle the remainder? Why not just toss the remainder in partition 0 and be done with it? This makes the logic harder to follow.
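For reference, a minimal standalone sketch of the even split the quoted code performs (the function name here is mine, not from the PR): each partition gets `total / N` rows, and the first `total % N` partitions get one extra, so row counts never differ by more than one.

```rust
// Sketch only: models how a zero-column batch's row count is split
// across output partitions, per the quoted comment in the PR.
fn split_rows(total_rows: usize, num_output_partitions: usize) -> Vec<usize> {
    let base = total_rows / num_output_partitions;
    let remainder = total_rows % num_output_partitions;
    (0..num_output_partitions)
        // The first `remainder` partitions each receive one extra row.
        .map(|i| base + if i < remainder { 1 } else { 0 })
        .collect()
}

fn main() {
    // 10 rows over 3 partitions: 10 % 3 = 1, so partition 0 gets one extra.
    assert_eq!(split_rows(10, 3), vec![4, 3, 3]);
    // Row counts are always preserved in total.
    assert_eq!(split_rows(1000, 50).iter().sum::<usize>(), 1000);
}
```

The alternative the comment suggests (all rows to partition 0) would make `split_rows(10, 3)` return `[10, 0, 0]`, which is simpler but skews one partition.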
```rust
for (i, offset) in offsets[..num_output_partitions].iter_mut().enumerate() {
    *offset = output_data.stream_position()?;
    let row_count = base + if i < remainder { 1 } else { 0 };
    if row_count > 0 {
```
The correct behavior is to not emit a batch at all if `row_count` is 0? Just confirming. We don't need a sentinel batch with a `row_count` of 0?
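A toy sketch of what "no sentinel batch" means for the offset table (this stands in for the real IPC encoding; the 8-byte payload is a placeholder, and the function name is hypothetical): an empty partition simply records the current stream position, so its range `offsets[i]..offsets[i + 1]` is zero-length and the reader yields nothing.

```rust
use std::io::{Cursor, Seek, Write};

// Sketch only: write one fake "batch" per non-empty partition and record
// per-partition start offsets, mirroring the quoted loop's structure.
fn write_partitions(counts: &[usize]) -> std::io::Result<(Vec<u64>, Vec<u8>)> {
    let mut out = Cursor::new(Vec::new());
    let mut offsets = vec![0u64; counts.len() + 1];
    for (i, &row_count) in counts.iter().enumerate() {
        offsets[i] = out.stream_position()?;
        if row_count > 0 {
            // Stand-in for encoding an IPC batch: write the row count as bytes.
            out.write_all(&(row_count as u64).to_le_bytes())?;
        }
        // row_count == 0: nothing written; offsets[i] == offsets[i + 1].
    }
    offsets[counts.len()] = out.stream_position()?;
    Ok((offsets, out.into_inner()))
}

fn main() -> std::io::Result<()> {
    let (offsets, bytes) = write_partitions(&[2, 0, 3])?;
    // Partition 1 is empty: its range [8, 8) is zero-length.
    assert_eq!(offsets, vec![0, 8, 8, 16]);
    assert_eq!(bytes.len(), 16);
    Ok(())
}
```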
```rust
let mut output_data = BufWriter::with_capacity(self.write_buffer_size, output_data);

// Distribute rows evenly: each partition gets total/N, first (total%N) get one extra
let base = total_rows / num_output_partitions;
```
I'm still a bit confused why we partition at all in this case. Why not send the whole `num_rows` count to partition 0 in one batch and leave the others empty? You'd effectively be doing the final aggregation/partition coalescing at this step, so I have no idea if that's valid for all aggregations that could yield this shuffle scenario, but it seems we're doing extra work here just to decode O(partitions) IPC batches carrying `num_rows` on the other side of the shuffle and final aggregation.
That's what I was originally thinking when we could catch this early and just write a single value.
I see now: we need to address this extreme use case with single partitioning, because the data transmitted is too small and there is no reason to spin up the entire shuffle pipeline for it. Apparently such batches can only occur in extreme aggregation cases like this and shouldn't affect other queries. Let's see if it works.
I also think this got lost.
```rust
self.metrics.baseline.record_output(num_rows);
// All rows go to partition 0: partition_starts = [0, num_rows, num_rows, ...]
// partition_row_indices = [0, 1, 2, ..., num_rows-1]
let mut scratch = std::mem::take(&mut self.scratch);
```
This still looks way more complicated than what I would expect. Why do we need scratch space and to write `num_rows` entries into `partition_row_indices`? Why are we "partitioning" rows that don't exist?
Just trying CI to see if the single-partition approach breaks anything.
It's fine; I shortened the PR. The shuffle steps for count batches are:

- `partitioning_batch` sees `num_columns() == 0`, buffers the batch, and pushes all row indices into `partition_indices[0]`, skipping hashing
- The IPC stream encodes the schema (no fields) and a single record batch message carrying just the row count in the metadata
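A pure-Rust sketch of the first step (function and variable names are simplified stand-ins, not the PR's actual signatures), matching the quoted comment `partition_starts = [0, num_rows, num_rows, ...]`: all row indices are routed to partition 0 and the remaining partitions stay empty.

```rust
// Sketch only: route every row of a zero-column batch to partition 0,
// skipping hashing entirely. Assumes num_partitions >= 1.
fn route_zero_column_batch(
    num_rows: usize,
    num_partitions: usize,
) -> (Vec<usize>, Vec<Vec<usize>>) {
    // partition_starts = [0, num_rows, num_rows, ...]: partition 0 owns all rows.
    let mut partition_starts = vec![num_rows; num_partitions + 1];
    partition_starts[0] = 0;
    // partition_indices[0] = [0, 1, ..., num_rows - 1]; the rest stay empty.
    let mut partition_indices: Vec<Vec<usize>> = vec![Vec::new(); num_partitions];
    partition_indices[0] = (0..num_rows).collect();
    (partition_starts, partition_indices)
}

fn main() {
    let (starts, indices) = route_zero_column_batch(3, 4);
    assert_eq!(starts, vec![0, 3, 3, 3, 3]);
    assert_eq!(indices[0], vec![0, 1, 2]);
    assert!(indices[1].is_empty());
}
```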
Which issue does this PR close?
Closes #3846.
Rationale for this change
What changes are included in this PR?
How are these changes tested?