[SPARK-55600][PYTHON] Fix pandas-to-Arrow conversion losing row count when schema has 0 columns on classic (#54382)
Conversation
CC @devin-petersohn: if you've got time for a quick review I'd appreciate it; I don't really know the semantics of a 0-column DataFrame.
devin-petersohn
left a comment
@Yicong-Huang Arrow batches keep pandas metadata around; would that be a better, more efficient way of keeping track of the original length?
In [3]: pa.RecordBatch.from_pandas(pd.DataFrame(index=range(5))).schema.pandas_metadata
Out[3]:
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 5,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'columns': [],
 'creator': {'library': 'pyarrow', 'version': '18.1.0'},
 'pandas_version': '2.2.3'}
Thanks @devin-petersohn for the suggestion. I tested locally and, for this edge case, using metadata is indeed much more efficient. Changes pushed.
devin-petersohn
left a comment
Thanks for the changes, LGTM
cc @ueshin
# Handle the 0-column case separately to preserve row count.
# pa.RecordBatch.from_pandas preserves num_rows via pandas index metadata.
if len(pdf.columns) == 0:
    arrow_batches = [pa.RecordBatch.from_pandas(pdf_slice) for pdf_slice in pdf_slices]
If this works more efficiently, shall we also use it in similar places, like:
spark/python/pyspark/sql/connect/session.py
Line 626 in 1e31b77
spark/python/pyspark/sql/conversion.py
Line 289 in 1e31b77
It can be in a separate PR, though.
Thanks. Let's address them in a separate PR, since this one is a bug fix.
Thanks! Merging to master.
What changes were proposed in this pull request?
This PR fixes the row count loss issue when creating a Spark DataFrame from a pandas DataFrame with 0 columns in classic.
The issue is caused by a PyArrow limitation: when creating RecordBatches or Tables with 0 columns, the row count information is lost.
Why are the changes needed?
Before this fix:
After this fix:
Does this PR introduce any user-facing change?
Yes. Creating a DataFrame from a pandas DataFrame with 0 columns now correctly preserves the row count in Classic Spark.
How was this patch tested?
Added unit test test_from_pandas_dataframe_with_zero_columns in test_creation.py that tests both the Arrow-enabled and Arrow-disabled paths.
Was this patch authored or co-authored using generative AI tooling?
No