[SPARK-55600][PYTHON] Fix pandas-to-Arrow conversion losing row count when schema has 0 columns on classic (#54382)
Conversation
CC @devin-petersohn: if you've got time for a quick review I'd appreciate it; I don't really know the semantics of a 0-column DataFrame.
devin-petersohn
left a comment
@Yicong-Huang Arrow batches keep pandas metadata around; would that be a better, more efficient way of keeping track of the original length?
In [3]: pa.RecordBatch.from_pandas(pd.DataFrame(index=range(5))).schema.pandas_metadata
Out[3]:
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 5,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'columns': [],
 'creator': {'library': 'pyarrow', 'version': '18.1.0'},
 'pandas_version': '2.2.3'}
Thanks @devin-petersohn for the suggestion. I tested locally and, for this edge case, using metadata is indeed much more efficient. Changes pushed.
devin-petersohn
left a comment
Thanks for the changes, LGTM
cc @ueshin
# Handle the 0-column case separately to preserve row count.
# pa.RecordBatch.from_pandas preserves num_rows via pandas index metadata.
if len(pdf.columns) == 0:
    arrow_batches = [pa.RecordBatch.from_pandas(pdf_slice) for pdf_slice in pdf_slices]
If this works more efficiently, shall we also use it in similar places, like:
spark/python/pyspark/sql/connect/session.py
Line 626 in 1e31b77
spark/python/pyspark/sql/conversion.py
Line 289 in 1e31b77
It can be in a separate PR, though.
Thanks. Let's address them in a separate PR, since this one is a bug fix.
Thanks! Merging to master.
What changes were proposed in this pull request?
This PR fixes the row count loss issue when creating a Spark DataFrame from a pandas DataFrame with 0 columns in classic.
The issue is caused by a PyArrow limitation: when creating RecordBatches or Tables with 0 columns, the row count information is lost.
Why are the changes needed?
Before this fix:
After this fix:
Does this PR introduce any user-facing change?
Yes. Creating a DataFrame from a pandas DataFrame with 0 columns now correctly preserves the row count in Classic Spark.
How was this patch tested?
Added unit test test_from_pandas_dataframe_with_zero_columns in test_creation.py that tests both the Arrow-enabled and Arrow-disabled paths.
Was this patch authored or co-authored using generative AI tooling?
No