[SPARK-55674][PYTHON] Optimize 0-column table conversion in Spark Connect#54468

Closed
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-55674/followup/unify-zero-column-pandas-arrow-fix

Conversation

@Yicong-Huang
Contributor

@Yicong-Huang Yicong-Huang commented Feb 25, 2026

What changes were proposed in this pull request?

Replace pa.Table.from_struct_array(pa.array([{}] * len(data), type=pa.struct([]))) with pa.Table.from_batches([pa.RecordBatch.from_pandas(data)]) in connect/session.py when handling 0-column pandas DataFrames. The new call is an O(1) operation, regardless of the number of rows.

Why are the changes needed?

The original approach builds a Python list with len(data) entries ([{}] * len(data)) that pa.array must then iterate over, which is O(n). pa.RecordBatch.from_pandas is O(1) in the number of rows, as it reads the row count directly from pandas index metadata without allocating per-row Python objects.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Yicong-Huang changed the title from "[SPARK-55674][PYTHON] Use pa.RecordBatch.from_pandas for 0-column table in Spark Connect session" to "[SPARK-55674][PYTHON] Optimize 0-column table conversion in Spark Connect session" on Feb 25, 2026
Yicong-Huang changed the title from "[SPARK-55674][PYTHON] Optimize 0-column table conversion in Spark Connect session" to "[SPARK-55674][PYTHON] Optimize 0-column table conversion in Spark Connect" on Feb 25, 2026
@ueshin
Member

ueshin commented Feb 25, 2026

Can't we apply this to

return pa.RecordBatch.from_struct_array(pa.array([{}] * len(data), arrow_type))

?

@Yicong-Huang
Contributor Author

Can't we apply this to ?

Not for this case. Here "data" is empty but the schema is non-empty, so we cannot convert data directly and keep its information: the resulting columns would not match the schema.

Member

@ueshin ueshin left a comment

LGTM, pending tests.

@HyukjinKwon
Member

Merged to master.
