[SPARK-54891] Unified type coercion for Arrow-backed Python UDFs #53666
base: master
Conversation
This commit implements the STRICT policy as a no-op that lets Arrow handle type conversion natively, while the PERMISSIVE/WARN policies implement pickle-compatible coercion behavior.

Changes:
- worker.py: add conditional logic so STRICT skips coercion entirely
- types.py: update all coerce() methods to return the value unchanged for STRICT
- test_coercion.py: update unit tests to verify the STRICT no-op behavior
- test_arrow_udf_coercion.py: add integration tests comparing policies

The integration tests verify:
- PERMISSIVE matches pickle behavior exactly
- WARN produces the same results as PERMISSIVE
- STRICT produces different results (Arrow's aggressive conversion)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
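A rough sketch of the dispatch this commit describes. The names CoercionPolicy and the coerce helper follow the PR description, but the actual signatures in worker.py and types.py may differ, and the PERMISSIVE branch below is only a placeholder for the pickle-compatible rules:

```python
from enum import Enum

class CoercionPolicy(Enum):
    PERMISSIVE = "PERMISSIVE"
    WARN = "WARN"
    STRICT = "STRICT"

def coerce_integer(value, policy):
    # STRICT: no-op, return the value unchanged and let Arrow convert natively.
    if policy is CoercionPolicy.STRICT:
        return value
    # PERMISSIVE / WARN: apply pickle-compatible coercion. The real rules live
    # in the DataType.coerce() methods; int() here is only an illustration.
    if value is None:
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```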
JIRA Issue Information: Sub-task SPARK-54891 (this comment was automatically generated by GitHub Actions)
zhengruifeng
left a comment
@fangchenli thanks for working on this.
There are three known, different behaviors of Python UDFs:
1. vanilla Python UDF, based on pickle;
2. Arrow-optimized Python UDF (with legacy pandas conversion), via useArrow=True or spark.sql.execution.pythonUDF.arrow.enabled=True;
3. Arrow-optimized Python UDF (without legacy pandas conversion), i.e. 2 plus spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled=False.
It seems this PR targets 2. I personally think it is a good idea if we can eliminate the behavior differences in both 1 vs 2 and 1 vs 3.
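For reference, a minimal sketch of how the three modes above are toggled, assuming a running SparkSession; the config names are the ones quoted in the comment:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# 1. Vanilla pickle-based Python UDF: the default, no extra config needed.
plus_one = udf(lambda x: x + 1, "long")

# 2. Arrow-optimized Python UDF (with legacy pandas conversion), either per UDF...
plus_one_arrow = udf(lambda x: x + 1, "long", useArrow=True)
# ...or globally:
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

# 3. Arrow-optimized Python UDF without the legacy pandas conversion:
spark.conf.set("spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled", "false")

spark.range(3).select(plus_one_arrow("id")).show()
```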
    return wrap_arrow_batch_udf_arrow(f, args_offsets, kwargs_offsets, return_type, runner_conf)

    ...

    def wrap_arrow_batch_udf_arrow(f, args_offsets, kwargs_offsets, return_type, runner_conf):
This code path is for Arrow-optimized Python UDFs without legacy pandas conversion; there is another path, wrap_arrow_batch_udf_legacy, for Python UDFs with legacy pandas conversion.
python/pyspark/worker.py (Outdated)
        (results, arrow_return_type, return_type).
        """
    -   return list(pool.map(lambda row: func(*row), get_args(*args)))
    +   return list(pool.map(lambda row: coerce_result(func(*row)), get_args(*args)))
I feel such coercion should happen in serializers.py
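To illustrate the suggestion in plain Python (not the actual PySpark internals; coerce_result and coerce_batch are hypothetical helpers standing in for the PR's coercion calls):

```python
def func(x):                      # stand-in for the user's UDF
    return x * 1.5

def coerce_result(v):             # hypothetical per-element coercion helper
    return int(v)

def coerce_batch(values):         # hypothetical batch-level coercion helper
    return [int(v) for v in values]

rows = [(1,), (2,), (3,)]

# Per-element coercion inside the mapper, as in the diff above:
# one extra Python-level call per row.
per_element = [coerce_result(func(*row)) for row in rows]

# Coercion deferred to serialization time (the reviewer's suggestion):
# collect raw results first, then coerce the whole batch once.
batched = coerce_batch([func(*row) for row in rows])

assert per_element == batched == [1, 3, 4]
```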
I have some detailed suggestions on the code itself, but I want to wait until we have a further discussion about this feature (just to avoid wasting our time).

The current implementation (or the concept itself) will introduce a non-trivial overhead for data conversion. For simple and common types like integers, bytes, and strings, we introduce multiple Python-level function calls per element, which could result in a perf regression relative to the default behavior. The implementation is not complete either, I think? We need to handle this in all container types, I assume?

I'd suggest that we discuss whether we want to implement this - whether the benefit trumps the overhead in both perf and maintenance - before we start reviewing the code itself.
Thanks for the feedback. I'll benchmark it to determine the overhead.
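A rough, standalone micro-benchmark sketch of the per-element overhead under discussion (pure Python, not tied to PySpark internals; coerce below is a placeholder for a per-element coercion call):

```python
import timeit

def coerce(v):
    # Placeholder for a per-element coercion function call.
    return v if isinstance(v, int) else int(v)

data = list(range(1_000_000))

plain = timeit.timeit(lambda: [v for v in data], number=10)
coerced = timeit.timeit(lambda: [coerce(v) for v in data], number=10)
print(f"plain:   {plain:.3f} s")
print(f"coerced: {coerced:.3f} s")
```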
What changes were proposed in this pull request?
This PR implements unified type coercion for Arrow-backed Python UDFs. It adds a CoercionPolicy option to the config and coerce() methods to the DataType classes to control type conversion behavior when Arrow optimization is enabled.

Why are the changes needed?
When Arrow optimization is enabled for Python UDFs, the type coercion behavior differs from that of pickle-based UDFs. We need to control the coercion behavior for backward compatibility and to ease migration.
Does this PR introduce any user-facing change?
Yes. A new configuration spark.sql.execution.pythonUDF.coercion.policy is added with three options:
- PERMISSIVE (default): pickle-compatible coercion
- WARN: produces the same results as PERMISSIVE
- STRICT: a no-op; Arrow handles type conversion natively
With PERMISSIVE (default), users can enable Arrow optimization without behavior changes.
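A minimal usage sketch, assuming the configuration key and policy names land as described in this PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Enable Arrow-backed Python UDFs and pick the coercion policy proposed here.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.pythonUDF.coercion.policy", "PERMISSIVE")

@udf(returnType=IntegerType())
def to_int(x):
    # With PERMISSIVE, results should be coerced the same way as with the
    # pickle-based UDF; with STRICT, Arrow's native conversion applies.
    return x * 2

spark.range(3).select(to_int("id")).show()
```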
How was this patch tested?
Both unit tests and integration tests were added.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.5