
Conversation

@fangchenli

What changes were proposed in this pull request?

This PR implements unified type coercion for Arrow-backed Python UDFs. It adds a CoercionPolicy config option and coerce() methods on the DataType classes to control type-conversion behavior when Arrow optimization is enabled.

Why are the changes needed?

When Arrow optimization is enabled for Python UDFs, the type coercion behavior differs from that of pickle-based UDFs. We need to control the coercion behavior for backward compatibility and to ease migration.

Does this PR introduce any user-facing change?

Yes. A new configuration spark.sql.execution.pythonUDF.coercion.policy is added with three options:

  • PERMISSIVE: Matches pickle behavior - returns None for most type mismatches
  • WARN: Same as PERMISSIVE but logs warnings when Arrow would behave differently
  • STRICT: Arrow handles type conversion natively (original Arrow behavior)

With PERMISSIVE (default), users can enable Arrow optimization without behavior changes.
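
For illustration, a minimal sketch of how the new config would be used (the config names come from this description; `spark` is assumed to be an active SparkSession, and the exact coercion results are whatever this PR defines):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Enable Arrow optimization and pick a coercion policy.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.pythonUDF.coercion.policy", "PERMISSIVE")

@udf(returnType=IntegerType())
def mismatched(x):
    return str(x)  # returns str for an IntegerType UDF

# Under PERMISSIVE the mismatch yields None (pickle behavior);
# under STRICT, Arrow attempts its own, more aggressive conversion.
spark.range(3).select(mismatched("id").alias("v")).show()
```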

How was this patch tested?

Both unit tests and integration tests were added.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.5

fangchenli and others added 12 commits January 2, 2026 19:00
This commit implements the STRICT policy as a no-op that lets Arrow
handle type conversion natively, while PERMISSIVE/WARN policies
implement pickle-compatible coercion behavior.

Changes:
- worker.py: Add conditional logic so STRICT skips coercion entirely
- types.py: Update all coerce() methods to return value unchanged for STRICT
- test_coercion.py: Update unit tests to verify STRICT no-op behavior
- test_arrow_udf_coercion.py: Add integration tests comparing policies

The integration tests verify:
- PERMISSIVE matches pickle behavior exactly
- WARN produces same results as PERMISSIVE
- STRICT produces different results (Arrow's aggressive conversion)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
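
To make the commit's description concrete, here is a simplified sketch of the control flow (the helper name and signature are hypothetical; the real logic lives in worker.py and the coerce() methods in types.py):

```python
import warnings

def apply_policy(value, data_type, policy):
    # Hypothetical helper, illustrative only.
    if policy == "STRICT":
        # No-op: Arrow handles type conversion natively.
        return value
    coerced = data_type.coerce(value)  # pickle-compatible coercion
    if policy == "WARN" and coerced != value:
        # Rough proxy for "Arrow would behave differently here".
        warnings.warn(
            f"Arrow would convert {value!r} differently than pickle-based coercion"
        )
    return coerced
```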
@github-actions

github-actions bot commented Jan 3, 2026

JIRA Issue Information

=== Sub-task SPARK-54891 ===
Summary: Unified type coercion for Arrow-backed Python UDFs
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@fangchenli fangchenli changed the title [WIP][SPARK-54891](https://issues.apache.org/jira/browse/SPARK-54891) Unified type coercion for Arrow-backed Python UDFs [WIP][SPARK-54891]Unified type coercion for Arrow-backed Python UDFs Jan 3, 2026
@fangchenli fangchenli changed the title [WIP][SPARK-54891]Unified type coercion for Arrow-backed Python UDFs [WIP][SPARK-54891] Unified type coercion for Arrow-backed Python UDFs Jan 3, 2026
@fangchenli fangchenli marked this pull request as ready for review January 4, 2026 01:59
@fangchenli fangchenli changed the title [WIP][SPARK-54891] Unified type coercion for Arrow-backed Python UDFs [SPARK-54891] Unified type coercion for Arrow-backed Python UDFs Jan 4, 2026
Contributor

@zhengruifeng zhengruifeng left a comment


@fangchenli thanks for working on this.
There are three known different behaviors of Python UDFs:
1. vanilla Python UDF, based on pickle;
2. arrow-optimized Python UDF (with legacy pandas conversion), enabled with useArrow=True or spark.sql.execution.pythonUDF.arrow.enabled=True;
3. arrow-optimized Python UDF (without legacy pandas conversion), enabled with the settings from 2 plus spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled=False;

It seems this PR is for 2. I personally think it is a good idea if we can eliminate the behavior differences in both 1 vs 2 and 1 vs 3.
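
For reference, a sketch of how the three modes are selected (config names taken from the comment above; `spark` is assumed to be an active SparkSession):

```python
# 1. Vanilla pickle-based Python UDF:
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "false")

# 2. Arrow-optimized Python UDF with legacy pandas conversion:
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")
spark.conf.set("spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled", "true")

# 3. Arrow-optimized Python UDF without legacy pandas conversion:
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")
spark.conf.set("spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled", "false")
```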

```python
return wrap_arrow_batch_udf_arrow(f, args_offsets, kwargs_offsets, return_type, runner_conf)


def wrap_arrow_batch_udf_arrow(f, args_offsets, kwargs_offsets, return_type, runner_conf):
```
Contributor


This code path is for the arrow-optimized Python UDF without legacy pandas conversion; there is another path, wrap_arrow_batch_udf_legacy, for the Python UDF with legacy pandas conversion.
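
A rough sketch of how that dispatch might look (the conf key and lookup are assumptions for illustration; only the two wrapper names come from the diff above):

```python
def wrap_arrow_batch_udf(f, args_offsets, kwargs_offsets, return_type, runner_conf):
    # Hypothetical dispatcher: route to the legacy path when legacy
    # pandas conversion is enabled, otherwise to the arrow path.
    legacy_key = "spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled"
    if runner_conf.get(legacy_key, "true").lower() == "true":
        return wrap_arrow_batch_udf_legacy(
            f, args_offsets, kwargs_offsets, return_type, runner_conf
        )
    return wrap_arrow_batch_udf_arrow(
        f, args_offsets, kwargs_offsets, return_type, runner_conf
    )
```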

```diff
 (results, arrow_return_type, return_type).
 """
-return list(pool.map(lambda row: func(*row), get_args(*args)))
+return list(pool.map(lambda row: coerce_result(func(*row)), get_args(*args)))
```
Contributor


I feel such coercion should happen in serializers.py

@gaogaotiantian
Contributor

I have some detailed suggestions on the code itself, but I want to wait until we have a further discussion about this feature (just to avoid wasting our time).

The current implementation (or the concept itself) will introduce non-trivial overhead for data conversion. For simple and common types like integers, bytes, and strings, we introduce multiple Python-level function calls per element, which could result in a perf regression compared to the default behavior.

The implementation does not seem complete either; we would need to handle this in all container types, I assume?

I'd suggest we discuss whether we want to implement this (whether the benefit trumps the overhead in both perf and maintenance) before we start reviewing the code itself.

@fangchenli
Author

> I'd suggest we discuss whether we want to implement this (whether the benefit trumps the overhead in both perf and maintenance) before we start reviewing the code itself.

Thanks for the feedback. I'll benchmark it to determine the overhead.
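
One way to get a first-order estimate of that overhead (a standalone micro-benchmark, not code from this PR; `coerce_int` is a stand-in for a per-element coercion check):

```python
import timeit

def identity(v):
    return v

def coerce_int(v):
    # Stand-in for a pickle-compatible per-element coercion check.
    return v if isinstance(v, int) else None

values = list(range(1_000_000))

base = timeit.timeit(lambda: [identity(v) for v in values], number=5)
coerced = timeit.timeit(lambda: [coerce_int(v) for v in values], number=5)
print(f"baseline: {base:.3f}s  with per-element coercion: {coerced:.3f}s")
```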

