[SPARK-55502][PYTHON] Unify UDF and UDTF Arrow conversion error handling #54398

Open
Yicong-Huang wants to merge 3 commits into apache:master from Yicong-Huang:SPARK-55502/refactor/eliminate-is-udtf-flag

@Yicong-Huang Yicong-Huang commented Feb 20, 2026

What changes were proposed in this pull request?

Remove the is_udtf parameter from PandasToArrowConversion.convert() and unify the error handling logic for both UDF and UDTF conversions.

Key changes:

  • Removed is_udtf: bool parameter from conversion methods
  • Unified exception handling: all conversions now use broad ArrowException catching
  • Replaced UDTF-specific UDTF_ARROW_TYPE_CAST_ERROR with generic PySparkTypeError/PySparkValueError
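The unified handling described in the list above might look roughly like the following sketch. It is not the code from this PR: the exception classes are local stand-ins that only mirror the shape of pyarrow's hierarchy, built-in `TypeError`/`ValueError` stand in for `PySparkTypeError`/`PySparkValueError`, and `convert_column` and its `to_arrow` callback are hypothetical names. The choice to route every non-type Arrow failure to the value-error branch is also an assumption.

```python
# Stand-ins for pyarrow's exception hierarchy (assumption: illustration only).
class ArrowException(Exception):
    """Stand-in for the broad Arrow base exception."""


class ArrowTypeError(TypeError, ArrowException):
    """Stand-in for an Arrow type mismatch error."""


class ArrowInvalid(ValueError, ArrowException):
    """Stand-in for an Arrow invalid-value error."""


def convert_column(name, pandas_type, arrow_type, to_arrow):
    """Run the hypothetical to_arrow() callback with one error path.

    There is no is_udtf flag: UDF and UDTF callers get the same messages.
    """
    try:
        return to_arrow()
    except ArrowException as e:  # broad catch, as described in the key changes
        prefix = (
            f"Cannot convert column '{name}' of pandas type '{pandas_type}' "
            f"to Arrow type '{arrow_type}'. "
        )
        if isinstance(e, TypeError):
            # Built-in TypeError stands in for PySparkTypeError here.
            raise TypeError(
                prefix
                + "The data type is not compatible with the specified return type. "
                + "Please verify the return type annotation matches the actual data."
            ) from e
        # Built-in ValueError stands in for PySparkValueError here.
        raise ValueError(
            prefix
            + "Please verify the data values are compatible with the "
            + "specified return type."
        ) from e
```

Chaining with `from e` keeps the original Arrow error available as `__cause__`, so the underlying detail is not lost when the message is generalized.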

Why are the changes needed?

The is_udtf flag was used to differentiate error handling between UDF and UDTF, but this created unnecessary complexity and inconsistent error messages. Unifying the logic provides:

  • Simpler, more maintainable code
  • Consistent error messages across UDF/UDTF

Does this PR introduce any user-facing change?

Yes. Users will see different error messages: the UDTF-specific message is replaced by the generic one, and the UDF message wording also changes. Functionality is otherwise unchanged.

TypeError path (e.g. int64 → struct<a: int32>)

Before (UDTF):

PySparkRuntimeError: UDTF_ARROW_TYPE_CAST_ERROR
  Cannot convert the output value of the column 'x' with type 'int64' to the specified
  return type of the column: 'struct<a: int32>'. Please check if the data types match
  and try again.

Before (UDF):

PySparkTypeError
  Exception thrown when converting pandas.Series (int64) with name 'x' to Arrow Array
  (struct<a: int32>).

After (unified):

PySparkTypeError
  Cannot convert column 'x' of pandas type 'int64' to Arrow type 'struct<a: int32>'. The
  data type is not compatible with the specified return type. Please verify the return
  type annotation matches the actual data.

ValueError path (e.g. string → double)

Before (UDTF):

PySparkRuntimeError: UDTF_ARROW_TYPE_CAST_ERROR
  Cannot convert the output value of the column 'val' with type 'object' to the
  specified return type of the column: 'double'. Please check if the data types match
  and try again.

Before (UDF):

PySparkValueError
  Exception thrown when converting pandas.Series (object) with name 'val' to Arrow Array
  (double).

After (unified):

PySparkValueError
  Cannot convert column 'val' of pandas type 'object' to Arrow type 'double'. Please
  verify the data values are compatible with the specified return type.
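Code or tests that match on the old message text would need updating after this change. As a rough illustration, the sketch below extracts the column and type names from the unified messages quoted above. The regular expression is derived only from the "After (unified)" examples in this description, and `parse_unified_error` is a hypothetical helper, not part of PySpark.

```python
import re

# Pattern based on the "After (unified)" messages quoted in this description.
UNIFIED_MSG = re.compile(
    r"Cannot convert column '(?P<column>[^']+)' "
    r"of pandas type '(?P<pandas_type>[^']+)' "
    r"to Arrow type '(?P<arrow_type>[^']+)'"
)


def parse_unified_error(message):
    """Return the column and type names from a unified message, or None."""
    # Collapse line breaks and repeated spaces so wrapped messages still match.
    flat = " ".join(message.split())
    match = UNIFIED_MSG.search(flat)
    return match.groupdict() if match else None
```

The old UDTF wording ("Cannot convert the output value of the column ...") does not match this pattern, so the helper returns `None` for it.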

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang Yicong-Huang force-pushed the SPARK-55502/refactor/eliminate-is-udtf-flag branch from 4b2718e to e250ebf Compare February 20, 2026 17:35
@allisonwang-db allisonwang-db left a comment

Can we add the before and after error messages to the PR description?

@Yicong-Huang
Contributor Author

Can we add the before and after error messages to the PR description?

Thanks for the suggestion. I've added them in the PR description.
