[WIP][SPARK-54647][PYTHON] Support User-Defined Aggregate Functions (UDAF) #53400
Yicong-Huang wants to merge 21 commits into apache:master from
Conversation
def test_udaf_mixed_with_other_agg_not_supported(self):
    """Test that mixing UDAF with other aggregate functions raises error."""

class MySum(Aggregator):
Can we add some tests for more complicated data structures? like dictionary?
added more data types!
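As a sketch of the kind of richer-type coverage the review asks about, here is an aggregator whose buffer is a dictionary (`CountByKey` is an illustrative name, not part of the PR; Spark plumbing is omitted):

```python
# Illustrative aggregator with a dict buffer: counts occurrences per value.
class CountByKey:
    def zero(self):
        return {}

    def reduce(self, buf, value):
        buf = dict(buf)  # avoid mutating the shared buffer
        buf[value] = buf.get(value, 0) + 1
        return buf

    def merge(self, b1, b2):
        out = dict(b1)
        for k, v in b2.items():
            out[k] = out.get(k, 0) + v
        return out

    def finish(self, buf):
        return buf

agg = CountByKey()
buf = agg.zero()
for v in ["a", "b", "a"]:
    buf = agg.reduce(buf, v)
print(agg.finish(agg.merge(buf, {"b": 2})))  # {'a': 2, 'b': 3}
```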
]

class Aggregator:
do we necessarily need this class?
I see UDTF doesn't need a base class.
>>> class TestUDTF:
... def eval(self, *args: Any):
... yield "hello", "world"
we could use duck typing if we don't go with inheritance. I think it is debatable, and we could offer both solutions (with or without a base class).
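The duck-typing option could look like the following sketch (`is_aggregator_like` and `NoBaseSum` are hypothetical names for illustration, not from the PR):

```python
# Hypothetical duck-typing check: accept any object that implements the
# expected aggregator methods, without requiring an Aggregator base class.
def is_aggregator_like(obj):
    """Return True if obj implements the aggregator protocol."""
    required = ("zero", "reduce", "merge", "finish")
    return all(callable(getattr(obj, name, None)) for name in required)

class NoBaseSum:  # no base class, just the right methods
    def zero(self):
        return 0

    def reduce(self, buffer, value):
        return buffer + value

    def merge(self, b1, b2):
        return b1 + b2

    def finish(self, buffer):
        return buffer

print(is_aggregator_like(NoBaseSum()))  # True
print(is_aggregator_like(object()))     # False
```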
Apply this UDAF to the given columns.

This creates a Column expression that can be used in DataFrame operations.
The actual aggregation is performed using mapInArrow and applyInArrow.
why not a dedicated physical plan?
-----
This implementation uses mapInArrow and applyInArrow internally to perform
the aggregation. The approach follows:
1. mapInArrow: Performs partial aggregation (reduce) on each partition
If we want to support partial aggregation with existing arrow UDFs, I think we should use a modified FlatMapGroupsInArrowExec with requiredChildDistribution = UnspecifiedDistribution.
* MapInArrow, Aggregate, and FlatMapGroupsInArrow operators.
*
* This implements a three-phase aggregation pattern:
* 1. Partial aggregation (MapInArrow): Applies reduce() on each partition, outputs
MapInArrowExec doesn't specify requiredChildOrdering; where does it sort the data for partial aggregation?
The Sort is now explicitly added in RewritePythonAggregatorUDAF before MapInArrow.
I think there should be a

The whole approach is based on
group_buffers[grouping_key] = agg.zero()

if value is not None:
    group_buffers[grouping_key] = agg.reduce(group_buffers[grouping_key], value)
group_buffers holds a buffer for every aggregation key within a partition, so it will cause memory issues if the key cardinality is large.
A reasonable physical plan should sort the partition by the key, and then output each group's partial aggregation result as soon as that group is finished.
it mimics HashAggregateExec, while SortAggregateExec is more stable
thanks for the suggestion. I will take a look at the different AggregateExec implementations.
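The sort-based alternative can be sketched in plain Python, assuming rows are `(key, value)` pairs and the aggregator exposes the `zero()`/`reduce()` methods used in the diff above (`partial_agg_sorted` and `SumAgg` are illustrative names):

```python
# Sketch of sort-based partial aggregation: sorting the partition by key
# lets us emit each group's partial buffer as soon as the key changes,
# instead of holding every key's buffer in a dict at once.
from itertools import groupby

def partial_agg_sorted(partition, agg):
    """Yield (key, partial_buffer) per group from a partition."""
    ordered = sorted(partition, key=lambda r: r[0])
    for key, rows in groupby(ordered, key=lambda r: r[0]):
        buf = agg.zero()
        for _, value in rows:
            if value is not None:
                buf = agg.reduce(buf, value)
        yield key, buf  # group done: emit and drop the buffer

class SumAgg:  # illustrative aggregator
    def zero(self):
        return 0

    def reduce(self, buf, v):
        return buf + v

print(list(partial_agg_sorted([("b", 2), ("a", 1), ("b", 3)], SumAgg())))
# [('a', 1), ('b', 5)]
```

Memory stays bounded by the largest single group rather than the number of distinct keys, which is the trade-off SortAggregateExec makes on the JVM side.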
@Yicong-Huang please let me help you as a reviewer for this; I implemented remote UDAFs several times for other systems previously.
@dtenedor thanks! could you please take a pass on the current implementation?
287e949 to 0abe9be
91b725d to 54e2d62
What changes were proposed in this pull request?
Add support for User-Defined Aggregate Functions (UDAF) in PySpark. Currently PySpark supports User-Defined Functions (UDF) and User-Defined Table Functions (UDTF), but lacks support for UDAF. Users need to write custom aggregation logic in Scala/Java or use less efficient workarounds.
This change adds UDAF support using a two-stage aggregation pattern with `mapInArrow` and `applyInArrow`. The basic idea is to implement aggregation (and partial aggregation) in two steps, where `func1` calls `Aggregator.reduce()` for partial aggregation within each partition, and `func2` calls `Aggregator.merge()` to combine partial results, then `Aggregator.finish()` to produce the final results.

Aligned with the Scala side, the implementation provides a Python `Aggregator` base class that users can subclass. Users can create UDAF instances using the `udaf()` function and use them with `DataFrame.agg()`.

Key changes:

- New `pyspark.sql.udaf` module with an `Aggregator` base class, a `UserDefinedAggregateFunction` wrapper, and a `udaf()` factory function
- `GroupedData.agg()` detects UDAF columns via the `_udaf_func` attribute

Why are the changes needed?
Currently PySpark lacks support for User-Defined Aggregate Functions (UDAF), which limits users' ability to express complex aggregation logic directly in Python. Users must either write custom aggregation logic in Scala/Java or use less efficient workarounds. This change adds UDAF support to complement existing UDF and UDTF support in PySpark, aligning with the Scala/Java
`Aggregator` interface in `org.apache.spark.sql.expressions.Aggregator`.

Does this PR introduce any user-facing change?
Yes. This PR adds a new feature: User-Defined Aggregate Functions (UDAF) support in PySpark. Users can now define custom aggregation logic by subclassing the `Aggregator` class and using the `udaf()` function to create UDAF instances that can be used with `DataFrame.agg()` and `GroupedData.agg()`.

Example:
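A minimal sketch of the two-stage pattern described above, with the Spark plumbing replaced by plain Python (`MyAvg` and the sample data are illustrative, not from the PR; real usage goes through `udaf()` and `DataFrame.agg()`):

```python
# Illustrative Aggregator-style class: computes an average via
# (sum, count) buffers, mirroring the zero/reduce/merge/finish contract.
class MyAvg:
    def zero(self):
        return (0.0, 0)  # (sum, count)

    def reduce(self, buf, value):
        return (buf[0] + value, buf[1] + 1)

    def merge(self, b1, b2):
        return (b1[0] + b2[0], b1[1] + b2[1])

    def finish(self, buf):
        return buf[0] / buf[1] if buf[1] else None

agg = MyAvg()
partitions = [[1.0, 2.0], [3.0, 4.0]]  # two simulated partitions

# Stage 1 (mapInArrow analogue): partial aggregation per partition.
partials = []
for part in partitions:
    buf = agg.zero()
    for v in part:
        buf = agg.reduce(buf, v)
    partials.append(buf)

# Stage 2 (applyInArrow analogue): merge partials, then finish.
final = agg.zero()
for p in partials:
    final = agg.merge(final, p)
print(agg.finish(final))  # 2.5
```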
How was this patch tested?
Added comprehensive unit tests in `python/pyspark/sql/tests/test_udaf.py` covering:

- `groupBy().agg()`
- `df.agg()` and `df.groupBy().agg()`

Was this patch authored or co-authored using generative AI tooling?
No.