Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Dec 27, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

ContainsFunc used make_scalar_function, which expands scalar arguments into arrays before calling Arrow's contains function, bypassing Arrow's built-in scalar optimization.

After optimizing for the scalar search case, performance is much better:

contains_StringArray_scalar_search
                        time:   [40.105 µs 40.273 µs 40.448 µs]
                        change: [−84.713% −84.651% −84.593%] (p = 0.00 < 0.05)
                        Performance has improved.

contains_StringViewArray_scalar_search
                        time:   [41.692 µs 41.823 µs 41.953 µs]
                        change: [−88.150% −88.103% −88.053%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

Rewrote invoke_with_args to:

  1. Handle the four cases directly (scalar/scalar, array/scalar, scalar/array, array/array)
  2. Use Arrow's Scalar wrapper when passing scalar arguments
  3. Let Arrow's contains function use its optimized op_scalar path when the second argument is a scalar

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Dec 27, 2025
@andygrove andygrove marked this pull request as draft December 27, 2025 19:09
@andygrove andygrove marked this pull request as ready for review December 27, 2025 19:12
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove the numbers make huge sense to me, what prob concerns is DF using make_scalar_function in lots of places so the same problem can be relevant for others.

Would you mind adding more details what is the exact reason and what scalar/array combinations are the most trouble?

this info might be important how DF treats bultin functions now

@andygrove
Copy link
Member Author

Thanks @andygrove the numbers make huge sense to me, what prob concerns is DF using make_scalar_function in lots of places so the same problem can be relevant for others.

Would you mind adding more details what is the exact reason and what scalar/array combinations are the most trouble?

this info might be important how DF treats bultin functions now

Basically, the make_scalar_function functions (there are three implementations, two of which are identical and one with slightly different behavior) convert scalar arguments into arrays. This adds unnecessary overhead in some cases because Arrow has specialized implementations of some kernels with fast paths for scalar arguments. For example, for contains, Arrow has pub fn contains(left: &dyn Datum, right: &dyn Datum) and Datum is implemented for arrays and scalars. The implementation has special handling for scalars.

The three implementations of make_scalar_function can be found in these files:

  • datafusion/functions/src/utils.rs
  • datafusion/functions-nested/src/utils.rs
  • datafusion/spark/src/function/functions_nested_utils.rs

@andygrove
Copy link
Member Author

Thanks @andygrove the numbers make huge sense to me, what prob concerns is DF using make_scalar_function in lots of places so the same problem can be relevant for others.
Would you mind adding more details what is the exact reason and what scalar/array combinations are the most trouble?
this info might be important how DF treats bultin functions now

Basically, the make_scalar_function functions (there are three implementations, two of which are identical and one with slightly different behavior) convert scalar arguments into arrays. This adds unnecessary overhead in some cases because Arrow has specialized implementations of some kernels with fast paths for scalar arguments. For example, for contains, Arrow has pub fn contains(left: &dyn Datum, right: &dyn Datum) and Datum is implemented for arrays and scalars. The implementation has special handling for scalars.

The three implementations of make_scalar_function can be found in these files:

* `datafusion/functions/src/utils.rs`

* `datafusion/functions-nested/src/utils.rs`

* `datafusion/spark/src/function/functions_nested_utils.rs`

The issue is that make_scalar_function expected ArrayRefs:

F: Fn(&[ArrayRef]) -> Result<ArrayRef>,                                                                                                                                                      

It would perhaps be better if it accepted ColumnarValue, which supports both arrays and scalars. This is probably a large refactor.

F: Fn(&[ColumnarValue]) -> Result<ArrayRef>,   

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants