perf: Optimize `contains` for scalar search arg #19514

andygrove · 2025-12-27T19:05:33Z

Which issue does this PR close?

Closes #.

Rationale for this change

ContainsFunc used make_scalar_function, which expands scalar arguments into arrays before calling Arrow's contains function, bypassing Arrow's built-in scalar optimization.

After optimizing for the scalar search case, performance is much better:

contains_StringArray_scalar_search
                        time:   [40.105 µs 40.273 µs 40.448 µs]
                        change: [−84.713% −84.651% −84.593%] (p = 0.00 < 0.05)
                        Performance has improved.

contains_StringViewArray_scalar_search
                        time:   [41.692 µs 41.823 µs 41.953 µs]
                        change: [−88.150% −88.103% −88.053%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

Rewrote invoke_with_args to:

Handle the four cases directly (scalar/scalar, array/scalar, scalar/array, array/array)
Use Arrow's Scalar wrapper when passing scalar arguments
Let Arrow's contains function use its optimized op_scalar path when the second argument is a scalar

Are these changes tested?

Are there any user-facing changes?

comphead

Thanks @andygrove the numbers make huge sense to me, what prob concerns is DF using make_scalar_function in lots of places so the same problem can be relevant for others.

Would you mind adding more details what is the exact reason and what scalar/array combinations are the most trouble?

this info might be important how DF treats bultin functions now

andygrove · 2025-12-28T15:59:11Z

Thanks @andygrove the numbers make huge sense to me, what prob concerns is DF using make_scalar_function in lots of places so the same problem can be relevant for others.

Would you mind adding more details what is the exact reason and what scalar/array combinations are the most trouble?

this info might be important how DF treats bultin functions now

Basically, the make_scalar_function functions (there are three implementations, two of which are identical and one with slightly different behavior) convert scalar arguments into arrays. This adds unnecessary overhead in some cases because Arrow has specialized implementations of some kernels with fast paths for scalar arguments. For example, for contains, Arrow has pub fn contains(left: &dyn Datum, right: &dyn Datum) and Datum is implemented for arrays and scalars. The implementation has special handling for scalars.

The three implementations of make_scalar_function can be found in these files:

datafusion/functions/src/utils.rs
datafusion/functions-nested/src/utils.rs
datafusion/spark/src/function/functions_nested_utils.rs

andygrove · 2025-12-28T16:23:05Z

Thanks @andygrove the numbers make huge sense to me, what prob concerns is DF using make_scalar_function in lots of places so the same problem can be relevant for others.
Would you mind adding more details what is the exact reason and what scalar/array combinations are the most trouble?
this info might be important how DF treats bultin functions now

Basically, the make_scalar_function functions (there are three implementations, two of which are identical and one with slightly different behavior) convert scalar arguments into arrays. This adds unnecessary overhead in some cases because Arrow has specialized implementations of some kernels with fast paths for scalar arguments. For example, for contains, Arrow has pub fn contains(left: &dyn Datum, right: &dyn Datum) and Datum is implemented for arrays and scalars. The implementation has special handling for scalars.

The three implementations of make_scalar_function can be found in these files:
* `datafusion/functions/src/utils.rs`

* `datafusion/functions-nested/src/utils.rs`

* `datafusion/spark/src/function/functions_nested_utils.rs`

The issue is that make_scalar_function expected ArrayRefs:

F: Fn(&[ArrayRef]) -> Result<ArrayRef>,

It would perhaps be better if it accepted ColumnarValue, which supports both arrays and scalars. This is probably a large refactor.

F: Fn(&[ColumnarValue]) -> Result<ArrayRef>,

optimize contains for scalar search arg

86f9e71

github-actions bot added the functions Changes to functions implementation label Dec 27, 2025

andygrove marked this pull request as draft December 27, 2025 19:09

andygrove marked this pull request as ready for review December 27, 2025 19:12

andygrove requested review from Dandandan, Jefffrey and comphead December 27, 2025 19:12

comphead reviewed Dec 28, 2025

View reviewed changes

Jefffrey approved these changes Dec 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: Optimize `contains` for scalar search arg #19514

perf: Optimize `contains` for scalar search arg #19514

andygrove commented Dec 27, 2025 •

edited

Loading

Uh oh!

comphead left a comment

Uh oh!

andygrove commented Dec 28, 2025

Uh oh!

andygrove commented Dec 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

perf: Optimize contains for scalar search arg #19514

Are you sure you want to change the base?

perf: Optimize contains for scalar search arg #19514

Conversation

andygrove commented Dec 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

comphead left a comment

Choose a reason for hiding this comment

Uh oh!

andygrove commented Dec 28, 2025

Uh oh!

andygrove commented Dec 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

perf: Optimize `contains` for scalar search arg #19514

perf: Optimize `contains` for scalar search arg #19514

andygrove commented Dec 27, 2025 •

edited

Loading