@andygrove
Member

@andygrove andygrove commented Dec 28, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

This PR is an alternative to #19514 that replaces the use of make_scalar_function with a new make_scalar_function_columnar that avoids expanding scalar values to arrays for each batch.

| Benchmark | Old Code | New Code | Improvement |
|---|---|---|---|
| contains_StringViewArray_scalar_strlen_8 | ~97 µs | ~34 µs | 2.8x faster |
| contains_StringViewArray_scalar_strlen_32 | ~175 µs | ~37 µs | 4.7x faster |
| contains_StringViewArray_scalar_strlen_128 | ~332 µs | ~42 µs | 7.9x faster |
| contains_StringViewArray_scalar_strlen_512 | ~371 µs | ~88 µs | 4.2x faster |
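
To illustrate why skipping the scalar-to-array expansion helps, here is a minimal std-only sketch. It is a toy model, not the code in this PR: `ToyColumnar`, `contains_expanded`, and `contains_columnar` are hypothetical names, and plain `Vec<String>` stands in for Arrow arrays.

```rust
// Toy stand-in for DataFusion's `ColumnarValue`; NOT the real type.
#[derive(Debug, Clone)]
enum ToyColumnar {
    Array(Vec<String>),
    Scalar(String),
}

/// Expanding path: a scalar needle is copied into a full-length vector first,
/// which costs one clone per row for every batch.
fn contains_expanded(haystack: &[String], needle: &ToyColumnar) -> Vec<bool> {
    let needles: Vec<String> = match needle {
        ToyColumnar::Array(values) => values.clone(),
        ToyColumnar::Scalar(value) => vec![value.clone(); haystack.len()],
    };
    haystack
        .iter()
        .zip(needles.iter())
        .map(|(h, n)| h.contains(n.as_str()))
        .collect()
}

/// Columnar path: a scalar needle stays a single value and is reused for
/// every row, so nothing is allocated for it.
fn contains_columnar(haystack: &[String], needle: &ToyColumnar) -> Vec<bool> {
    match needle {
        ToyColumnar::Array(values) => haystack
            .iter()
            .zip(values.iter())
            .map(|(h, n)| h.contains(n.as_str()))
            .collect(),
        ToyColumnar::Scalar(value) => haystack
            .iter()
            .map(|h| h.contains(value.as_str()))
            .collect(),
    }
}
```

Both functions return the same result; the difference is only in how much work is done for the scalar argument. The sketch assumes array arguments have the same length as the haystack.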

What changes are included in this PR?

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the functions label (Changes to functions implementation) Dec 28, 2025
@andygrove andygrove changed the title from "perf: Optimize contains for scalar search arg (alternate approach)" to "perf: Optimize contains for scalar search arg (using make_scalar_function_columnar)" Dec 28, 2025
@andygrove andygrove marked this pull request as ready for review December 28, 2025 17:10
Contributor

@comphead comphead left a comment

Thanks @andygrove this makes a lot of sense

  if let Some(coercion_data_type) =
-     string_coercion(args[0].data_type(), args[1].data_type()).or_else(|| {
-         binary_to_string_coercion(args[0].data_type(), args[1].data_type())
+     string_coercion(haystack.data_type(), needle.data_type()).or_else(|| {
Contributor

Another potential optimization is to resolve the coercion/data type only once, rather than for every batch.

Member Author

I took a quick look, and it didn't seem to make much difference to performance.
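
For readers following this thread, a rough std-only sketch of the difference between resolving the coerced type per batch and resolving it once. Everything here is hypothetical (`ToyType`, `toy_coercion`, and the counting helpers are not DataFusion APIs), and it only shows the shape of the idea; it says nothing about whether the change is measurably faster.

```rust
use std::cell::Cell;

// Toy data types for illustration; NOT Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyType {
    Utf8,
    Utf8View,
    Int32,
}

/// Hypothetical coercion rule that also counts how often it is called.
fn toy_coercion(lhs: ToyType, rhs: ToyType, calls: &Cell<usize>) -> Option<ToyType> {
    calls.set(calls.get() + 1);
    match (lhs, rhs) {
        (ToyType::Utf8, ToyType::Utf8) => Some(ToyType::Utf8),
        (ToyType::Utf8View, ToyType::Utf8)
        | (ToyType::Utf8, ToyType::Utf8View)
        | (ToyType::Utf8View, ToyType::Utf8View) => Some(ToyType::Utf8View),
        _ => None,
    }
}

/// Per-batch strategy: the common type is resolved again for each batch.
fn coercion_calls_per_batch(batches: usize, lhs: ToyType, rhs: ToyType) -> usize {
    let calls = Cell::new(0);
    for _ in 0..batches {
        let _coerced = toy_coercion(lhs, rhs, &calls);
        // ... evaluate the batch using `_coerced` ...
    }
    calls.get()
}

/// Hoisted strategy: the common type is resolved once and reused.
fn coercion_calls_hoisted(batches: usize, lhs: ToyType, rhs: ToyType) -> usize {
    let calls = Cell::new(0);
    let coerced = toy_coercion(lhs, rhs, &calls);
    for _ in 0..batches {
        let _cached = coerced; // `Option<ToyType>` is `Copy`, so reuse is free
        // ... evaluate the batch using `_cached` ...
    }
    calls.get()
}
```

The coercion lookup is cheap compared with scanning a batch of strings, which is consistent with the observation above that hoisting it made little difference.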

/// by using Arrow's `Datum` trait which has optimized paths for scalar arguments.
///
/// * `inner` - the function to be executed, receives `ColumnarValue` arguments directly
pub fn make_scalar_function_columnar<F>(
Contributor

I don't see what value this function adds. The point of `make_scalar_function` was to let functions be implemented purely in terms of arrays, without having to consider `ColumnarValue`s; functions that do want to handle `ColumnarValue`s can opt into a manual implementation instead. `make_scalar_function_columnar` seems to be a thin wrapper that does nothing except call the function passed in. In that case it would be better for the UDF (`contains`) to put the implementing code inside `invoke` itself (or have `invoke` call the inner function directly).

Comment on lines 100 to 103
/// Converts a `ColumnarValue` to a value that implements `Datum` for use with arrow kernels.
/// If the value is a scalar, wraps the single-element array in `Scalar` to signal to arrow
/// that this is a scalar value (enabling optimized code paths).
fn columnar_to_datum(value: &ColumnarValue) -> Result<(ArrayRef, bool)> {
Contributor

The docstring is confusing here, since this function doesn't actually wrap the value in `Scalar`.

Comment on lines +122 to +128
(false, false) => arrow_contains(haystack, needle)?,
(false, true) => arrow_contains(haystack, &Scalar::new(Arc::clone(needle)))?,
(true, false) => arrow_contains(&Scalar::new(Arc::clone(haystack)), needle)?,
(true, true) => arrow_contains(
    &Scalar::new(Arc::clone(haystack)),
    &Scalar::new(Arc::clone(needle)),
)?,
Contributor

I wonder if we could implement `Datum` on `ColumnarValue` (or at least on `ScalarValue`), so we wouldn't need to repeat this check-and-wrap logic in each function we optimize 🤔
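
A rough std-only sketch of what that could look like. All names are hypothetical (`ToyDatum`, `ToyValue`, `toy_contains`), and `ToyDatum::get` only imitates the shape of Arrow's `Datum::get`, which returns the underlying array plus an "is scalar" flag. It is not a proposal for the real implementation.

```rust
/// Toy analogue of Arrow's `Datum` trait: borrowed values plus an
/// "is scalar" flag.
trait ToyDatum {
    fn get(&self) -> (&[String], bool);
}

/// Toy stand-in for `ColumnarValue`. The scalar is stored as a one-element
/// vector, mirroring the single-element array used in this PR.
enum ToyValue {
    Array(Vec<String>),
    Scalar(Vec<String>),
}

impl ToyDatum for ToyValue {
    fn get(&self) -> (&[String], bool) {
        match self {
            ToyValue::Array(values) => (values.as_slice(), false),
            ToyValue::Scalar(values) => (values.as_slice(), true),
        }
    }
}

/// A kernel written once against the trait: the scalar/array check lives
/// here instead of in every function that calls it.
/// Assumes two array arguments have equal length.
fn toy_contains(haystack: &dyn ToyDatum, needle: &dyn ToyDatum) -> Vec<bool> {
    let (h, h_scalar) = haystack.get();
    let (n, n_scalar) = needle.get();
    let len = match (h_scalar, n_scalar) {
        (true, true) => 1,
        (true, false) => n.len(),
        _ => h.len(),
    };
    (0..len)
        .map(|i| {
            let hv = if h_scalar { &h[0] } else { &h[i] };
            let nv = if n_scalar { &n[0] } else { &n[i] };
            hv.contains(nv.as_str())
        })
        .collect()
}
```

With this shape, a caller passes its arguments straight through (`toy_contains(&haystack, &needle)`) and the four-way `match` on the scalar flags disappears from the call site.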

@andygrove andygrove changed the title from "perf: Optimize contains for scalar search arg (using make_scalar_function_columnar)" to "perf: Optimize contains for scalar search arg" Dec 29, 2025
Member

@rluvaton rluvaton left a comment

LGTM

@rluvaton rluvaton added the performance label (Make DataFusion faster) Dec 29, 2025

Labels

functions (Changes to functions implementation), performance (Make DataFusion faster)

4 participants