-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: Optimize contains for scalar search arg
#19529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
contains for scalar search arg (using make_scalar_function_columnar)
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove this makes a lot of sense
| if let Some(coercion_data_type) = | ||
| string_coercion(args[0].data_type(), args[1].data_type()).or_else(|| { | ||
| binary_to_string_coercion(args[0].data_type(), args[1].data_type()) | ||
| string_coercion(haystack.data_type(), needle.data_type()).or_else(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another potential optimizations is to call coercion/datatype stuff only once, rather than per every batch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick look, and it didn't seem to make much difference to performance.
… into faster-contains-alt
datafusion/functions/src/utils.rs
Outdated
| /// by using Arrow's `Datum` trait which has optimized paths for scalar arguments. | ||
| /// | ||
| /// * `inner` - the function to be executed, receives `ColumnarValue` arguments directly | ||
| pub fn make_scalar_function_columnar<F>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a value add to this function; the point of make_scalar_function was to make it simpler for functions to implement based on only arrays without considering columnarvalues; that way they can opt into a manual implementation that does take into account columnarvalues. make_scalar_function_columnar here just seems to be a thin wrapper that doesn't do anything except call the passed in function? In which case it would be better for the UDF (contains) to just put the implementing code inside invoke itself (or have invoke call this passed in inner function)
| /// Converts a `ColumnarValue` to a value that implements `Datum` for use with arrow kernels. | ||
| /// If the value is a scalar, wraps the single-element array in `Scalar` to signal to arrow | ||
| /// that this is a scalar value (enabling optimized code paths). | ||
| fn columnar_to_datum(value: &ColumnarValue) -> Result<(ArrayRef, bool)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring confusing here since we aren't doing the wrap in Scalar
| (false, false) => arrow_contains(haystack, needle)?, | ||
| (false, true) => arrow_contains(haystack, &Scalar::new(Arc::clone(needle)))?, | ||
| (true, false) => arrow_contains(&Scalar::new(Arc::clone(haystack)), needle)?, | ||
| (true, true) => arrow_contains( | ||
| &Scalar::new(Arc::clone(haystack)), | ||
| &Scalar::new(Arc::clone(needle)), | ||
| )?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could implement Datum on ColumnerValue (or at least on ScalarValue), so we wouldn't need to do this check & wrapping logic in each function we optimize 🤔
contains for scalar search arg (using make_scalar_function_columnar)contains for scalar search arg
rluvaton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Which issue does this PR close?
Rationale for this change
This PR is an alternative to #19514 that replaces the use of
make_scalar_functionwith a newmake_scalar_function_columnarthat avoids expanding scalar values to arrays for each batch.What changes are included in this PR?
Are these changes tested?
Existing tests
Are there any user-facing changes?
No