Skip to content

GH-49677 [Python][C++][Compute] Add search sorted compute kernel#49679

Open
Alex-PLACET wants to merge 8 commits intoapache:mainfrom
Alex-PLACET:add_search_sorted_compute_kernel
Open

GH-49677 [Python][C++][Compute] Add search sorted compute kernel#49679
Alex-PLACET wants to merge 8 commits intoapache:mainfrom
Alex-PLACET:add_search_sorted_compute_kernel

Conversation

@Alex-PLACET
Copy link
Copy Markdown

@Alex-PLACET Alex-PLACET commented Apr 7, 2026

Rationale for this change

Add the implemenation of the search sorted compute kernel based on the numpy function: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html

What changes are included in this PR?

Implementation of the C++ kernel + Python API.
Tests in C++ and Python

Are these changes tested?

Yes

Are there any user-facing changes?

No breaking change

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

⚠️ GitHub issue #49677 has been automatically assigned in GitHub to PR creator.

- Added a new benchmark file `vector_search_sorted_benchmark.cc` to evaluate the performance of the SearchSorted function for various data types including Int64, String, and Binary.
- Created a comprehensive test suite in `vector_search_sorted_test.cc` to validate the correctness of SearchSorted across different scenarios, including handling of null values, scalar needles, and run-end encoded arrays.
- Ensured that the benchmarks cover both left and right search options, as well as edge cases like empty arrays and arrays with leading/trailing nulls.
…lize ranges for leading/trailing null counts
@Alex-PLACET Alex-PLACET force-pushed the add_search_sorted_compute_kernel branch from 8e09ea3 to 4ec630e Compare April 8, 2026 08:15
Comment on lines +53 to +58
[
0,
0,
3,
5
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[
0,
0,
3,
5
]
[
0,
0,
3,
5
]

@Alex-PLACET just noticing that the elements here require two space chars to be removed. Similarly for the next two example outputs.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 8, 2026
Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the implementation yet, but have reviewed the tests and benchmarks (which are quite comprehensive, thank you!).

One missing item is support for chunked arrays. Besides that, see comments below :)

SearchSortedOptions(SearchSortedOptions::Left)));
ASSERT_OK_AND_ASSIGN(auto right,
SearchSorted(Datum(values), Datum(needles),
SearchSortedOptions(SearchSortedOptions::Right)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call ValidateFull() on both results?

SearchSortedOptions(SearchSortedOptions::Left)));
ASSERT_OK_AND_ASSIGN(auto right,
SearchSorted(Datum(values), Datum(needles),
SearchSortedOptions(SearchSortedOptions::Right)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call ValidateFull on both results here too?

(also I'm curious, why not reuse CheckSimpleSearchSorted?)

Comment on lines +90 to +92
std::string scalar_needle_json;
uint64_t expected_scalar_left;
uint64_t expected_scalar_right;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you could also generate the scalar needle tests automatically by calling GetScalar on the array needles and the expected results. This would make this easier to maintain later.


AssertArraysEqual(*ArrayFromJSON(uint64(), "[0, 1, 3, 4]"), *result.make_array());
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add tests for chunked arrays? (and, if not currently supported, support them :-))

return std::static_pointer_cast<BinaryArray>(builder.Finish().ValueOrDie());
}

std::shared_ptr<BinaryArray> BuildBinaryNeedles(int64_t size_bytes) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth benchmarking both string and binary, as they are expected to perform similarly.


void SetSearchSortedArgs(benchmark::internal::Benchmark* bench) {
bench->Unit(benchmark::kMicrosecond);
for (const auto size : kMemorySizes) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the largest size produces rather slow benchmark iterations, can we just keep {kL1Size, kL2Size}}?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which essentially means that all benchmarks become "quick" as per the definition of quick here :)

RunSearchSortedBenchmark(state, values, needles, side);
}

static void BM_SearchSortedInt64ScalarNeedle(benchmark::State& state,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is just one needle to search for, we're just benchmarking the function call overhead rather than any significant part of the sorted search kernel, right? Is it useful?

(benchmarks for other compute functions focus on array performance, not scalar performance)

->Apply(SetSearchSortedArgs);

// String and binary scalar cases specifically exercise the direct scalar fast path that
// avoids boxing a scalar needle into a temporary one-element array.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we wanted to benchmark this (which I doubt we do), we would only need a single benchmark IMHO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants