[Feature](function) support function array_combinations#60192
[Feature](function) support function array_combinations#60192daju233 wants to merge 2 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
8317b21 to
f9ff468
Compare
| return comb; | ||
| } | ||
|
|
||
| bool _next_combination(std::vector<size_t>& comb, Int64 k) const { |
There was a problem hiding this comment.
what's the meaning of I, j, k? dont use those meaningless identifier
|
|
||
| Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments, | ||
| uint32_t result, size_t input_rows_count) const override { | ||
| auto left = block.get_by_position(arguments[0]).column->convert_to_full_column_if_const(); |
There was a problem hiding this comment.
dont directly use convert_to_full_column_if_const, but vector_const...
| Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments, | ||
| uint32_t result, size_t input_rows_count) const override { | ||
| auto left = block.get_by_position(arguments[0]).column->convert_to_full_column_if_const(); | ||
| auto* src_arr = assert_cast<ColumnArray*>(remove_nullable(left)->assume_mutable().get()); |
There was a problem hiding this comment.
why need assume_mutable here?
| "array_combinations first argument must be Array"); | ||
| DataType itemType = ((ArrayType) arg0Type).getItemType(); | ||
| return FunctionSignature.ret(ArrayType.of(ArrayType.of(itemType))) | ||
| .args(getArgument(0).getDataType(), getArgument(1).getDataType()); |
There was a problem hiding this comment.
what if arg1 is not number?
regression-test/suites/query_p0/sql_functions/array_functions/array_combinations.groovy
Outdated
Show resolved
Hide resolved
|
|
||
| ColumnPtr _execute_combination(const ColumnArray* nested, size_t input_rows_count, | ||
| const ColumnArray::ColumnOffsets& offsets, Int64 k) const { | ||
| const auto& data_col = nested->get_data(); |
There was a problem hiding this comment.
re-consider all your var names
|
|
||
| std::vector comb = _first_combination(k, row_len); | ||
|
|
||
| for (int i = 0; i < static_cast<size_t>(k); ++i) { |
There was a problem hiding this comment.
why put a single same for-loop outside the while-loop? maybe you need a do-while?
| size_t curr_off = in_offs[row]; | ||
| size_t row_len = curr_off - prev_off; | ||
|
|
||
| if (k <= 0 || static_cast<size_t>(k) > row_len) { |
There was a problem hiding this comment.
add reg case and make sure this behaviour is same with target system
| for (int i = 0; i < static_cast<size_t>(k); ++i) { | ||
| size_t idx = prev_off + comb[i]; | ||
| data_col.get(idx, element); | ||
| inner->get_data().insert(element); |
There was a problem hiding this comment.
directly use insert_from could get rid of Field?
|
|
||
| ColumnPtr _execute_combination(const ColumnArray* nested, size_t input_rows_count, | ||
| const ColumnArray::ColumnOffsets& offsets, Int64 k) const { | ||
| const auto& data_col = nested->get_data(); |
There was a problem hiding this comment.
you can reserve for result column
| inner->reserve(inner->size() + _combination_count(row_len, combination_length)); | ||
| outer_off += _combination_count(row_len, combination_length); | ||
| if (outer_off > MAX_COMBINATION_COUNT) { | ||
| status = Status::RuntimeError( |
| uint32_t result, size_t input_rows_count) const override { | ||
| auto array = block.get_by_position(arguments[0]).column; | ||
| ColumnPtr num = | ||
| block.get_by_position(arguments[1]).column->convert_to_full_column_if_const(); |
There was a problem hiding this comment.
dont convert_to_full_column_if_const
| const auto& offsets = | ||
| assert_cast<const ColumnArray::ColumnOffsets&>(src_arr->get_offsets_column()); | ||
| Status error = Status::OK(); | ||
| vector_const(src_arr, input_rows_count, res, offsets, combination_length, error); |
There was a problem hiding this comment.
what if both are const? framework will pass both non-const to your function. maybe you should override get_arguments_that_are_always_constant
| return comb; | ||
| } | ||
|
|
||
| bool _next_combination(std::vector<size_t>& comb, Int64 combination_length) const { |
There was a problem hiding this comment.
add a comment to explain for this function
There was a problem hiding this comment.
explain why this function could generate next comb
regression-test/suites/query_p0/sql_functions/array_functions/array_combinations.groovy
Outdated
Show resolved
Hide resolved
| Preconditions.checkArgument(children.size() == 1); | ||
| return new QuantileStateToBase64(children.get(0)); | ||
| public void checkLegalityAfterRewrite() { | ||
| super.checkLegalityBeforeTypeCoercion(); |
There was a problem hiding this comment.
sorry,it's a mistake,I will fix it.
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| suite("array_combinations") { |
There was a problem hiding this comment.
should add more testcases, about empty column, f(literal, col), null, second arg>2 ...
| inner->reserve(inner->size() + _combination_count(row_len, combination_length)); | ||
| outer_off += _combination_count(row_len, combination_length); | ||
| if (outer_off > MAX_COMBINATION_COUNT) { | ||
| return Status::RuntimeError( |
There was a problem hiding this comment.
Should I do the same thing with MAX_COMBINATION_LENGTH in line 65?
| return comb; | ||
| } | ||
|
|
||
| bool _next_combination(std::vector<size_t>& comb, Int64 combination_length) const { |
There was a problem hiding this comment.
explain why this function could generate next comb
| @@ -10,7 +10,7 @@ | |||
| 0.7 | |||
| 0.8 | |||
| 0.9 | |||
| 1.0 | |||
There was a problem hiding this comment.
dont modify these. that's because of diff of jdk here. please revert these irrelevant results change before push your code
| std::vector comb = _first_combination(combination_length, row_len); | ||
| inner->reserve(inner->size() + _combination_count(row_len, combination_length)); | ||
| outer_off += _combination_count(row_len, combination_length); |
There was a problem hiding this comment.
seems weird to check on each block. Does trino also behaves this way or if it's just a per-row check?
if it's a per-row check, can we directly cache the maximum array_length for each specific k?
There was a problem hiding this comment.
I think trino checks every time when the function is called, I now modified in the latest issue to per-row check, does that look better now?
| size_t _combination_count(size_t array_length, size_t combination_length) const { | ||
| size_t combinations = 1; | ||
|
|
||
| for (int i = 1; i <= combination_length; i++) { | ||
| combinations = combinations * (array_length - combination_length + i) / i; | ||
| } | ||
|
|
||
| return combinations; | ||
| } |
There was a problem hiding this comment.
Can we pre-calculate the max array length corresponding to each k value, rather than calculating it for each row? Like:
static constexpr size_t MAX_COMBINATION_COUNT[6] = {-1, 100000, 500, ....}
if (array_lenth > MAX_COMBINATION_COUNT[combination_length]) {
...
}There was a problem hiding this comment.
it's ok, but I think combination_count also have other usage,like resize offset. So I dont know if pre-calculate can improve overhead
There was a problem hiding this comment.
better do that before call _combination_count, to avoid overflow
|
run buildall |
TPC-H: Total hot run time: 28764 ms |
TPC-DS: Total hot run time: 183941 ms |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
p0 regression has some issues |
| super(functionParams); | ||
| @Override | ||
| public void checkLegalityBeforeTypeCoercion() { | ||
| if (!(child(1) instanceof IntegerLikeLiteral)) { |
There was a problem hiding this comment.
| if (!(child(1) instanceof IntegerLikeLiteral)) { | |
| if (!(child(1) instanceof IntegerLikeLiteral || child(1) instanceof NullLiteral)) { |
NullLiteral should be recognized and return null
There was a problem hiding this comment.
Identify from the frontend and then return an empty column directly from backend?
There was a problem hiding this comment.
We shouldn't intercept NullLiteral here. processed by be, and each row should return NULL(not empty):
presto> select combinations(ARRAY[1,2,3], null);
_col0
-------
NULL
(1 row)
| block.replace_by_position(result, std::move(null_col)); | ||
| return Status::OK(); | ||
| } | ||
|
|
There was a problem hiding this comment.
no need to do additional process here, if the function use_default_implementation_for_nulls returns true(default), it will be processed automatically
|
please rebase master to solve conflicts, then we can proceed with merging this pr~ |
There was a problem hiding this comment.
add more test for other type(datetime, timestamptz....)
| auto nullable_arr = ColumnNullable::create(std::move(inner_arr), | ||
| ColumnUInt8::create(inner_arr->size(), 0)); |
There was a problem hiding this comment.
save ineer_size before create to avoid ub may cased by consumed std::move(inner_arr) first
| get_name())); | ||
| } | ||
| std::vector comb = _first_combination(combination_length, row_len); | ||
| inner->reserve(inner->size() + combination_count); |
There was a problem hiding this comment.
| inner->reserve(inner->size() + combination_count); | |
| inner->get_data().reserve(inner->get_date().size() + combination_count * combination_length); | |
| inner->get_offsets().reserve(inner->get_offsets().size() + cominbation_cout); |
|
/review |
There was a problem hiding this comment.
Code Review Summary: Feature support function array_combinations
Overview
This PR adds an array_combinations function that generates all k-combinations of elements from an input array. The implementation spans BE (C++), FE (Java), and regression tests.
Critical Issue Found
Use-after-move bug (Undefined Behavior) in function_array_combinations.cpp:178-179. The code calls std::move(inner_arr) in the first argument of ColumnNullable::create, then accesses inner_arr->size() in the second argument. In C++, function argument evaluation order is unspecified. If the move is evaluated first, inner_arr becomes null and inner_arr->size() is a null pointer dereference (UB). This can cause crashes depending on compiler/optimization.
Minor Issues
- Signed/unsigned comparison:
int ivssize_t combination_lengthin_combination_countloop (line 100) - Missing newline at end of
array_combinations.groovytest file
What Looks Good
- FE implementation correctly enforces the second argument is a constant literal
- The combinatorial algorithm is correct (multiplicative formula for C(n,k) with exact integer division)
- Good guard limits: MAX_COMBINATION_LENGTH=5, MAX_COMBINATION_COUNT=100,000
- Comprehensive test coverage including edge cases (empty arrays, nulls, nested arrays, out-of-range values)
- Proper use of
use_default_implementation_for_nullsandget_arguments_that_are_always_constant
| auto nullable_arr = ColumnNullable::create(std::move(inner_arr), | ||
| ColumnUInt8::create(inner_arr->size(), 0)); | ||
| res = ColumnArray::create(std::move(nullable_arr), std::move(outer_offsets)); | ||
|
|
There was a problem hiding this comment.
Bug (Undefined Behavior): Use-after-move
std::move(inner_arr) is in the first argument and inner_arr->size() is in the second argument of the same function call. C++ does not guarantee evaluation order of function arguments. If the move is evaluated first, inner_arr becomes null and ->size() is a null pointer dereference.
Fix:
auto inner_size = inner_arr->size();
auto nullable_arr = ColumnNullable::create(std::move(inner_arr),
ColumnUInt8::create(inner_size, 0));| for (int i = 1; i <= combination_length; i++) { | ||
| combinations = combinations * (array_length - combination_length + i) / i; | ||
| } | ||
|
|
There was a problem hiding this comment.
Nit: Signed/unsigned comparison
Loop variable i is int (signed) but combination_length is size_t (unsigned). This triggers -Wsign-compare warnings. Consider using size_t i instead:
for (size_t i = 1; i <= combination_length; i++) {|
/review |
What problem does this PR solve?
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)