[Feature](function) support function array_combinations #60192

Open
daju233 wants to merge 2 commits into apache:master from daju233:dev-daju233

@daju233 daju233 commented Jan 23, 2026

What problem does this PR solve?

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@daju233 daju233 requested a review from zclllyybb as a code owner January 23, 2026 11:21
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@daju233 daju233 force-pushed the dev-daju233 branch 3 times, most recently from 8317b21 to f9ff468 Compare January 23, 2026 11:50
@zclllyybb zclllyybb self-assigned this Jan 23, 2026
return comb;
}

bool _next_combination(std::vector<size_t>& comb, Int64 k) const {
Contributor

What do i, j, and k mean? Don't use meaningless identifiers like these.


Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
uint32_t result, size_t input_rows_count) const override {
auto left = block.get_by_position(arguments[0]).column->convert_to_full_column_if_const();
Contributor

Don't use convert_to_full_column_if_const directly; use vector_const...

Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
uint32_t result, size_t input_rows_count) const override {
auto left = block.get_by_position(arguments[0]).column->convert_to_full_column_if_const();
auto* src_arr = assert_cast<ColumnArray*>(remove_nullable(left)->assume_mutable().get());
Contributor

Why is assume_mutable needed here?

"array_combinations first argument must be Array");
DataType itemType = ((ArrayType) arg0Type).getItemType();
return FunctionSignature.ret(ArrayType.of(ArrayType.of(itemType)))
.args(getArgument(0).getDataType(), getArgument(1).getDataType());
Contributor

What if arg1 is not a number?


ColumnPtr _execute_combination(const ColumnArray* nested, size_t input_rows_count,
const ColumnArray::ColumnOffsets& offsets, Int64 k) const {
const auto& data_col = nested->get_data();
Contributor

Reconsider all your variable names.


std::vector comb = _first_combination(k, row_len);

for (int i = 0; i < static_cast<size_t>(k); ++i) {
Contributor

Why is the same for-loop duplicated outside the while-loop? A do-while might be what you need.

size_t curr_off = in_offs[row];
size_t row_len = curr_off - prev_off;

if (k <= 0 || static_cast<size_t>(k) > row_len) {
Contributor

Add a regression case and make sure this behaviour matches the target system.

for (int i = 0; i < static_cast<size_t>(k); ++i) {
size_t idx = prev_off + comb[i];
data_col.get(idx, element);
inner->get_data().insert(element);
Contributor

Could using insert_from directly get rid of the Field?


ColumnPtr _execute_combination(const ColumnArray* nested, size_t input_rows_count,
const ColumnArray::ColumnOffsets& offsets, Int64 k) const {
const auto& data_col = nested->get_data();
Contributor

You can reserve space for the result column.

inner->reserve(inner->size() + _combination_count(row_len, combination_length));
outer_off += _combination_count(row_len, combination_length);
if (outer_off > MAX_COMBINATION_COUNT) {
status = Status::RuntimeError(
Contributor

Return the Status directly.

uint32_t result, size_t input_rows_count) const override {
auto array = block.get_by_position(arguments[0]).column;
ColumnPtr num =
block.get_by_position(arguments[1]).column->convert_to_full_column_if_const();
Contributor

Don't use convert_to_full_column_if_const.

const auto& offsets =
assert_cast<const ColumnArray::ColumnOffsets&>(src_arr->get_offsets_column());
Status error = Status::OK();
vector_const(src_arr, input_rows_count, res, offsets, combination_length, error);
Contributor

What if both are const? The framework will pass both as non-const to your function. Maybe you should override get_arguments_that_are_always_constant.

return comb;
}

bool _next_combination(std::vector<size_t>& comb, Int64 combination_length) const {
Contributor

Add a comment explaining this function.

Contributor

Explain why this function generates the next combination.
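For readers following the thread, here is a minimal, self-contained sketch of the kind of explanation being asked for. It is not the PR's code: next_combination and all_combinations are hypothetical names, and the sketch assumes the combination is stored as strictly increasing index positions. The idea is the usual lexicographic successor: find the rightmost index that can still grow, bump it, then reset everything after it to the smallest valid values.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the PR's _next_combination. `indices` holds k
// strictly increasing positions into an array of length n (requires k <= n).
// Advances to the next combination in lexicographic order; returns false
// when the current one is already the last.
bool next_combination(std::vector<std::size_t>& indices, std::size_t n) {
    const std::size_t k = indices.size();
    // Scan from the right for a slot that can still grow. Slot p may hold at
    // most n - k + p, because the k - 1 - p slots after it need larger values.
    for (std::size_t p = k; p-- > 0;) {
        if (indices[p] < n - k + p) {
            ++indices[p];
            // Reset the tail to the smallest values that keep the order strict.
            for (std::size_t q = p + 1; q < k; ++q) {
                indices[q] = indices[q - 1] + 1;
            }
            return true;
        }
    }
    return false; // every slot is at its maximum
}

// Enumerates all k-combinations of {0, ..., n-1}.
std::vector<std::vector<std::size_t>> all_combinations(std::size_t n, std::size_t k) {
    std::vector<std::vector<std::size_t>> out;
    if (k == 0 || k > n) {
        return out; // assumption made for this sketch only
    }
    std::vector<std::size_t> indices(k);
    for (std::size_t p = 0; p < k; ++p) {
        indices[p] = p; // first combination: 0, 1, ..., k-1
    }
    do {
        out.push_back(indices); // same body handles the first and later combinations
    } while (next_combination(indices, n));
    return out;
}
```

Because the loop body runs once before the first call to next_combination, the do-while form also avoids a separate copy of the emit loop for the first combination.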

Preconditions.checkArgument(children.size() == 1);
return new QuantileStateToBase64(children.get(0));
public void checkLegalityAfterRewrite() {
super.checkLegalityBeforeTypeCoercion();
Contributor

Why super?

Author

@daju233 daju233 Feb 10, 2026

Sorry, it's a mistake. I will fix it.

// specific language governing permissions and limitations
// under the License.

suite("array_combinations") {
Contributor

Add more test cases: empty column, f(literal, col), null, second arg > 2 ...

inner->reserve(inner->size() + _combination_count(row_len, combination_length));
outer_off += _combination_count(row_len, combination_length);
if (outer_off > MAX_COMBINATION_COUNT) {
return Status::RuntimeError(
Contributor

Use InvalidArgument instead.

Author

Should I do the same thing with MAX_COMBINATION_LENGTH in line 65?

Contributor

yes

@@ -10,7 +10,7 @@
0.7
0.8
0.9
1.0
Contributor

Don't modify these. The differences come from a JDK difference here. Please revert these irrelevant result changes before pushing your code.

Comment on lines +155 to +157
std::vector comb = _first_combination(combination_length, row_len);
inner->reserve(inner->size() + _combination_count(row_len, combination_length));
outer_off += _combination_count(row_len, combination_length);
Contributor

It seems odd to check this on each block. Does Trino also behave this way, or is it just a per-row check?

If it's a per-row check, can we directly cache the maximum array_length for each specific k?

Author

@daju233 daju233 Feb 25, 2026

I think Trino checks every time the function is called. I have changed it to a per-row check in the latest update; does that look better now?

Comment on lines +93 to +101
size_t _combination_count(size_t array_length, size_t combination_length) const {
size_t combinations = 1;

for (int i = 1; i <= combination_length; i++) {
combinations = combinations * (array_length - combination_length + i) / i;
}

return combinations;
}
Contributor

Can we pre-calculate the max array length corresponding to each k value, rather than calculating it for each row? Like:

static constexpr size_t MAX_COMBINATION_COUNT[6] = {-1, 100000, 500, ....}
if (array_length > MAX_COMBINATION_COUNT[combination_length]) {
    ...
}

Author

@daju233 daju233 Feb 26, 2026

That's OK, but I think combination_count also has other uses, such as resizing the offsets, so I don't know whether pre-calculating would reduce the overhead.

Contributor

Better to do that before calling _combination_count, to avoid overflow.
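One possible shape for the precomputed table discussed above is sketched below, under stated assumptions: the two limits are the values quoted later in this thread (MAX_COMBINATION_LENGTH=5, MAX_COMBINATION_COUNT=100,000), and capped_binomial, build_max_array_length and within_limit are hypothetical helper names, not functions from the PR. The binomial is cut off as soon as it exceeds the limit, so the intermediate product stays small and cannot overflow 64 bits for these limits.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::uint64_t kMaxCombinationCount = 100000; // assumed from the thread
constexpr std::size_t kMaxCombinationLength = 5;       // assumed from the thread

// C(n, k) via the multiplicative formula, returning limit + 1 as soon as the
// running value exceeds the limit. The running value after step i is
// C(n - k + i, i), which never decreases, so stopping early is safe.
std::uint64_t capped_binomial(std::uint64_t n, std::uint64_t k) {
    if (k > n) {
        return 0;
    }
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= k; ++i) {
        result = result * (n - k + i) / i; // exact: product of i consecutive integers
        if (result > kMaxCombinationCount) {
            return kMaxCombinationCount + 1;
        }
    }
    return result;
}

// For each k, the largest array length n with C(n, k) <= kMaxCombinationCount.
std::array<std::size_t, kMaxCombinationLength + 1> build_max_array_length() {
    std::array<std::size_t, kMaxCombinationLength + 1> table{}; // index 0 unused
    for (std::size_t k = 1; k <= kMaxCombinationLength; ++k) {
        std::size_t n = k;
        while (capped_binomial(n + 1, k) <= kMaxCombinationCount) {
            ++n;
        }
        table[k] = n;
    }
    return table;
}

// Per-row check: a table lookup instead of recomputing C(n, k).
bool within_limit(std::size_t array_length, std::size_t combination_length) {
    static const auto table = build_max_array_length(); // computed once
    return combination_length >= 1 && combination_length <= kMaxCombinationLength &&
           array_length <= table[combination_length];
}
```

With these two limits the table works out to 100000, 447, 85, 40 and 28 for k = 1 through 5. Running this check first means the exact count is only computed for rows already known to be in range.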

@linrrzqqq
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 28764 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d148f704a38abfcb0e9375a78bccf6e170bf6a60, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17655	4451	4290	4290
q2	q3	10646	758	511	511
q4	4677	364	258	258
q5	7545	1205	1019	1019
q6	169	173	146	146
q7	776	835	693	693
q8	9299	1406	1349	1349
q9	4834	4788	4668	4668
q10	6749	1875	1625	1625
q11	480	269	243	243
q12	721	564	462	462
q13	17748	4217	3416	3416
q14	237	242	209	209
q15	915	828	802	802
q16	726	721	662	662
q17	721	864	431	431
q18	5873	5434	5240	5240
q19	1256	973	597	597
q20	505	486	386	386
q21	4896	1905	1501	1501
q22	363	346	256	256
Total cold run time: 96791 ms
Total hot run time: 28764 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4566	4558	4542	4542
q2	q3	1792	2201	1776	1776
q4	854	1219	764	764
q5	4020	4374	4322	4322
q6	183	172	140	140
q7	1763	1668	1542	1542
q8	2471	2805	2563	2563
q9	7451	7329	7489	7329
q10	2668	2847	2394	2394
q11	500	434	448	434
q12	514	605	456	456
q13	3977	4443	3601	3601
q14	278	300	264	264
q15	851	824	838	824
q16	730	764	714	714
q17	1169	1539	1284	1284
q18	6993	6732	6721	6721
q19	887	1030	988	988
q20	2040	2142	1983	1983
q21	3879	3489	3417	3417
q22	445	433	371	371
Total cold run time: 48031 ms
Total hot run time: 46429 ms

@doris-robot

TPC-DS: Total hot run time: 183941 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d148f704a38abfcb0e9375a78bccf6e170bf6a60, data reload: false

query5	5330	650	545	545
query6	322	214	202	202
query7	4238	482	281	281
query8	341	251	230	230
query9	8710	2726	2726	2726
query10	528	383	347	347
query11	16955	17553	17262	17262
query12	200	155	134	134
query13	1671	512	380	380
query14	7172	3327	3089	3089
query14_1	2908	2920	2874	2874
query15	216	209	183	183
query16	1063	497	490	490
query17	1276	780	684	684
query18	2730	448	352	352
query19	217	207	178	178
query20	140	131	124	124
query21	217	140	115	115
query22	4890	4921	4873	4873
query23	17229	16784	16595	16595
query23_1	16652	16640	16729	16640
query24	7149	1623	1261	1261
query24_1	1226	1243	1229	1229
query25	567	493	445	445
query26	1241	257	149	149
query27	2757	469	281	281
query28	4494	1906	1857	1857
query29	786	567	470	470
query30	313	245	213	213
query31	876	741	641	641
query32	81	68	75	68
query33	522	343	269	269
query34	906	904	561	561
query35	622	668	592	592
query36	1099	1085	980	980
query37	144	107	97	97
query38	2910	2825	2864	2825
query39	881	872	849	849
query39_1	821	824	823	823
query40	231	154	136	136
query41	64	59	59	59
query42	108	111	106	106
query43	378	378	349	349
query44	
query45	200	189	180	180
query46	889	979	616	616
query47	2098	2122	2040	2040
query48	314	324	231	231
query49	620	458	417	417
query50	679	281	212	212
query51	4127	4106	4028	4028
query52	105	109	100	100
query53	291	334	279	279
query54	288	267	257	257
query55	85	80	84	80
query56	306	302	304	302
query57	1360	1366	1272	1272
query58	284	274	276	274
query59	2612	2609	2490	2490
query60	340	340	312	312
query61	153	154	156	154
query62	612	594	543	543
query63	312	277	279	277
query64	4843	1262	1001	1001
query65	
query66	1389	460	353	353
query67	16266	16411	16300	16300
query68	
query69	418	308	281	281
query70	997	991	957	957
query71	332	304	302	302
query72	2795	2636	2521	2521
query73	556	552	322	322
query74	9983	9935	9823	9823
query75	2838	2745	2475	2475
query76	2305	1028	674	674
query77	376	401	337	337
query78	11166	11465	10666	10666
query79	1375	793	603	603
query80	1397	613	537	537
query81	567	274	248	248
query82	1027	150	116	116
query83	343	259	239	239
query84	255	115	97	97
query85	929	498	445	445
query86	426	318	295	295
query87	3096	3094	2988	2988
query88	3579	2687	2664	2664
query89	430	373	344	344
query90	1944	170	173	170
query91	165	164	136	136
query92	70	78	75	75
query93	1059	852	503	503
query94	654	337	289	289
query95	583	406	318	318
query96	654	534	237	237
query97	2442	2525	2387	2387
query98	231	226	216	216
query99	1006	978	949	949
Total cold run time: 254732 ms
Total hot run time: 183941 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 6.25% (1/16) 🎉
Increment coverage report
Complete coverage report

@doris-robot

BE UT Coverage Report

Increment line coverage 6.73% (7/104) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.56% (19653/37391)
Line Coverage 36.18% (183480/507071)
Region Coverage 32.52% (142488/438134)
Branch Coverage 33.45% (61771/184672)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 93.27% (97/104) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.32% (26845/36613)
Line Coverage 56.64% (286318/505518)
Region Coverage 54.20% (239705/442271)
Branch Coverage 55.71% (103199/185236)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 93.75% (15/16) 🎉
Increment coverage report
Complete coverage report

@linrrzqqq
Contributor

p0 regression has some issues

   qt_test_null """
^^^^^^^^^^^^^^^^^^^^^^^^^^ERROR LINE^^^^^^^^^^^^^^^^^^^^^^^^^^
    select array_combinations(null, 2), array_combinations([1,2,3], null);
    """
    
    test {
        sql """select k1, array_combinations(['x','y','z'], k1) from t_array_combinations where k1 <= 3 order by k1;"""
        exception("Array_Combinations's second argument must be a constant literal.")
    }
    
    qt_test_param_3 """
    select array_combinations(['a','b','c','d'], 3), array_combinations([1,2,3,4], 3);

Exception:
java.sql.SQLException: errCode = 2, detailMessage = Array_Combinations's second argument must be a constant literal.
  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)

@daju233 daju233 closed this Mar 7, 2026
@daju233 daju233 reopened this Mar 7, 2026
super(functionParams);
@Override
public void checkLegalityBeforeTypeCoercion() {
if (!(child(1) instanceof IntegerLikeLiteral)) {
Contributor

Suggested change
if (!(child(1) instanceof IntegerLikeLiteral)) {
if (!(child(1) instanceof IntegerLikeLiteral || child(1) instanceof NullLiteral)) {

NullLiteral should be recognized and return null

Author

@daju233 daju233 Mar 7, 2026

Should it be identified in the frontend, with the backend then returning an empty column directly?

Contributor

We shouldn't intercept NullLiteral here. It should be processed by the BE, and each row should return NULL (not empty):

presto> select combinations(ARRAY[1,2,3], null);
 _col0 
-------
 NULL  
(1 row)

Author

I fixed it

block.replace_by_position(result, std::move(null_col));
return Status::OK();
}

Contributor

No additional processing is needed here. If use_default_implementation_for_nulls returns true (the default), this is handled automatically.

Author

solved it

@linrrzqqq
Contributor

please rebase master to solve conflicts, then we can proceed with merging this pr~

Contributor

Add more tests for other types (datetime, timestamptz....)

Comment on lines +178 to +179
auto nullable_arr = ColumnNullable::create(std::move(inner_arr),
ColumnUInt8::create(inner_arr->size(), 0));
Contributor

Save inner_size before the create call to avoid UB that may be caused by std::move(inner_arr) being consumed first.

get_name()));
}
std::vector comb = _first_combination(combination_length, row_len);
inner->reserve(inner->size() + combination_count);
Contributor

Suggested change
inner->reserve(inner->size() + combination_count);
inner->get_data().reserve(inner->get_data().size() + combination_count * combination_length);
inner->get_offsets().reserve(inner->get_offsets().size() + combination_count);
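The point of the suggested change is that the two buffers grow in different units: the element buffer gains combination_count * combination_length entries, while the offsets buffer gains combination_count entries. A minimal sketch with plain std::vector follows, assuming a simplified flat layout; FlatArrays and append_pairs are hypothetical and not the Doris ColumnArray API, and the combination length is fixed at 2 to keep the example short.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for an array column: all elements stored back to back,
// plus one end offset per inner array.
struct FlatArrays {
    std::vector<int> data;
    std::vector<std::size_t> offsets;
};

// Appends every 2-element combination of `row` to `out`.
void append_pairs(const std::vector<int>& row, FlatArrays& out) {
    const std::size_t n = row.size();
    const std::size_t combination_length = 2;
    const std::size_t combination_count = n < 2 ? 0 : n * (n - 1) / 2;

    // One reservation per buffer, each sized in that buffer's own unit.
    out.data.reserve(out.data.size() + combination_count * combination_length);
    out.offsets.reserve(out.offsets.size() + combination_count);

    for (std::size_t first = 0; first < n; ++first) {
        for (std::size_t second = first + 1; second < n; ++second) {
            out.data.push_back(row[first]);
            out.data.push_back(row[second]);
            out.offsets.push_back(out.data.size()); // end of this inner array
        }
    }
}
```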

@zclllyybb
Contributor

/review

Contributor

@github-actions github-actions bot left a comment

Code Review Summary: Feature support function array_combinations

Overview

This PR adds an array_combinations function that generates all k-combinations of elements from an input array. The implementation spans BE (C++), FE (Java), and regression tests.

Critical Issue Found

Use-after-move bug (Undefined Behavior) in function_array_combinations.cpp:178-179. The code calls std::move(inner_arr) in the first argument of ColumnNullable::create, then accesses inner_arr->size() in the second argument. In C++, function argument evaluation order is unspecified. If the move is evaluated first, inner_arr becomes null and inner_arr->size() is a null pointer dereference (UB). This can cause crashes depending on compiler/optimization.

Minor Issues

  • Signed/unsigned comparison: int i vs size_t combination_length in _combination_count loop (line 100)
  • Missing newline at end of array_combinations.groovy test file

What Looks Good

  • FE implementation correctly enforces the second argument is a constant literal
  • The combinatorial algorithm is correct (multiplicative formula for C(n,k) with exact integer division)
  • Good guard limits: MAX_COMBINATION_LENGTH=5, MAX_COMBINATION_COUNT=100,000
  • Comprehensive test coverage including edge cases (empty arrays, nulls, nested arrays, out-of-range values)
  • Proper use of use_default_implementation_for_nulls and get_arguments_that_are_always_constant

auto nullable_arr = ColumnNullable::create(std::move(inner_arr),
ColumnUInt8::create(inner_arr->size(), 0));
res = ColumnArray::create(std::move(nullable_arr), std::move(outer_offsets));

Contributor

Bug (Undefined Behavior): Use-after-move

std::move(inner_arr) is in the first argument and inner_arr->size() is in the second argument of the same function call. C++ does not guarantee evaluation order of function arguments. If the move is evaluated first, inner_arr becomes null and ->size() is a null pointer dereference.

Fix:

auto inner_size = inner_arr->size();
auto nullable_arr = ColumnNullable::create(std::move(inner_arr),
                                           ColumnUInt8::create(inner_size, 0));
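As a standalone illustration of the same pattern, the sketch below uses std::unique_ptr and a hypothetical make_wrapper that takes ownership by value; neither name comes from the PR. One caveat worth hedging: std::move by itself only casts, so whether the original line is actually affected depends on whether ColumnNullable::create takes its argument by value or by reference, which this thread does not show. Reading the size into a local first is safe in either case.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Buffer = std::unique_ptr<std::vector<int>>;

// Hypothetical factory that takes ownership by value, so the move out of the
// caller's pointer happens while the arguments are being initialized.
std::size_t make_wrapper(Buffer owned, std::size_t null_map_size) {
    return owned ? null_map_size : 0;
}

std::size_t wrap_safely(Buffer buffer) {
    // Read the size first; the later move can no longer affect it.
    const std::size_t size = buffer->size();
    return make_wrapper(std::move(buffer), size);
    // Risky form: make_wrapper(std::move(buffer), buffer->size());
    // The order in which the two arguments are initialized is unspecified,
    // so buffer may already be empty when ->size() is evaluated.
}
```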

for (int i = 1; i <= combination_length; i++) {
combinations = combinations * (array_length - combination_length + i) / i;
}

Contributor

Nit: Signed/unsigned comparison

Loop variable i is int (signed) but combination_length is size_t (unsigned). This triggers -Wsign-compare warnings. Consider using size_t i instead:

for (size_t i = 1; i <= combination_length; i++) {

@daju233
Author

daju233 commented Mar 20, 2026

/review
