Refactor regexp_count_inner into a unified row-processing pipeline while preserving behavior #22770
Opened by kosiew (5 commits)
…ndling
- Replaced 8-arm scalar/array match with a single row loop.
- Introduced private input sources: StringValueSource and StartValueSource.
- Centralized length checks through validate_array_len function.
- Preserved scalar NULL regex zero short-circuit behavior.
- Maintained compile-once path for scalar regex and flags.
- Retained cache mechanism for row-varying regex/flags.
- Removed dependency on itertools::izip.
…ments
- Added values_len
- Replaced extension trait with string_value_opt
- Introduced compile_scalar_pattern
- Bound input value once per row for efficiency
- Renamed private constructors: regex_arg to improve clarity, flags_arg for better understanding
- Added comments to clarify scalar flags and start value (0) behavior
- Maintained old error ordering:
  - Scalar regex with scalar flags compiles before start length validation.
  - Scalar regex with array flags validates flags before start.
  - Regex array retains the order of regex/start/flags.
- Implemented suggestion to use StartValueSource::Scalar(i64) instead of Option<i64>.
- Added regression tests to verify error ordering.
- Updated regexp_count_inner to take S directly instead of &S.
- Adjusted call sites to pass values.as_string() directly without an additional reference.
- Changed flags_array type to Option<S> from Option<&S>.
- Modified StringValueSource::Array to use S instead of &S.
- Retained lifetime 'a solely for Arrow StringArrayType<'a> and for returning &str.
- Confirmed that there are no behavioral changes.
run benchmark regexp_count

🤖 Criterion benchmark running (GKE): comparing duplication-complexity-02-22669 (4a2b88f) to e1d8d46 (merge-base).
…rrow &values and &array
- Updated string_value_opt to now take &S.
- Modified call sites to borrow &values and &array.
- Retained simplified regexp_count_inner(values: S, flags_array: Option<S>).
🤖 Criterion benchmark completed (GKE).

run benchmark regexp_count

🤖 Criterion benchmark running (GKE): comparing duplication-complexity-02-22669 (c68d5d5) to e1d8d46 (merge-base).

🤖 Criterion benchmark completed (GKE).
Which issue does this PR close?
Rationale for this change
regexp_count_inner currently uses an 8-arm match over combinations of scalar and array inputs for regex, start position, and flags. Many of these branches duplicate the same logic for length validation, null handling, regex compilation/cache lookup, and match counting.

This refactor reduces duplication and centralizes row processing while preserving existing SQL-visible behavior, error messages, error ordering, and regex cache usage.
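To make the idea concrete, here is a minimal sketch of how a scalar-or-array "source" can collapse the per-combination match arms into one row loop. It is not the PR's code: `StringSource`, `StartSource`, `count_from`, and `count_rows` are hypothetical names, plain slices stand in for Arrow arrays, literal substring matching stands in for regex matching so the example needs only the standard library, and treating a NULL value or pattern as zero matches is an assumption of the sketch.

```rust
/// Hypothetical stand-in for a scalar-or-array string argument.
/// Plain slices replace Arrow arrays so the sketch is std-only.
#[derive(Clone, Copy)]
enum StringSource<'a> {
    Scalar(Option<&'a str>),
    Array(&'a [Option<&'a str>]),
}

impl<'a> StringSource<'a> {
    /// Value for row `i`; a scalar repeats for every row.
    fn get(&self, i: usize) -> Option<&'a str> {
        match *self {
            StringSource::Scalar(v) => v,
            StringSource::Array(a) => a[i],
        }
    }
}

/// Hypothetical stand-in for a scalar-or-array start argument.
#[derive(Clone, Copy)]
enum StartSource<'a> {
    Scalar(i64),
    Array(&'a [i64]),
}

impl StartSource<'_> {
    fn get(&self, i: usize) -> i64 {
        match *self {
            StartSource::Scalar(v) => v,
            StartSource::Array(a) => a[i],
        }
    }
}

/// Counts non-overlapping literal occurrences of `pattern` in `value`,
/// beginning at the 1-based character position `start`.
/// Assumption: literal matching replaces regex matching in this sketch.
fn count_from(value: &str, pattern: &str, start: i64) -> Result<i64, String> {
    if start < 1 {
        return Err(format!("start must be >= 1, got {start}"));
    }
    if pattern.is_empty() {
        return Ok(0);
    }
    let skip = (start - 1) as usize;
    // Byte offset of the start character, or the end of the string.
    let offset = value.char_indices().nth(skip).map_or(value.len(), |(b, _)| b);
    Ok(value[offset..].matches(pattern).count() as i64)
}

/// One loop handles every scalar/array combination of pattern and start.
fn count_rows(
    values: &[Option<&str>],
    pattern: StringSource<'_>,
    start: StartSource<'_>,
) -> Result<Vec<i64>, String> {
    let mut out = Vec::with_capacity(values.len());
    for (i, value) in values.iter().copied().enumerate() {
        let n = match (value, pattern.get(i)) {
            (Some(v), Some(p)) => count_from(v, p, start.get(i))?,
            // Assumption: a NULL value or pattern counts as zero matches.
            _ => 0,
        };
        out.push(n);
    }
    Ok(out)
}
```

Calling `count_rows(&values, StringSource::Scalar(Some("bc")), StartSource::Scalar(1))` and `count_rows(&values, StringSource::Array(&patterns), StartSource::Array(&starts))` go through the same loop body, which is the duplication the refactor aims to remove.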
What changes are included in this PR?
- Replaced the large 8-arm scalar/array match in regexp_count_inner with a unified row-processing implementation.
- Added private helper abstractions:
  - StringValueSource for scalar-or-array string arguments.
  - StartValueSource for scalar-or-array start arguments.
  - string_value_opt for null-aware string access.
  - validate_array_len for centralized length validation.
  - compile_scalar_pattern for compiling reusable scalar regex/flags combinations once.
- Preserved existing scalar NULL regex short-circuit behavior by returning a zero-filled result array before validating other arguments.
- Preserved regex cache reuse through compile_and_cache_regex.
- Removed the dependency on itertools::izip by consolidating processing into a single row loop.
- Kept outer type dispatch and public interfaces unchanged.
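The three pattern paths described above (scalar NULL short-circuit, compile-once for a scalar pattern, cache lookup for row-varying patterns) can be sketched roughly as follows. Everything here is illustrative rather than the PR's implementation: `Compiled`, `compile`, `compile_cached`, `PatternArg`, and `count_all` are hypothetical names, the "compiled" value is a literal needle with a single assumed `i` flag instead of a real regex, and the cache is a plain `HashMap`.

```rust
use std::collections::HashMap;

/// Stand-in for a compiled regex (assumption: literal matching only).
struct Compiled {
    needle: String,
    ignore_case: bool,
}

impl Compiled {
    fn count(&self, value: &str) -> i64 {
        if self.needle.is_empty() {
            return 0;
        }
        let n = if self.ignore_case {
            value.to_lowercase().matches(self.needle.as_str()).count()
        } else {
            value.matches(self.needle.as_str()).count()
        };
        n as i64
    }
}

/// Hypothetical cache keyed by (pattern, flags).
type Cache = HashMap<(String, Option<String>), Compiled>;

/// "Compiles" a pattern; only the `i` flag is accepted in this sketch.
fn compile(pattern: &str, flags: Option<&str>) -> Result<Compiled, String> {
    let mut ignore_case = false;
    for flag in flags.unwrap_or("").chars() {
        match flag {
            'i' => ignore_case = true,
            other => return Err(format!("unsupported flag: {other}")),
        }
    }
    let needle = if ignore_case { pattern.to_lowercase() } else { pattern.to_string() };
    Ok(Compiled { needle, ignore_case })
}

/// Compiles on first use and reuses the cached entry afterwards.
fn compile_cached<'c>(
    cache: &'c mut Cache,
    pattern: &str,
    flags: Option<&str>,
) -> Result<&'c Compiled, String> {
    let key = (pattern.to_string(), flags.map(str::to_string));
    if !cache.contains_key(&key) {
        let compiled = compile(pattern, flags)?;
        cache.insert(key.clone(), compiled);
    }
    Ok(&cache[&key])
}

enum PatternArg<'a> {
    Scalar(Option<&'a str>),
    Array(&'a [Option<&'a str>]),
}

fn count_all(
    values: &[Option<&str>],
    pattern: PatternArg<'_>,
    flags: Option<&str>,
) -> Result<Vec<i64>, String> {
    match pattern {
        // Scalar NULL pattern: zero-filled result before any other validation.
        PatternArg::Scalar(None) => Ok(vec![0; values.len()]),
        // Scalar pattern: compile once, reuse for every row.
        PatternArg::Scalar(Some(p)) => {
            let compiled = compile(p, flags)?;
            Ok(values.iter().map(|&v| v.map_or(0, |s| compiled.count(s))).collect())
        }
        // Row-varying pattern: validate length, then look up the cache per row.
        PatternArg::Array(patterns) => {
            if patterns.len() != values.len() {
                return Err(format!(
                    "pattern length {} does not match values length {}",
                    patterns.len(),
                    values.len()
                ));
            }
            let mut cache = Cache::new();
            let mut out = Vec::with_capacity(values.len());
            for (value, pat) in values.iter().zip(patterns) {
                let n = match (*value, *pat) {
                    (Some(v), Some(p)) => compile_cached(&mut cache, p, flags)?.count(v),
                    _ => 0,
                };
                out.push(n);
            }
            Ok(out)
        }
    }
}
```

Note that in this sketch a scalar NULL pattern returns zeros even when the flags would be rejected, which mirrors the "before validating other arguments" wording above.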
Are these changes tested?
Yes.
Added tests covering behavior that is sensitive to validation and error ordering:
- test_regexp_count_error_order_invalid_scalar_regex_before_start_len
- test_regexp_count_error_order_flags_len_before_start_len

Existing regexp_count tests continue to run unchanged.

Are there any user-facing changes?
No.
This change is an internal refactor intended to preserve existing behavior, error messages, validation ordering, and SQL-visible results.
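As an illustration of why validation ordering is observable, the sketch below shows that when two inputs are invalid at once, the order of the checks decides which error the caller sees. The function signatures, the error strings, and the parenthesis-balancing stand-in for regex compilation are all assumptions made for this example; only the name validate_array_len and the two orderings come from the PR description.

```rust
/// Hypothetical signature: reports a length mismatch for the named argument.
fn validate_array_len(name: &str, actual: usize, expected: usize) -> Result<(), String> {
    if actual == expected {
        Ok(())
    } else {
        Err(format!("{name} length {actual} does not match values length {expected}"))
    }
}

/// Stand-in for regex compilation: rejects unbalanced parentheses.
fn compile_pattern(pattern: &str) -> Result<String, String> {
    let open = pattern.matches('(').count();
    let close = pattern.matches(')').count();
    if open == close {
        Ok(pattern.to_string())
    } else {
        Err(format!("invalid pattern: {pattern}"))
    }
}

/// Scalar regex with scalar flags: compile first, then check start length.
fn check_scalar_regex_scalar_flags(
    values_len: usize,
    pattern: &str,
    start_len: usize,
) -> Result<(), String> {
    compile_pattern(pattern)?;
    validate_array_len("start", start_len, values_len)?;
    Ok(())
}

/// Scalar regex with array flags: check flags length before start length.
fn check_scalar_regex_array_flags(
    values_len: usize,
    pattern: &str,
    flags_len: usize,
    start_len: usize,
) -> Result<(), String> {
    validate_array_len("flags", flags_len, values_len)?;
    validate_array_len("start", start_len, values_len)?;
    compile_pattern(pattern)?;
    Ok(())
}
```

Swapping the first two lines of either function would leave every valid query unchanged but change the message returned for doubly invalid input, which is the kind of drift the regression tests guard against.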
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.