Skip to content

Refactor regexp_count_inner into a unified row-processing pipeline while preserving behavior#22770

Open
kosiew wants to merge 5 commits into
apache:mainfrom
kosiew:duplication-complexity-02-22669
Open

Refactor regexp_count_inner into a unified row-processing pipeline while preserving behavior#22770
kosiew wants to merge 5 commits into
apache:mainfrom
kosiew:duplication-complexity-02-22669

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented Jun 5, 2026

Which issue does this PR close?

Rationale for this change

regexp_count_inner currently uses an 8-arm match over combinations of scalar and array inputs for regex, start position, and flags. Many of these branches duplicate the same logic for length validation, null handling, regex compilation/cache lookup, and match counting.

This refactor reduces duplication and centralizes row processing while preserving existing SQL-visible behavior, error messages, error ordering, and regex cache usage.

What changes are included in this PR?

  • Replaced the large 8-arm scalar/array match in regexp_count_inner with a unified row-processing implementation.

  • Added private helper abstractions:

    • StringValueSource for scalar-or-array string arguments.
    • StartValueSource for scalar-or-array start arguments.
    • string_value_opt for null-aware string access.
    • validate_array_len for centralized length validation.
    • compile_scalar_pattern for compiling reusable scalar regex/flags combinations once.
  • Preserved existing scalar NULL regex short-circuit behavior by returning a zero-filled result array before validating other arguments.

  • Preserved regex cache reuse through compile_and_cache_regex.

  • Removed the dependency on itertools::izip by consolidating processing into a single row loop.

  • Kept outer type dispatch and public interfaces unchanged.

Are these changes tested?

Yes.

Added tests covering behavior that is sensitive to validation and error ordering:

  • test_regexp_count_error_order_invalid_scalar_regex_before_start_len
  • test_regexp_count_error_order_flags_len_before_start_len

Existing regexp_count tests continue to run unchanged.

Are there any user-facing changes?

No.

This change is an internal refactor intended to preserve existing behavior, error messages, validation ordering, and SQL-visible results.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 3 commits June 5, 2026 13:22
…ndling

- Replaced 8-arm scalar/array match with a single row loop.
- Introduced private input sources: StringValueSource and StartValueSource.
- Centralized length checks through validate_array_len function.
- Preserved scalar NULL regex zero short-circuit behavior.
- Maintained compile-once path for scalar regex and flags.
- Retained cache mechanism for row-varying regex/flags.
- Removed dependency on itertools::izip.
…ments

- Added values_len
- Replaced extension trait with string_value_opt
- Introduced compile_scalar_pattern
- Bound input value once per row for efficiency
- Renamed private constructors: regex_arg to improve clarity, flags_arg for better understanding
- Added comments to clarify scalar flags and start value (0) behavior
- Maintained old error ordering:
- Scalar regex with scalar flags compiles before start length validation.
- Scalar regex with array flags validates flags before start.
- Regex array retains the order of regex/start/flags.

- Implemented suggestion to use StartValueSource::Scalar(i64) instead of Option<i64>.

- Added regression tests to verify error ordering.
@github-actions github-actions Bot added the functions Changes to functions implementation label Jun 5, 2026
- Updated regexp_count_inner to take S directly instead of &S.
- Adjusted call sites to pass values.as_string() directly without an additional reference.
- Changed flags_array type to Option<S> from Option<&S>.
- Modified StringValueSource::Array to use S instead of &S.
- Retained lifetime 'a solely for Arrow StringArrayType<'a> and for returning &str.
- Confirmed that there are no behavioral changes.
@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented Jun 5, 2026

run benchmark regexp_count

@adriangbot
Copy link
Copy Markdown

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4628610409-447-62nbp 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing duplication-complexity-02-22669 (4a2b88f) to e1d8d46 (merge-base) diff
BENCH_NAME=regexp_count
BENCH_COMMAND=cargo bench --features=parquet --bench regexp_count
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

…rrow &values and &array

- Updated string_value_opt to now take &S.
- Modified call sites to borrow &values and &array.
- Retained simplified regexp_count_inner(values: S, flags_array: Option<S>).
@adriangbot
Copy link
Copy Markdown

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                                                      duplication-complexity-02-22669        main
-----                                                                      -------------------------------        ----
regexp_count size=1024/regexp_count_no_start [size=1024, str_len=128]      1.01    100.9±0.26µs        ? ?/sec    1.00     99.4±0.48µs        ? ?/sec
regexp_count size=1024/regexp_count_no_start [size=1024, str_len=32]       1.03     79.0±0.04µs        ? ?/sec    1.00     76.5±0.03µs        ? ?/sec
regexp_count size=1024/regexp_count_with_start [size=1024, str_len=128]    1.00    100.1±0.52µs        ? ?/sec    1.04    104.1±0.83µs        ? ?/sec
regexp_count size=1024/regexp_count_with_start [size=1024, str_len=32]     1.00     77.4±0.12µs        ? ?/sec    1.00     77.3±0.38µs        ? ?/sec
regexp_count size=4096/regexp_count_no_start [size=4096, str_len=128]      1.02    440.9±1.04µs        ? ?/sec    1.00    431.4±0.93µs        ? ?/sec
regexp_count size=4096/regexp_count_no_start [size=4096, str_len=32]       1.01    312.9±0.17µs        ? ?/sec    1.00    308.4±0.42µs        ? ?/sec
regexp_count size=4096/regexp_count_with_start [size=4096, str_len=128]    1.01    451.2±1.79µs        ? ?/sec    1.00    445.3±2.89µs        ? ?/sec
regexp_count size=4096/regexp_count_with_start [size=4096, str_len=32]     1.00    305.5±0.85µs        ? ?/sec    1.00    304.6±1.39µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 120.0s
Peak memory 4.3 GiB
Avg memory 4.3 GiB
CPU user 128.7s
CPU sys 0.8s
Peak spill 0 B

branch

Metric Value
Wall time 120.0s
Peak memory 4.3 GiB
Avg memory 4.3 GiB
CPU user 129.8s
CPU sys 0.2s
Peak spill 0 B

File an issue against this benchmark runner

@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented Jun 5, 2026

run benchmark regexp_count

@adriangbot
Copy link
Copy Markdown

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4628797823-449-lkpr7 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing duplication-complexity-02-22669 (c68d5d5) to e1d8d46 (merge-base) diff
BENCH_NAME=regexp_count
BENCH_COMMAND=cargo bench --features=parquet --bench regexp_count
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                                                      duplication-complexity-02-22669        main
-----                                                                      -------------------------------        ----
regexp_count size=1024/regexp_count_no_start [size=1024, str_len=128]      1.00     99.3±0.67µs        ? ?/sec    1.01    100.0±0.12µs        ? ?/sec
regexp_count size=1024/regexp_count_no_start [size=1024, str_len=32]       1.01     78.1±0.40µs        ? ?/sec    1.00     77.2±0.02µs        ? ?/sec
regexp_count size=1024/regexp_count_with_start [size=1024, str_len=128]    1.00     99.8±0.81µs        ? ?/sec    1.03    102.9±0.30µs        ? ?/sec
regexp_count size=1024/regexp_count_with_start [size=1024, str_len=32]     1.00     77.0±0.22µs        ? ?/sec    1.01     77.5±0.04µs        ? ?/sec
regexp_count size=4096/regexp_count_no_start [size=4096, str_len=128]      1.00    436.4±1.37µs        ? ?/sec    1.00    437.7±0.79µs        ? ?/sec
regexp_count size=4096/regexp_count_no_start [size=4096, str_len=32]       1.01    308.6±1.68µs        ? ?/sec    1.00    306.0±0.05µs        ? ?/sec
regexp_count size=4096/regexp_count_with_start [size=4096, str_len=128]    1.01    448.2±0.63µs        ? ?/sec    1.00    445.7±0.71µs        ? ?/sec
regexp_count size=4096/regexp_count_with_start [size=4096, str_len=32]     1.00    303.6±0.56µs        ? ?/sec    1.01    306.0±0.28µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 120.0s
Peak memory 4.3 GiB
Avg memory 4.3 GiB
CPU user 131.8s
CPU sys 0.8s
Peak spill 0 B

branch

Metric Value
Wall time 115.0s
Peak memory 4.3 GiB
Avg memory 4.3 GiB
CPU user 128.3s
CPU sys 0.2s
Peak spill 0 B

File an issue against this benchmark runner

@kosiew kosiew marked this pull request as ready for review June 5, 2026 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants