Refactor `regexp_count_inner` into a unified row-processing pipeline while preserving behavior by kosiew · Pull Request #22770 · apache/datafusion

kosiew · 2026-06-05T05:25:51Z

Which issue does this PR close?

Part of Consolidate regexp_count_inner Scalar/Array Branch Handling #22669

Rationale for this change

regexp_count_inner currently uses an 8-arm match over combinations of scalar and array inputs for regex, start position, and flags. Many of these branches duplicate the same logic for length validation, null handling, regex compilation/cache lookup, and match counting.

This refactor reduces duplication and centralizes row processing while preserving existing SQL-visible behavior, error messages, error ordering, and regex cache usage.

What changes are included in this PR?

Replaced the large 8-arm scalar/array match in regexp_count_inner with a unified row-processing implementation.
Added private helper abstractions:
- StringValueSource for scalar-or-array string arguments.
- StartValueSource for scalar-or-array start arguments.
- string_value_opt for null-aware string access.
- validate_array_len for centralized length validation.
- compile_scalar_pattern for compiling reusable scalar regex/flags combinations once.
Preserved existing scalar NULL regex short-circuit behavior by returning a zero-filled result array before validating other arguments.
Preserved regex cache reuse through compile_and_cache_regex.
Removed the dependency on itertools::izip by consolidating processing into a single row loop.
Kept outer type dispatch and public interfaces unchanged.

Are these changes tested?

Yes.

Added tests covering behavior that is sensitive to validation and error ordering:

test_regexp_count_error_order_invalid_scalar_regex_before_start_len
test_regexp_count_error_order_flags_len_before_start_len

Existing regexp_count tests continue to run unchanged.

Are there any user-facing changes?

No.

This change is an internal refactor intended to preserve existing behavior, error messages, validation ordering, and SQL-visible results.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

…ndling - Replaced 8-arm scalar/array match with a single row loop. - Introduced private input sources: StringValueSource and StartValueSource. - Centralized length checks through validate_array_len function. - Preserved scalar NULL regex zero short-circuit behavior. - Maintained compile-once path for scalar regex and flags. - Retained cache mechanism for row-varying regex/flags. - Removed dependency on itertools::izip.

…ments - Added values_len - Replaced extension trait with string_value_opt - Introduced compile_scalar_pattern - Bound input value once per row for efficiency - Renamed private constructors: regex_arg to improve clarity, flags_arg for better understanding - Added comments to clarify scalar flags and start value (0) behavior

- Maintained old error ordering: - Scalar regex with scalar flags compiles before start length validation. - Scalar regex with array flags validates flags before start. - Regex array retains the order of regex/start/flags. - Implemented suggestion to use StartValueSource::Scalar(i64) instead of Option<i64>. - Added regression tests to verify error ordering.

- Updated regexp_count_inner to take S directly instead of &S. - Adjusted call sites to pass values.as_string() directly without an additional reference. - Changed flags_array type to Option<S> from Option<&S>. - Modified StringValueSource::Array to use S instead of &S. - Retained lifetime 'a solely for Arrow StringArrayType<'a> and for returning &str. - Confirmed that there are no behavioral changes.

kosiew · 2026-06-05T05:45:35Z

run benchmark regexp_count

adriangbot · 2026-06-05T05:49:35Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4628610409-447-62nbp 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing duplication-complexity-02-22669 (4a2b88f) to e1d8d46 (merge-base) diff
BENCH_NAME=regexp_count
BENCH_COMMAND=cargo bench --features=parquet --bench regexp_count
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

…rrow &values and &array - Updated string_value_opt to now take &S. - Modified call sites to borrow &values and &array. - Retained simplified regexp_count_inner(values: S, flags_array: Option<S>).

adriangbot · 2026-06-05T05:58:32Z

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                                                                      duplication-complexity-02-22669        main
-----                                                                      -------------------------------        ----
regexp_count size=1024/regexp_count_no_start [size=1024, str_len=128]      1.01    100.9±0.26µs        ? ?/sec    1.00     99.4±0.48µs        ? ?/sec
regexp_count size=1024/regexp_count_no_start [size=1024, str_len=32]       1.03     79.0±0.04µs        ? ?/sec    1.00     76.5±0.03µs        ? ?/sec
regexp_count size=1024/regexp_count_with_start [size=1024, str_len=128]    1.00    100.1±0.52µs        ? ?/sec    1.04    104.1±0.83µs        ? ?/sec
regexp_count size=1024/regexp_count_with_start [size=1024, str_len=32]     1.00     77.4±0.12µs        ? ?/sec    1.00     77.3±0.38µs        ? ?/sec
regexp_count size=4096/regexp_count_no_start [size=4096, str_len=128]      1.02    440.9±1.04µs        ? ?/sec    1.00    431.4±0.93µs        ? ?/sec
regexp_count size=4096/regexp_count_no_start [size=4096, str_len=32]       1.01    312.9±0.17µs        ? ?/sec    1.00    308.4±0.42µs        ? ?/sec
regexp_count size=4096/regexp_count_with_start [size=4096, str_len=128]    1.01    451.2±1.79µs        ? ?/sec    1.00    445.3±2.89µs        ? ?/sec
regexp_count size=4096/regexp_count_with_start [size=4096, str_len=32]     1.00    305.5±0.85µs        ? ?/sec    1.00    304.6±1.39µs        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	120.0s
Peak memory	4.3 GiB
Avg memory	4.3 GiB
CPU user	128.7s
CPU sys	0.8s
Peak spill	0 B

branch

Metric	Value
Wall time	120.0s
Peak memory	4.3 GiB
Avg memory	4.3 GiB
CPU user	129.8s
CPU sys	0.2s
Peak spill	0 B

File an issue against this benchmark runner

kosiew · 2026-06-05T06:21:45Z

run benchmark regexp_count

adriangbot · 2026-06-05T06:24:35Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4628797823-449-lkpr7 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing duplication-complexity-02-22669 (c68d5d5) to e1d8d46 (merge-base) diff
BENCH_NAME=regexp_count
BENCH_COMMAND=cargo bench --features=parquet --bench regexp_count
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-05T06:30:38Z

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

group                                                                      duplication-complexity-02-22669        main
-----                                                                      -------------------------------        ----
regexp_count size=1024/regexp_count_no_start [size=1024, str_len=128]      1.00     99.3±0.67µs        ? ?/sec    1.01    100.0±0.12µs        ? ?/sec
regexp_count size=1024/regexp_count_no_start [size=1024, str_len=32]       1.01     78.1±0.40µs        ? ?/sec    1.00     77.2±0.02µs        ? ?/sec
regexp_count size=1024/regexp_count_with_start [size=1024, str_len=128]    1.00     99.8±0.81µs        ? ?/sec    1.03    102.9±0.30µs        ? ?/sec
regexp_count size=1024/regexp_count_with_start [size=1024, str_len=32]     1.00     77.0±0.22µs        ? ?/sec    1.01     77.5±0.04µs        ? ?/sec
regexp_count size=4096/regexp_count_no_start [size=4096, str_len=128]      1.00    436.4±1.37µs        ? ?/sec    1.00    437.7±0.79µs        ? ?/sec
regexp_count size=4096/regexp_count_no_start [size=4096, str_len=32]       1.01    308.6±1.68µs        ? ?/sec    1.00    306.0±0.05µs        ? ?/sec
regexp_count size=4096/regexp_count_with_start [size=4096, str_len=128]    1.01    448.2±0.63µs        ? ?/sec    1.00    445.7±0.71µs        ? ?/sec
regexp_count size=4096/regexp_count_with_start [size=4096, str_len=32]     1.00    303.6±0.56µs        ? ?/sec    1.01    306.0±0.28µs        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	120.0s
Peak memory	4.3 GiB
Avg memory	4.3 GiB
CPU user	131.8s
CPU sys	0.8s
Peak spill	0 B

branch

Metric	Value
Wall time	115.0s
Peak memory	4.3 GiB
Avg memory	4.3 GiB
CPU user	128.3s
CPU sys	0.2s
Peak spill	0 B

File an issue against this benchmark runner

kosiew added 3 commits June 5, 2026 13:22

github-actions Bot added the functions Changes to functions implementation label Jun 5, 2026

fix: update string_value_opt to accept &S and adjust call sites to bo…

c68d5d5

…rrow &values and &array - Updated string_value_opt to now take &S. - Modified call sites to borrow &values and &array. - Retained simplified regexp_count_inner(values: S, flags_array: Option<S>).

kosiew marked this pull request as ready for review June 5, 2026 08:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor `regexp_count_inner` into a unified row-processing pipeline while preserving behavior#22770

Refactor `regexp_count_inner` into a unified row-processing pipeline while preserving behavior#22770
kosiew wants to merge 5 commits into
apache:mainfrom
kosiew:duplication-complexity-02-22669

kosiew commented Jun 5, 2026

Uh oh!

kosiew commented Jun 5, 2026

Uh oh!

adriangbot commented Jun 5, 2026

Uh oh!

adriangbot commented Jun 5, 2026

Uh oh!

kosiew commented Jun 5, 2026

Uh oh!

adriangbot commented Jun 5, 2026

Uh oh!

adriangbot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kosiew commented Jun 5, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

LLM-generated code disclosure

Uh oh!

kosiew commented Jun 5, 2026

Uh oh!

adriangbot commented Jun 5, 2026

Uh oh!

adriangbot commented Jun 5, 2026

Uh oh!

kosiew commented Jun 5, 2026

Uh oh!

adriangbot commented Jun 5, 2026

Uh oh!

adriangbot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants