@treysp treysp commented Dec 2, 2025

Context

REGEXP_EXTRACT extracts a substring that matches a regex pattern (several dialects call it REGEXP_SUBSTR).

In some dialects, REGEXP_EXTRACT accepts a position arg: the index in the string at which matching should begin. In effect, a SUBSTR operation runs before the regex pattern match.

DuckDB does not support the position arg, so we match the source semantics with a SUBSTR call inside the REGEXP_EXTRACT call.
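
A minimal Python sketch of that equivalence, using the standard `re` module as a stand-in for a SQL regex engine. The helper name `regexp_extract_at` is hypothetical, returning `None` for "no match" is an assumption of the model, and edge cases such as anchors (where slicing first and matching from an offset can differ) are ignored.

```python
import re


def regexp_extract_at(text, pattern, position=1):
    """Model REGEXP_EXTRACT with a 1-based position arg (hypothetical helper)."""
    # Equivalent of SUBSTR(text, position): drop the first position - 1 characters.
    tail = text[position - 1:]
    # Match the pattern against the remaining text only.
    match = re.search(pattern, tail)
    return match.group(0) if match else None


print(regexp_extract_at("abc123def456", r"[0-9]+"))     # position defaults to 1
print(regexp_extract_at("abc123def456", r"[0-9]+", 7))  # matching starts at "def456"
```

With position 7 the first run of digits is skipped entirely, which is the behaviour the nested SUBSTR call reproduces in DuckDB.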

Problem

What should happen when the position index value is larger than the length of the string?

SUBSTR returns an empty string '' in that case, but REGEXP_EXTRACT returns NULL in every dialect other than Redshift.

Therefore, for those source dialects we must wrap the DuckDB SUBSTR call in NULLIF: NULLIF(SUBSTR(arg, pos), '').
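
The effect of the wrapper can be sketched in Python, with `None` standing in for NULL. All function names here are hypothetical models, not sqlglot or DuckDB APIs. The model assumes that the regex function returns NULL for NULL input and an empty string when nothing matches, which is how I understand DuckDB's REGEXP_EXTRACT to behave.

```python
import re


def substr(text, pos):
    # Model of SUBSTR(text, pos): 1-based start; overflow yields ''.
    return text[pos - 1:]


def nullif(value, other):
    # Model of NULLIF(value, other): NULL (None) when the two are equal.
    return None if value == other else value


def regexp_extract(text, pattern):
    # Assumption: NULL input propagates to NULL output.
    if text is None:
        return None
    match = re.search(pattern, text)
    # Assumption: no match yields '' rather than NULL.
    return match.group(0) if match else ""


def extract_without_nullif(text, pattern, pos):
    return regexp_extract(substr(text, pos), pattern)


def extract_with_nullif(text, pattern, pos):
    return regexp_extract(nullif(substr(text, pos), ""), pattern)


print(repr(extract_without_nullif("abc123", r"[0-9]+", 10)))  # overflow, no wrapper
print(repr(extract_with_nullif("abc123", r"[0-9]+", 10)))     # overflow, wrapped
```

Under these assumptions the unwrapped call yields an empty string on overflow, while the wrapped call yields `None`, matching the NULL that the non-Redshift source dialects return.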

Wrinkle

REGEXP_EXTRACT always returns a single match or capture group (either the first by default or from a user-specified arg). In contrast, REGEXP_EXTRACT_ALL returns all matches or capture groups in an ARRAY.

When position overflows in a REGEXP_EXTRACT_ALL call, the result is an empty array: it contains neither empty strings nor NULL values.
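
A small Python model of the array case, again with hypothetical names and `re` standing in for the SQL engine. It assumes a pattern that cannot match the empty string.

```python
import re


def regexp_extract_all_at(text, pattern, position=1):
    """Model REGEXP_EXTRACT_ALL with a 1-based position arg (hypothetical helper)."""
    # Overflowing positions leave an empty tail, so there is nothing to match.
    tail = text[position - 1:]
    return [m.group(0) for m in re.finditer(pattern, tail)]


print(regexp_extract_all_at("a1b22", r"[0-9]+"))      # all matches from the start
print(regexp_extract_all_at("a1b22", r"[0-9]+", 10))  # position past the end
```

In this model an overflowing position produces an empty list rather than a list holding `''` or `None`.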

closes #6442

@treysp treysp changed the title from "Fix(duckdb): wrap REGEXP_EXTRACT SUBSTRING() call in NULLIF" to "Fix: REGEXP_EXTRACT position arg overflow" Dec 4, 2025
@treysp treysp force-pushed the trey/duckdb-array-extract-nullif branch from 0f5a174 to 1cacc4a Compare December 4, 2025 17:53
@treysp treysp force-pushed the trey/duckdb-array-extract-nullif branch from 1cacc4a to 070c990 Compare December 4, 2025 18:03
@georgesittas georgesittas changed the title from "Fix: REGEXP_EXTRACT position arg overflow" to "Fix!: REGEXP_EXTRACT position arg overflow" Dec 4, 2025
@treysp treysp merged commit df4c1d3 into main Dec 4, 2025
8 checks passed
@treysp treysp deleted the trey/duckdb-array-extract-nullif branch December 4, 2025 18:10
Development

Successfully merging this pull request may close these issues.

[bigquery -> duckdb] REGEXP_EXTRACT transpilation can result in different return type

4 participants