Fix!: REGEXP_EXTRACT position arg overflow #6458
Merged
Context
REGEXP_EXTRACT extracts substrings based on a regex pattern (and is called REGEXP_SUBSTR in multiple dialects).
In some dialects, REGEXP_EXTRACT has a position arg that specifies an index value for where in the string the matching should begin. This effectively means that a SUBSTR operation occurs before the regex pattern match.
DuckDB does not support the position arg, so we match the source semantics with a SUBSTR call inside the REGEXP_EXTRACT call.
Problem
What should happen when the position index value is larger than the length of the string?
In SUBSTR, the semantics are that it returns an empty string '', but in REGEXP_EXTRACT all dialects other than Redshift return NULL.
Therefore, for those source dialects we must wrap the DuckDB SUBSTR call with NULLIF(SUBSTR(arg, pos), '').
Wrinkle
REGEXP_EXTRACT always returns a single match or capture group (either the first by default or from a user-specified arg). In contrast, REGEXP_EXTRACT_ALL returns all matches or capture groups in an ARRAY.
When position overflows in a REGEXP_EXTRACT_ALL call, an empty array is returned that contains neither empty strings nor NULL values.
closes #6442
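Illustration
The overflow behavior described above can be sketched as a small pure-Python model. This is not the transpiler code from this PR and not DuckDB itself; the function names (substr, regexp_extract_from, regexp_extract_all_from) are hypothetical, position is assumed to be 1-based and at least 1, and returning None when the pattern does not match is a simplification that the description above does not cover.

```python
import re
from typing import List, Optional


def substr(s: str, pos: int) -> str:
    """Model of SQL SUBSTR(s, pos) with a 1-based start (assumes pos >= 1).

    Returns '' once pos is past the end of the string.
    """
    return s[pos - 1:]


def regexp_extract_from(s: str, pattern: str, pos: int) -> Optional[str]:
    """Model of REGEXP_EXTRACT(NULLIF(SUBSTR(s, pos), ''), pattern)."""
    tail = substr(s, pos)
    if tail == "":
        # NULLIF(..., '') turns the empty string into NULL, and NULL
        # propagates through the regex call.
        return None
    match = re.search(pattern, tail)
    # Assumption: a failed match is modeled as None for simplicity.
    return match.group(0) if match else None


def regexp_extract_all_from(s: str, pattern: str, pos: int) -> List[str]:
    """Model of REGEXP_EXTRACT_ALL with a position arg.

    On overflow the result is an empty array, not [''] and not [None].
    """
    tail = substr(s, pos)
    if tail == "":
        return []
    return [m.group(0) for m in re.finditer(pattern, tail)]


if __name__ == "__main__":
    print(regexp_extract_from("abc123", r"\d+", 5))    # position inside the string
    print(regexp_extract_from("abc123", r"\d+", 7))    # position past the end
    print(regexp_extract_all_from("a1b22", r"\d+", 6)) # position past the end
```

The explicit empty-tail check in regexp_extract_all_from is there because a pattern that can match the empty string would otherwise yield one empty match, which is exactly the result the Wrinkle section says must not appear.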