fix: regex simplification of anchored patterns produces wrong results #22727
lyne7-sc wants to merge 7 commits into
Barrr

query T
SELECT * FROM test WHERE column1 ~* '^(barrr|bazzz)$'
Tests with negation+regex are missing (!~ and !~*).
Added !~ and !~* coverage.
Bazzz

statement ok
CREATE TABLE test_regex_utf8view(s VARCHAR) AS VALUES ('foo'), ('Bazzz');
Question to educate myself: how are the values here Utf8View?
I'd expect some casting to be needed to achieve that.
I think it's because the config map_string_types_to_utf8view defaults to true, so a VARCHAR column is planned as Utf8View in slt.
datafusion/datafusion/common/src/config.rs
Lines 292 to 295 in e1d8d46
query T
SELECT * FROM test_regex_utf8view WHERE s ~* '^bazzz$'
----
Bazzz
How does this assert the expected result?
Neither the optimization nor the type is asserted.
Maybe use EXPLAIN ... and assert its output instead?
Sounds good to me; it makes the intent clearer. I've added EXPLAIN assertions for all the anchored cases.
Which issue does this PR close?
Rationale for this change
The regex simplification rule rewrites anchored regex matches (`^literal$`, `^(a|b)$`) into cheaper `=` / `IN` / `LIKE` expressions. Two bugs in that path:
- The rewritten literal was always built as `Utf8` via `lit(...)`, so on a `Utf8View` / `LargeUtf8` column the rewritten comparison failed at execution with `Invalid comparison operation: Utf8View == Utf8`.
- A `~*` (case-insensitive) anchored literal was rewritten to a case-sensitive `=`, silently dropping rows that differ only in case.

What changes are included in this PR?
- Build the rewritten literal with `string_scalar.to_expr(...)` so its type follows the column type (`Utf8` / `LargeUtf8` / `Utf8View`), consistent with the existing `LIKE` branches.
- Rewrite `~*` anchored literals to `ILIKE` instead of `=`. The existing `is_safe_for_like` guard ensures the literal has no `%` / `_`, so this is an exact case-insensitive match. (Anchored alternations under `~*` still fall back to regex evaluation.)

Are these changes tested?
Yes. `predicates.slt` now covers anchored `~` / `~*`, single literals and alternations, over both `Utf8` and `Utf8View` columns. Existing `regex.rs` unit tests still pass.

Are there any user-facing changes?
Yes, bug fixes only.
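
For readers who want the decision table in one place, the rewrite rules from the description can be sketched as a small standalone Rust function. This is a toy model, not the code in this PR: the `Rewrite` enum and the `simplify_anchored`, `is_plain_literal`, and `is_safe_for_like_sketch` names are hypothetical, the pattern is inspected as a plain string rather than a parsed regex, and the metacharacter list is an assumption made to keep the example short. The column-type handling (`Utf8` / `LargeUtf8` / `Utf8View`) is deliberately left out.

```rust
/// Hypothetical outcome of simplifying `col ~ pattern` or `col ~* pattern`.
#[derive(Debug, PartialEq)]
enum Rewrite {
    Eq(String),          // col = 'lit'
    ILike(String),       // col ILIKE 'lit'
    InList(Vec<String>), // col IN ('a', 'b')
    KeepRegex,           // leave the regex match untouched
}

/// Toy check (assumption): non-empty and free of regex metacharacters.
fn is_plain_literal(s: &str) -> bool {
    !s.is_empty() && !s.chars().any(|c| r"\.^$|()[]{}*+?".contains(c))
}

/// Toy stand-in for the guard named in the description: no LIKE wildcards.
fn is_safe_for_like_sketch(s: &str) -> bool {
    !s.contains('%') && !s.contains('_')
}

fn simplify_anchored(pattern: &str, case_insensitive: bool) -> Rewrite {
    // Only fully anchored patterns (`^...$`) are candidates.
    let Some(inner) = pattern.strip_prefix('^').and_then(|p| p.strip_suffix('$')) else {
        return Rewrite::KeepRegex;
    };

    // Anchored alternation `^(a|b)$`: IN list under `~`, regex under `~*`.
    if let Some(alts) = inner.strip_prefix('(').and_then(|p| p.strip_suffix(')')) {
        let parts: Vec<&str> = alts.split('|').collect();
        if !case_insensitive && parts.iter().all(|p| is_plain_literal(p)) {
            return Rewrite::InList(parts.into_iter().map(String::from).collect());
        }
        return Rewrite::KeepRegex;
    }

    if !is_plain_literal(inner) {
        return Rewrite::KeepRegex;
    }

    if !case_insensitive {
        // `~` with an anchored literal: exact, case-sensitive equality.
        Rewrite::Eq(inner.to_string())
    } else if is_safe_for_like_sketch(inner) {
        // `~*` with an anchored literal: ILIKE without wildcards.
        Rewrite::ILike(inner.to_string())
    } else {
        // `%` or `_` would act as wildcards under ILIKE, so do not rewrite.
        Rewrite::KeepRegex
    }
}
```

In this sketch, `simplify_anchored("^bazzz$", true)` yields `Rewrite::ILike("bazzz")` rather than `Rewrite::Eq`, which corresponds to the second bug fix: a case-sensitive `=` would drop the row `Bazzz` used in the tests above.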