fix: emoji and supplementary Unicode characters match failure in AhoCorasick#33
Conversation
📝 WalkthroughWalkthroughModified Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/src/aho_corasick.dart (1)
59-76:⚠️ Potential issue | 🔴 CriticalPreserve UTF-16 end offsets while matching by runes.
Line 76 stores
i, which is now a rune index after the switch totextLower.runes.toList(). This breaks thesearch()contract: the docstring states it returns indices where matches END, and callers insafe_text_filter.dart(lines 127–137, 210–219, 257–286, 308–318) use these keys withsubstring(),text[...], andcodeUnitAt(), which operate on UTF-16 code units. For"😀bad", the character"d"ends at rune index 3 but UTF-16 index 4. The mismatch breaks boundary checks, substring ranges, and can yield negative start indices.Minimal fix
Map<int, List<String>> search(String text) { final matches = <int, List<String>>{}; TrieNode? current = _root; final textLower = text.toLowerCase(); final units = textLower.runes.toList(); + var codeUnitOffset = 0; for (int i = 0; i < units.length; i++) { final rune = units[i]; + final runeWidth = rune > 0xFFFF ? 2 : 1; while (current != null && !current.children.containsKey(rune)) { current = current.fail; } if (current == null) { current = _root; + codeUnitOffset += runeWidth; continue; } current = current.children[rune]!; if (current.outputs.isNotEmpty) { - matches.putIfAbsent(i, () => []).addAll(current.outputs); + final endIndex = codeUnitOffset + runeWidth - 1; + matches.putIfAbsent(endIndex, () => []).addAll(current.outputs); } + codeUnitOffset += runeWidth; } return matches; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/src/aho_corasick.dart` around lines 59 - 76, The search() routine now iterates runes (units = textLower.runes.toList()) but stores the loop index i into matches as the match end offset, producing rune indices instead of UTF-16 code unit offsets; update the logic that records match end positions (where matches.putIfAbsent(i, ...) is called) to convert the current rune index to the corresponding UTF-16 code unit index before using it as the key so callers using substring()/codeUnitAt() still get UTF-16 offsets. Locate the loop using units/textLower.runes, the matches map mutation, and the variables current/_root/current.outputs and compute the proper UTF-16 end offset for each rune (e.g., by tracking cumulative code unit lengths or mapping rune positions to code unit indices) and use that UTF-16 index as the map key.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@lib/src/aho_corasick.dart`:
- Around line 59-76: The search() routine now iterates runes (units =
textLower.runes.toList()) but stores the loop index i into matches as the match
end offset, producing rune indices instead of UTF-16 code unit offsets; update
the logic that records match end positions (where matches.putIfAbsent(i, ...) is
called) to convert the current rune index to the corresponding UTF-16 code unit
index before using it as the key so callers using substring()/codeUnitAt() still
get UTF-16 offsets. Locate the loop using units/textLower.runes, the matches map
mutation, and the variables current/_root/current.outputs and compute the proper
UTF-16 end offset for each rune (e.g., by tracking cumulative code unit lengths
or mapping rune positions to code unit indices) and use that UTF-16 index as the
map key.
|
@Deepak8858 can you please add the test cases wrt to the change you made. For example, verify if the search is working if you have given a foul emoji(check the end lines of en.txt in assets you'll find them) in the input. |
Fixes #31
Changed
AhoCorasick.searchto use.runesinstead of.codeUnitsto match the behavior ofaddWord. This ensures that emoji and other supplementary Unicode characters (outside BMP) are correctly matched during search.Summary by CodeRabbit