fix: use runes instead of codeUnits in AhoCorasick.search#32
fix: use runes instead of codeUnits in AhoCorasick.search#32KirthiSaiT wants to merge 1 commit intomaster-wayne7:developfrom
Conversation
📝 WalkthroughWalkthroughEnhanced inline documentation for the AhoCorasick trie implementation, clarifying responsibilities of trie nodes, failure links, and output accumulation. Fixed a Unicode handling bug in the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/src/aho_corasick.dart (1)
107-150:⚠️ Potential issue | 🔴 CriticalConvert rune indices to code-unit indices in
search()return values — current indices cause incorrect text masking for non-BMP characters.Line 149 stores rune loop counter
ias the match end index. Downstream consumers (lines 214–215, 259, 289 insafe_text_filter.dart) use these indices with code-unit APIs:word.length,substring(), andcodeUnitAt(). For non-BMP characters (emoji, certain Unicode scripts), a single rune encodes as 2 code units in UTF-16, causing index mismatch and incorrect boundary calculations.Proposed fix
- final units = textLower.runes.toList(); - - for (int i = 0; i < units.length; i++) { - final rune = units[i]; + int codeUnitEndIndex = -1; + for (final rune in textLower.runes) { + codeUnitEndIndex += (rune > 0xFFFF) ? 2 : 1; - if (current.outputs.isNotEmpty) { - matches.putIfAbsent(i, () => []).addAll(current.outputs); + if (current.outputs.isNotEmpty) { + matches.putIfAbsent(codeUnitEndIndex, () => []).addAll(current.outputs); }Also update the docstring to clarify UTF-16 code-unit indices.
|
@KirthiSaiT can you please only push the change regarding runes in this PR and add test cases as mentioned previously. |
Description
addWordbuilds the trie using.runes(Unicode code points) butsearchwas traversing using
.codeUnits(UTF-16 code units). For characters aboveU+FFFF these produce different integer sequences, causing those words to
silently pass through the filter undetected.
Fixed by changing
textLower.codeUnitstotextLower.runes.toList()inAhoCorasick.searchso both methods use the same encoding.Related Issue
Closes #31
Type of Change
Checklist
Summary by CodeRabbit
Release Notes
Bug Fixes
Documentation