fix: use runes instead of codeUnits in AhoCorasick.search #32

Open
KirthiSaiT wants to merge 1 commit into master-wayne7:develop from KirthiSaiT:fix/aho-corasick-runes-codeunits

@KirthiSaiT KirthiSaiT commented Apr 11, 2026

Description

addWord builds the trie from .runes (Unicode code points), but search
traversed the input using .codeUnits (UTF-16 code units). For characters above
U+FFFF the two views produce different integer sequences, so words containing
such characters silently passed through the filter undetected.
Fixed by changing textLower.codeUnits to textLower.runes.toList() in
AhoCorasick.search so both methods use the same encoding.
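
To illustrate the mismatch, here is a minimal sketch that is not taken from the PR. The helpers keysViaRunes and keysViaCodeUnits are hypothetical names standing in for the integer sequences that addWord and the old search would have walked. For BMP-only text the two agree; for a character above U+FFFF they do not.

```dart
/// Hypothetical helper: the key sequence a rune-based trie insert would use.
List<int> keysViaRunes(String s) => s.runes.toList();

/// Hypothetical helper: the key sequence a code-unit-based traversal would use.
List<int> keysViaCodeUnits(String s) => s.codeUnits.toList();

/// True when [s] contains a code point above U+FFFF, i.e. when the two
/// sequences above differ in length.
bool hasSupplementaryCodePoint(String s) => s.runes.any((r) => r > 0xFFFF);

void main() {
  const bmp = 'bad';
  const emoji = '\u{1F600}'; // U+1F600, outside the BMP

  print(keysViaRunes(bmp)); // [98, 97, 100]
  print(keysViaCodeUnits(bmp)); // [98, 97, 100]

  print(keysViaRunes(emoji)); // [128512]
  print(keysViaCodeUnits(emoji)); // [55357, 56832] (a surrogate pair)

  print(hasSupplementaryCodePoint(emoji)); // true
}
```

A trie keyed on 128512 can never be reached by stepping through 55357 and then 56832, which is why such words went unmatched.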

Related Issue

Closes #31

Type of Change

  • Bug fix

Checklist

  • Code compiles
  • Tests added
  • No breaking changes

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed Unicode handling in search so that characters above U+FFFF (supplementary-plane characters such as most emoji) are matched correctly.
  • Documentation

    • Enhanced documentation with clearer specifications on search behavior, case-insensitivity, and match index formatting.

coderabbitai bot commented Apr 11, 2026

Walkthrough

Enhanced inline documentation for the AhoCorasick trie implementation, clarifying responsibilities of trie nodes, failure links, and output accumulation. Fixed a Unicode handling bug in the search method by switching from UTF-16 code units to Unicode runes, aligning it with how addWord constructs the trie, enabling proper matching of emoji and supplementary Unicode characters.

Changes

Cohort / File(s) | Summary
Documentation & Unicode Bug Fix (lib/src/aho_corasick.dart) | Expanded documentation for TrieNode and the AhoCorasick class with detailed explanations of trie semantics, fail links, and output accumulation. Added explicit API expectations for addWord (lowercasing behavior) and buildFailureLinks (call once after inserts). Fixed bug: changed the search method from iterating textLower.codeUnits to textLower.runes.toList(), enabling correct Unicode/emoji matching. Updated search documentation to clarify zero-based inclusive end indices and case-insensitive behavior.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A trie's tale in runes we write,
No more emojis lost from sight!
From code units' trap we now are free,
Unicode flows in harmony. 🎉
Docs expanded, bugs are past,
This Aho-Corasick fix will last!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately and concisely summarizes the main code change: replacing codeUnits with runes in AhoCorasick.search to fix character matching.
Linked Issues check | ✅ Passed | The PR directly addresses issue #31 by changing search to use runes instead of codeUnits, resolving the mismatch that caused supplementary Unicode characters to fail matching.
Out of Scope Changes check | ✅ Passed | All changes are within scope: documentation enhancements for clarity and the core fix replacing codeUnits with runes.toList() in the search method, all directly related to the bug fix.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/src/aho_corasick.dart (1)

107-150: ⚠️ Potential issue | 🔴 Critical

Convert rune indices to code-unit indices in search() return values — current indices cause incorrect text masking for non-BMP characters.

Line 149 stores rune loop counter i as the match end index. Downstream consumers (lines 214–215, 259, 289 in safe_text_filter.dart) use these indices with code-unit APIs: word.length, substring(), and codeUnitAt(). For non-BMP characters (emoji, certain Unicode scripts), a single rune encodes as 2 code units in UTF-16, causing index mismatch and incorrect boundary calculations.

Proposed fix
-    final units = textLower.runes.toList();
-
-    for (int i = 0; i < units.length; i++) {
-      final rune = units[i];
+    int codeUnitEndIndex = -1;
+    for (final rune in textLower.runes) {
+      codeUnitEndIndex += (rune > 0xFFFF) ? 2 : 1;

-      if (current.outputs.isNotEmpty) {
-        matches.putIfAbsent(i, () => []).addAll(current.outputs);
+      if (current.outputs.isNotEmpty) {
+        matches.putIfAbsent(codeUnitEndIndex, () => []).addAll(current.outputs);
       }

Also update the docstring to clarify UTF-16 code-unit indices.
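
As a rough, standalone illustration of the index conversion suggested above (not code from the repository), the sketch below maps each rune position to the zero-based, inclusive index of that rune's last UTF-16 code unit. The function name codeUnitEndIndices is hypothetical, and the sketch assumes the text contains no unpaired surrogates.

```dart
/// Hypothetical helper: for each rune in [text], the index of that rune's
/// last UTF-16 code unit (zero-based, inclusive).
List<int> codeUnitEndIndices(String text) {
  final ends = <int>[];
  var end = -1;
  for (final rune in text.runes) {
    // Code points above U+FFFF occupy two code units (a surrogate pair).
    end += rune > 0xFFFF ? 2 : 1;
    ends.add(end);
  }
  return ends;
}

void main() {
  const text = 'a\u{1F600}b'; // 'a', U+1F600 (two code units), 'b'
  final ends = codeUnitEndIndices(text);

  print(text.length); // 4
  print(ends); // [0, 2, 3]

  // 'b' is rune number 2 but sits at code-unit index 3.
  print(text.substring(ends[2], ends[2] + 1)); // b
}
```

Using the rune counter directly with substring() here would slice at index 2, which is the second half of the surrogate pair rather than 'b'.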


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff03c41b-f7a2-4293-85d6-3df476584066

📥 Commits

Reviewing files that changed from the base of the PR and between 40ca9ce and 11fa705.

📒 Files selected for processing (1)
  • lib/src/aho_corasick.dart

@master-wayne7
Owner

@KirthiSaiT, can you please push only the runes change in this PR and add test cases, as mentioned previously?

Development

Successfully merging this pull request may close these issues.

[BUG] emoji and supplementary Unicode characters silently fail to match in AhoCorasick search
