
fix: emoji and supplementary Unicode characters match failure in AhoCorasick #33

Open
Deepak8858 wants to merge 1 commit into master-wayne7:develop from Deepak8858:fix/issue-31-unicode-match-failure

Conversation


@Deepak8858 Deepak8858 commented Apr 12, 2026

Fixes #31

Changed AhoCorasick.search to use .runes instead of .codeUnits to match the behavior of addWord. This ensures that emoji and other supplementary Unicode characters (outside BMP) are correctly matched during search.
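
To illustrate why the two views of a string disagree, here is a minimal sketch in plain Dart. The helper `anyCodeUnitIsKey` and the `Set<int>` standing in for the trie's child keys are hypothetical and not part of the library; only `String.runes` and `String.codeUnits` are real Dart APIs.

```dart
/// Hypothetical helper: returns true if any UTF-16 code unit of [text]
/// is present in [runeKeys], a stand-in for trie keys built from runes.
bool anyCodeUnitIsKey(Set<int> runeKeys, String text) =>
    text.codeUnits.any(runeKeys.contains);

void main() {
  const text = '😀bad';

  // codeUnits exposes UTF-16 code units: the emoji is a surrogate pair.
  print(text.codeUnits.length); // 5
  // runes exposes Unicode code points: the emoji is a single value.
  print(text.runes.length); // 4

  // Keys as a rune-based addWord would store them for the word "😀".
  final runeKeys = '😀'.runes.toSet(); // {0x1F600}

  // Looking up surrogate halves (0xD83D, 0xDE00) never finds 0x1F600.
  print(anyCodeUnitIsKey(runeKeys, '😀')); // false
  // Looking up by rune finds the key.
  print('😀'.runes.any(runeKeys.contains)); // true
}
```

Because neither half of the surrogate pair equals the code point stored in the trie, a search driven by code units can never follow the emoji's edge, which is consistent with the silent failure described in #31.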

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced search functionality to properly handle Unicode characters and international text, ensuring accurate results across different character sets and scripts.


coderabbitai bot commented Apr 12, 2026

📝 Walkthrough

Modified AhoCorasick.search to iterate over Unicode code points (runes) instead of UTF-16 code units, aligning the search method with the rune-based approach already used in addWord. This resolves silent matching failures for supplementary Unicode characters and emojis.

Changes

| Cohort / File(s) | Summary |
|---|---|
| AhoCorasick Unicode Traversal Fix (`lib/src/aho_corasick.dart`) | Changed search method to use .runes.toList() instead of .codeUnits for text iteration, ensuring consistent Unicode handling across trie construction and traversal. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 Runes now hop where code units once fell,
Emojis and chars beyond BMP's spell,
No more silent failures in the trie's embrace,
Each Unicode point finds its rightful place! 🌍✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: fixing emoji and supplementary Unicode character matching in AhoCorasick. |
| Linked Issues check | ✅ Passed | The change directly addresses issue #31 by replacing codeUnits with runes in search to align with addWord implementation. |
| Out of Scope Changes check | ✅ Passed | The change is narrowly scoped: only the search method iteration basis is modified, directly addressing the Unicode mismatch without extraneous modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/src/aho_corasick.dart (1)

59-76: ⚠️ Potential issue | 🔴 Critical

Preserve UTF-16 end offsets while matching by runes.

Line 76 stores i, which is now a rune index after the switch to textLower.runes.toList(). This breaks the search() contract: the docstring states it returns indices where matches END, and callers in safe_text_filter.dart (lines 127–137, 210–219, 257–286, 308–318) use these keys with substring(), text[...], and codeUnitAt(), which operate on UTF-16 code units. For "😀bad", the character "d" ends at rune index 3 but UTF-16 index 4. The mismatch breaks boundary checks, substring ranges, and can yield negative start indices.
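
The index mismatch can be reproduced in isolation. The sketch below uses a hypothetical helper, `utf16EndOffsets`, which is not part of the library; it only shows how a rune index relates to the UTF-16 index of the same character.

```dart
/// Hypothetical helper: for each rune in [text], returns the UTF-16 index
/// of that rune's last code unit.
List<int> utf16EndOffsets(String text) {
  final ends = <int>[];
  var offset = 0;
  for (final rune in text.runes) {
    // Code points above U+FFFF are encoded as a surrogate pair (2 code units).
    final width = rune > 0xFFFF ? 2 : 1;
    ends.add(offset + width - 1);
    offset += width;
  }
  return ends;
}

void main() {
  const text = '😀bad';
  final ends = utf16EndOffsets(text);
  print(ends); // [1, 2, 3, 4]

  // "d" has rune index 3 but UTF-16 index 4.
  print(text[ends[3]]); // d
  // Using the rune index directly lands one character too early.
  print(text[3]); // a
}
```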

Minimal fix
   Map<int, List<String>> search(String text) {
     final matches = <int, List<String>>{};
     TrieNode? current = _root;
     final textLower = text.toLowerCase();
     final units = textLower.runes.toList();
+    var codeUnitOffset = 0;
 
     for (int i = 0; i < units.length; i++) {
       final rune = units[i];
+      final runeWidth = rune > 0xFFFF ? 2 : 1;
 
       while (current != null && !current.children.containsKey(rune)) {
         current = current.fail;
       }
 
       if (current == null) {
         current = _root;
+        codeUnitOffset += runeWidth;
         continue;
       }
 
       current = current.children[rune]!;
 
       if (current.outputs.isNotEmpty) {
-        matches.putIfAbsent(i, () => []).addAll(current.outputs);
+        final endIndex = codeUnitOffset + runeWidth - 1;
+        matches.putIfAbsent(endIndex, () => []).addAll(current.outputs);
       }
+      codeUnitOffset += runeWidth;
     }
 
     return matches;
   }
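
For experimenting with the suggested change outside the package, the following is a self-contained sketch of a rune-keyed automaton that reports UTF-16 end offsets. `MiniAhoCorasick`, `Node`, and `build` are illustrative names and are assumptions, not the library's actual classes; the emoji pattern is an arbitrary example rather than an entry taken from en.txt.

```dart
import 'dart:collection';

class Node {
  final Map<int, Node> children = {}; // keyed by rune (code point)
  Node? fail;
  final List<String> outputs = [];
}

/// Illustrative stand-in for the package's AhoCorasick class.
class MiniAhoCorasick {
  final Node _root = Node();

  void addWord(String word) {
    var node = _root;
    for (final rune in word.toLowerCase().runes) {
      node = node.children.putIfAbsent(rune, () => Node());
    }
    node.outputs.add(word);
  }

  /// Builds failure links breadth-first; call once after all addWord calls.
  void build() {
    final queue = Queue<Node>();
    for (final child in _root.children.values) {
      child.fail = _root;
      queue.add(child);
    }
    while (queue.isNotEmpty) {
      final node = queue.removeFirst();
      for (final entry in node.children.entries) {
        final child = entry.value;
        var f = node.fail;
        while (f != null && !f.children.containsKey(entry.key)) {
          f = f.fail;
        }
        child.fail = f == null ? _root : f.children[entry.key]!;
        child.outputs.addAll(child.fail!.outputs);
        queue.add(child);
      }
    }
  }

  /// Maps the UTF-16 index where a match ENDS to the words ending there.
  Map<int, List<String>> search(String text) {
    final matches = <int, List<String>>{};
    Node? current = _root;
    var offset = 0; // UTF-16 offset of the current rune's first code unit
    for (final rune in text.toLowerCase().runes) {
      final width = rune > 0xFFFF ? 2 : 1;
      while (current != null && !current.children.containsKey(rune)) {
        current = current.fail;
      }
      current = current == null ? _root : current.children[rune]!;
      if (current.outputs.isNotEmpty) {
        matches.putIfAbsent(offset + width - 1, () => []).addAll(current.outputs);
      }
      offset += width;
    }
    return matches;
  }
}

void main() {
  final ac = MiniAhoCorasick()
    ..addWord('bad')
    ..addWord('🖕')
    ..build();
  const text = '😀bad🖕';
  final matches = ac.search(text);
  print(matches); // {4: [bad], 6: [🖕]}
  // Callers can recover each match with substring(), as with code-unit keys.
  matches.forEach((end, words) {
    for (final w in words) {
      print(text.substring(end - w.length + 1, end + 1));
    }
  });
}
```

Note that the offsets here are relative to the lowercased text; a few characters may change length under `toLowerCase()`, so a stricter implementation might need to account for that separately.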
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/src/aho_corasick.dart` around lines 59 - 76, The search() routine now
iterates runes (units = textLower.runes.toList()) but stores the loop index i
into matches as the match end offset, producing rune indices instead of UTF-16
code unit offsets; update the logic that records match end positions (where
matches.putIfAbsent(i, ...) is called) to convert the current rune index to the
corresponding UTF-16 code unit index before using it as the key so callers using
substring()/codeUnitAt() still get UTF-16 offsets. Locate the loop using
units/textLower.runes, the matches map mutation, and the variables
current/_root/current.outputs and compute the proper UTF-16 end offset for each
rune (e.g., by tracking cumulative code unit lengths or mapping rune positions
to code unit indices) and use that UTF-16 index as the map key.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ace4f63-f8ac-42b5-a2c0-8e885ec9bd6d

📥 Commits

Reviewing files that changed from the base of the PR and between 40ca9ce and 6f804ee.

📒 Files selected for processing (1)
  • lib/src/aho_corasick.dart

@master-wayne7
Owner

@Deepak8858 can you please add test cases for the change you made? For example, verify that search works when the input contains a foul emoji (check the last lines of en.txt in assets; you'll find them there).


Development

Successfully merging this pull request may close these issues.

[BUG] emoji and supplementary Unicode characters silently fail to match in AhoCorasick search

2 participants