fix: emoji and supplementary Unicode characters match failure in AhoCorasick by Deepak8858 · Pull Request #33 · master-wayne7/safe_text

Deepak8858 · 2026-04-12T09:30:54Z

Fixes #31

Changed AhoCorasick.search to use .runes instead of .codeUnits to match the behavior of addWord. This ensures that emoji and other supplementary Unicode characters (outside BMP) are correctly matched during search.
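To illustrate why the mismatch caused silent misses, here is a minimal sketch. It assumes the trie's children are keyed by rune, as the description says `addWord` builds them. `childrenFor`, `hitByCodeUnits`, and `hitByRunes` are hypothetical helpers for illustration only, not part of safe_text.

```dart
// Hypothetical stand-in for a single trie level: children keyed by the first
// rune of each word (an assumption based on the PR description of addWord).
Map<int, String> childrenFor(List<String> words) => {
      for (final w in words)
        if (w.isNotEmpty) w.runes.first: w,
    };

/// Lookup the way the old search iterated: one UTF-16 code unit at a time.
/// A supplementary character arrives as two surrogates, neither of which
/// equals the rune key stored in the map.
bool hitByCodeUnits(Map<int, String> children, String text) =>
    text.codeUnits.any(children.containsKey);

/// Lookup the way the fixed search iterates: one code point (rune) at a time.
bool hitByRunes(Map<int, String> children, String text) =>
    text.runes.any(children.containsKey);
```

With `'\u{1F600}'` (😀) as the only word, `hitByCodeUnits` finds nothing in `'hi \u{1F600}'` because the string exposes the surrogates 0xD83D and 0xDE00, while `hitByRunes` sees the single code point 0x1F600 and matches. Words made only of BMP characters behave the same under both lookups.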

Summary by CodeRabbit

Bug Fixes
- Search now matches emoji and other supplementary Unicode characters (outside the BMP), which previously failed to match because search iterated UTF-16 code units while addWord used runes.

coderabbitai · 2026-04-12T09:31:14Z

Walkthrough

Modified AhoCorasick.search to iterate over Unicode code points (runes) instead of UTF-16 code units, aligning the search method with the rune-based approach already used in addWord. This resolves silent matching failures for supplementary Unicode characters and emojis.

Changes

| Cohort / File(s) | Summary |
|---|---|
| AhoCorasick Unicode Traversal Fix `lib/src/aho_corasick.dart` | Changed search method to use `.runes.toList()` instead of `.codeUnits` for text iteration, ensuring consistent Unicode handling across trie construction and traversal. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 Runes now hop where code units once fell,
Emojis and chars beyond BMP's spell,
No more silent failures in the trie's embrace,
Each Unicode point finds its rightful place! 🌍✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: fixing emoji and supplementary Unicode character matching in AhoCorasick. |
| Linked Issues check | ✅ Passed | The change directly addresses issue `#31` by replacing codeUnits with runes in search to align with addWord implementation. |
| Out of Scope Changes check | ✅ Passed | The change is narrowly scoped: only the search method iteration basis is modified, directly addressing the Unicode mismatch without extraneous modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/src/aho_corasick.dart (1)

59-76: ⚠️ Potential issue | 🔴 Critical

Preserve UTF-16 end offsets while matching by runes.

Line 76 stores i, which is now a rune index after the switch to textLower.runes.toList(). This breaks the search() contract: the docstring states it returns indices where matches END, and callers in safe_text_filter.dart (lines 127–137, 210–219, 257–286, 308–318) use these keys with substring(), text[...], and codeUnitAt(), all of which operate on UTF-16 code units. For "😀bad", the emoji occupies two UTF-16 code units but only one rune, so the character "d" ends at rune index 3 but UTF-16 index 4. The mismatch breaks boundary checks and substring ranges, and can yield negative start indices.

Minimal fix

   Map<int, List<String>> search(String text) {
     final matches = <int, List<String>>{};
     TrieNode? current = _root;
     final textLower = text.toLowerCase();
     final units = textLower.runes.toList();
+    var codeUnitOffset = 0;
 
     for (int i = 0; i < units.length; i++) {
       final rune = units[i];
+      final runeWidth = rune > 0xFFFF ? 2 : 1;
 
       while (current != null && !current.children.containsKey(rune)) {
         current = current.fail;
       }
 
       if (current == null) {
         current = _root;
+        codeUnitOffset += runeWidth;
         continue;
       }
 
       current = current.children[rune]!;
 
       if (current.outputs.isNotEmpty) {
-        matches.putIfAbsent(i, () => []).addAll(current.outputs);
+        final endIndex = codeUnitOffset + runeWidth - 1;
+        matches.putIfAbsent(endIndex, () => []).addAll(current.outputs);
       }
+      codeUnitOffset += runeWidth;
     }
 
     return matches;
   }
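The offset bookkeeping in the suggested fix can also be looked at in isolation. The sketch below uses a hypothetical helper, `utf16EndOffsets`, which is not part of the PR or of safe_text. It maps each rune position to the UTF-16 index of that rune's last code unit, using the same `rune > 0xFFFF ? 2 : 1` width rule as the diff.

```dart
/// For each rune of [text], returns the UTF-16 index of its last code unit.
/// Hypothetical helper for illustration; the suggested fix tracks the same
/// running offset inline instead of building a list.
List<int> utf16EndOffsets(String text) {
  final ends = <int>[];
  var codeUnitOffset = 0;
  for (final rune in text.runes) {
    // Code points above U+FFFF are encoded as a surrogate pair (2 code units).
    final runeWidth = rune > 0xFFFF ? 2 : 1;
    ends.add(codeUnitOffset + runeWidth - 1);
    codeUnitOffset += runeWidth;
  }
  return ends;
}
```

For `'\u{1F600}bad'` (the "😀bad" example above) this yields `[1, 2, 3, 4]`: the "d" is rune 3 but ends at UTF-16 index 4, which is the value callers need for `substring()` and `codeUnitAt()`. One caveat worth checking separately: these offsets index into the lowercased string, and `toLowerCase()` is not guaranteed to preserve length for every character.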

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@lib/src/aho_corasick.dart` around lines 59 - 76, The search() routine now
iterates runes (units = textLower.runes.toList()) but stores the loop index i
into matches as the match end offset, producing rune indices instead of UTF-16
code unit offsets; update the logic that records match end positions (where
matches.putIfAbsent(i, ...) is called) to convert the current rune index to the
corresponding UTF-16 code unit index before using it as the key so callers using
substring()/codeUnitAt() still get UTF-16 offsets. Locate the loop using
units/textLower.runes, the matches map mutation, and the variables
current/_root/current.outputs and compute the proper UTF-16 end offset for each
rune (e.g., by tracking cumulative code unit lengths or mapping rune positions
to code unit indices) and use that UTF-16 index as the map key.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ace4f63-f8ac-42b5-a2c0-8e885ec9bd6d

📥 Commits

Reviewing files that changed from the base of the PR and between 40ca9ce and 6f804ee.

📒 Files selected for processing (1)

lib/src/aho_corasick.dart

master-wayne7 · 2026-04-13T05:36:45Z

@Deepak8858 can you please add test cases for the change you made? For example, verify that search works when the input contains a foul emoji (check the last lines of en.txt in assets; you'll find them there).
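A rough sketch of what such a check could assert, combining this request with the offset concern raised in the review. Everything here is an assumption for illustration: `naiveSearch` is a hypothetical stand-in for `AhoCorasick.search` (a plain scan, not the real automaton), `offsetsAreUtf16` is a hypothetical checker, and the emoji used is an arbitrary example rather than a confirmed entry from en.txt.

```dart
/// Hypothetical stand-in for AhoCorasick.search: a naive scan that returns
/// matches keyed by the UTF-16 index where each match ends. Assumes the
/// words are already lowercase.
Map<int, List<String>> naiveSearch(List<String> words, String text) {
  final matches = <int, List<String>>{};
  final lower = text.toLowerCase();
  for (final word in words) {
    if (word.isEmpty) continue;
    var from = 0;
    while (true) {
      final start = lower.indexOf(word, from);
      if (start < 0) break;
      // String.length counts UTF-16 code units, so this key is a UTF-16 offset.
      matches.putIfAbsent(start + word.length - 1, () => []).add(word);
      from = start + 1;
    }
  }
  return matches;
}

/// Returns true when every reported end offset lets a caller recover the
/// matched word with substring(), the way the review says the filter code
/// consumes the keys.
bool offsetsAreUtf16(Map<int, List<String>> matches, String text) {
  final lower = text.toLowerCase();
  for (final entry in matches.entries) {
    for (final word in entry.value) {
      final start = entry.key - word.length + 1;
      if (start < 0 || entry.key >= lower.length) return false;
      if (lower.substring(start, entry.key + 1) != word) return false;
    }
  }
  return true;
}
```

A real test would call the package's own `addWord` and `search` in place of `naiveSearch`, then assert both that the emoji is found and that `offsetsAreUtf16` holds for the result. Rune-index keys fail the second check as soon as an emoji precedes or forms the match.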

fix: emoji and supplementary Unicode characters match failure in AhoC…

6f804ee

…orasick (closes master-wayne7#31)

Deepak8858 wants to merge 1 commit into master-wayne7:develop from Deepak8858:fix/issue-31-unicode-match-failure
