fix: use runes instead of codeUnits in AhoCorasick.search by KirthiSaiT · Pull Request #32 · master-wayne7/safe_text

KirthiSaiT · 2026-04-11T15:55:16Z

Description

addWord builds the trie from .runes (Unicode code points), but search
was traversing the text using .codeUnits (UTF-16 code units). For characters
above U+FFFF the two yield different integer sequences (one code point versus
a surrogate pair), so words containing those characters silently passed
through the filter undetected.
Fixed by changing textLower.codeUnits to textLower.runes.toList() in
AhoCorasick.search, so both methods iterate over code points.
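To illustrate the mismatch described above, here is a minimal sketch. The `Node`, `insert`, and `matches` names are hypothetical stand-ins written for this example, not the package's `TrieNode` or `AhoCorasick` API. It inserts a word by runes and then looks it up first by code units, then by runes.

```dart
// Hypothetical minimal trie, keyed by integer. Not the package's TrieNode.
class Node {
  final Map<int, Node> children = {};
  bool isWord = false;
}

// Insert a word one Unicode code point (rune) at a time.
void insert(Node root, String word) {
  var node = root;
  for (final rune in word.runes) {
    node = node.children.putIfAbsent(rune, () => Node());
  }
  node.isWord = true;
}

// Walk the trie with an arbitrary key sequence and report an exact match.
bool matches(Node root, Iterable<int> keys) {
  var node = root;
  for (final key in keys) {
    final next = node.children[key];
    if (next == null) return false;
    node = next;
  }
  return node.isWord;
}

void main() {
  final root = Node();
  insert(root, '😀'); // U+1F600, a single rune but two UTF-16 code units

  print('😀'.runes.toList()); // [128512]
  print('😀'.codeUnits); // [55357, 56832]

  print(matches(root, '😀'.codeUnits)); // false: surrogates are not trie keys
  print(matches(root, '😀'.runes)); // true: same keys as insertion
}
```

For text entirely within the Basic Multilingual Plane both sequences are identical, which is presumably why the bug only showed up with emoji and other supplementary characters.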

Related Issue

Closes #31

Type of Change

Bug fix

Checklist

Code compiles
Tests added
No breaking changes

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed Unicode character handling in search to properly support non-ASCII and multibyte characters.
Documentation
- Enhanced documentation with clearer specifications on search behavior, case-insensitivity, and match index formatting.

coderabbitai · 2026-04-11T15:55:30Z

📝 Walkthrough

Enhanced inline documentation for the AhoCorasick trie implementation, clarifying responsibilities of trie nodes, failure links, and output accumulation. Fixed a Unicode handling bug in the search method by switching from UTF-16 code units to Unicode runes, aligning it with how addWord constructs the trie, enabling proper matching of emoji and supplementary Unicode characters.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation & Unicode Bug Fix `lib/src/aho_corasick.dart` | Expanded documentation for `TrieNode` and `AhoCorasick` class with detailed explanations of trie semantics, `fail` links, and output accumulation. Added explicit API expectations for `addWord` (lowercasing behavior) and `buildFailureLinks` (call once after inserts). Fixed bug: Changed `search` method from iterating `textLower.codeUnits` to `textLower.runes.toList()`, enabling correct Unicode/emoji matching. Updated search documentation to clarify zero-based inclusive end indices and case-insensitive behavior. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A trie's tale in runes we write,
No more emojis lost from sight!
From code units' trap we now are free,
Unicode flows in harmony. 🎉
Docs expanded, bugs are past,
This Aho-Corasick fix will last! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main code change: replacing codeUnits with runes in AhoCorasick.search to fix character matching. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue `#31` by changing search to use runes instead of codeUnits, resolving the mismatch that caused supplementary Unicode characters to fail matching. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: documentation enhancements for clarity and the core fix replacing codeUnits with runes.toList() in the search method, all directly related to the bug fix. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/src/aho_corasick.dart (1)
107-150: ⚠️ Potential issue | 🔴 Critical

Convert rune indices to code-unit indices in search() return values — current indices cause incorrect text masking for non-BMP characters.

Line 149 stores rune loop counter i as the match end index. Downstream consumers (lines 214–215, 259, 289 in safe_text_filter.dart) use these indices with code-unit APIs: word.length, substring(), and codeUnitAt(). For non-BMP characters (emoji, certain Unicode scripts), a single rune encodes as 2 code units in UTF-16, causing index mismatch and incorrect boundary calculations.
Proposed fix
-    final units = textLower.runes.toList();
-
-    for (int i = 0; i < units.length; i++) {
-      final rune = units[i];
+    int codeUnitEndIndex = -1;
+    for (final rune in textLower.runes) {
+      codeUnitEndIndex += (rune > 0xFFFF) ? 2 : 1;

-      if (current.outputs.isNotEmpty) {
-        matches.putIfAbsent(i, () => []).addAll(current.outputs);
+      if (current.outputs.isNotEmpty) {
+        matches.putIfAbsent(codeUnitEndIndex, () => []).addAll(current.outputs);
       }
Also update the docstring to clarify UTF-16 code-unit indices.
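
As a rough sketch of the index conversion this comment describes, the helper below maps each rune position to the inclusive end index of that rune in UTF-16 code units. The name `codeUnitEndIndices` is hypothetical and is not part of the package. It only demonstrates why a rune counter and a code-unit offset diverge after a non-BMP character.

```dart
// Hypothetical helper: for each rune in [text], the inclusive index of its
// last UTF-16 code unit. Runes above U+FFFF occupy two code units.
List<int> codeUnitEndIndices(String text) {
  final ends = <int>[];
  var end = -1;
  for (final rune in text.runes) {
    end += rune > 0xFFFF ? 2 : 1;
    ends.add(end);
  }
  return ends;
}

void main() {
  const text = 'a😀b';
  print(text.runes.length); // 3 runes
  print(text.length); // 4 code units

  final ends = codeUnitEndIndices(text);
  print(ends); // [0, 2, 3]

  // String.substring takes code-unit offsets, so the emoji (rune index 1)
  // spans code units 1..2, not 1..1.
  print(text.substring(1, ends[1] + 1)); // 😀
}
```

With the rune counter as the end index, a consumer calling `substring` for the final `b` would use index 2 and land on the emoji's low surrogate instead.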

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff03c41b-f7a2-4293-85d6-3df476584066

📥 Commits

Reviewing files that changed from the base of the PR and between 40ca9ce and 11fa705.

📒 Files selected for processing (1)

lib/src/aho_corasick.dart

master-wayne7 · 2026-04-13T11:16:05Z

@KirthiSaiT can you please push only the runes change in this PR and add the test cases mentioned previously?

fix: use runes instead of codeUnits in AhoCorasick.search

11fa705

coderabbitai bot reviewed Apr 11, 2026

KirthiSaiT wants to merge 1 commit into master-wayne7:develop from KirthiSaiT:fix/aho-corasick-runes-codeunits
