Update default regex to filter out non_ascii tokens #252
@@ -22,6 +22,8 @@ lazy_static! {
     Regex::new(r"\b[0-9a-fA-F]{7,40}\b").expect("Valid git hash regex"),
     // Markdown/HTML links (URL part must not contain spaces)
     Regex::new(r"\[([^\]]+)\]\([^\s)]+\)").expect("Valid markdown link regex"),
+    // Non-Ascii characters
+    Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
Comment on lines +25 to +26
Suggested change:
-    // Non-Ascii characters
-    Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
+    // Words containing non-ASCII alphabetic characters
+    Regex::new(r"\p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*").expect("Valid non-ASCII regex"),
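One way to see what the original pattern `[^\x00-\x7F]+` covers is a std-only sketch of the same idea: collect each maximal run of non-ASCII characters. The helper name `non_ascii_runs` is hypothetical and is not part of the project; it stands in for iterating over the regex's matches so the example compiles without the `regex` crate. For a mixed word such as "naïve", only the "ï" is covered and the ASCII letters around it are left over. That may be one reason to prefer the widened pattern in the suggestion, assuming leftover fragments would still reach the spell checker.

```rust
/// Returns every maximal run of non-ASCII characters in `text`.
/// Hypothetical helper: a std-only stand-in for the matches of `[^\x00-\x7F]+`.
fn non_ascii_runs(text: &str) -> Vec<&str> {
    let mut runs = Vec::new();
    // Byte offset where the current non-ASCII run began, if we are inside one.
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        match (c.is_ascii(), start) {
            // A non-ASCII character opens a new run.
            (false, None) => start = Some(i),
            // An ASCII character closes the current run.
            (true, Some(s)) => {
                runs.push(&text[s..i]);
                start = None;
            }
            _ => {}
        }
    }
    // A run that reaches the end of the input is closed here.
    if let Some(s) = start {
        runs.push(&text[s..]);
    }
    runs
}

fn main() {
    // A fully non-ASCII word is covered as a whole.
    assert_eq!(non_ascii_runs("简体中文"), vec!["简体中文"]);
    // In mixed words only the non-ASCII characters are covered.
    assert_eq!(non_ascii_runs("naïve café"), vec!["ï", "é"]);
    assert!(non_ascii_runs("plain ascii").is_empty());
}
```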
Copilot AI, Apr 27, 2026
This test indexes patterns[9], which tightly couples it to the exact ordering/length of DEFAULT_SKIP_PATTERNS. Since the pattern list is likely to grow over time, prefer selecting the new pattern in a way that won’t break when another default is inserted earlier (e.g., patterns.last() plus an assertion about what it should match, or asserting the list length before indexing).
Suggested change:
-    let non_ascii_pattern = &patterns[9];
+    let non_ascii_pattern = patterns.last().expect("DEFAULT_SKIP_PATTERNS should include a non-ASCII pattern");
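A further option, beyond `patterns.last()`, is to pick the pattern under test by what it matches, so the test survives both insertions and appends. The sketch below is illustrative only: `Pattern`, `default_skip_patterns`, and `find_non_ascii_pattern` are hypothetical names, and plain predicates stand in for `Regex::is_match` so the example compiles without the `regex` crate. The git-hash predicate is a whole-token simplification of the real regex.

```rust
// Hypothetical stand-in for a compiled regex: a predicate playing the role
// of `Regex::is_match`. These are not the project's actual types.
type Pattern = fn(&str) -> bool;

// Simplified whole-token version of the git hash pattern.
fn looks_like_git_hash(s: &str) -> bool {
    (7..=40).contains(&s.len()) && s.chars().all(|c| c.is_ascii_hexdigit())
}

// True when the input contains at least one non-ASCII character.
fn has_non_ascii(s: &str) -> bool {
    !s.is_ascii()
}

// Hypothetical stand-in for DEFAULT_SKIP_PATTERNS.
fn default_skip_patterns() -> Vec<Pattern> {
    vec![looks_like_git_hash, has_non_ascii]
}

/// Selects the pattern by behavior instead of by position in the list:
/// it must match a non-ASCII word and must not match a plain ASCII word.
fn find_non_ascii_pattern(patterns: &[Pattern]) -> Option<&Pattern> {
    patterns.iter().find(|p| p("简体中文") && !p("plain"))
}

fn main() {
    let patterns = default_skip_patterns();
    let non_ascii = find_non_ascii_pattern(&patterns)
        .expect("default patterns should include a non-ASCII pattern");
    assert!(non_ascii("café"));
    assert!(!non_ascii("deadbeef"));
}
```

With real `Regex` values the same selection would be a `find` over the list using `is_match`, at the cost of the test depending on two probe strings rather than on an index.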
Copilot AI, Apr 27, 2026
The new unit test verifies Regex::is_match, but it doesn’t validate the behavior that matters for #172: that these tokens are actually excluded from extraction/spell-checking. Consider adding a test that runs parser::extract_all_words (or Codebook::spell_check) with the default skip patterns and asserts that non-ASCII words like "简体中文" are not returned as candidates.
Comment capitalization: "Non-Ascii" should be "Non-ASCII" to match the acronym’s standard capitalization (and the error string already uses "non-ASCII").