Skip to content

Update default regex to filter out non_ascii tokens#252

Open
kunlinglio wants to merge 1 commit intoblopker:mainfrom
kunlinglio:feat/non_ascii
Open

Update default regex to filter out non_ascii tokens#252
kunlinglio wants to merge 1 commit intoblopker:mainfrom
kunlinglio:feat/non_ascii

Conversation

@kunlinglio
Copy link
Copy Markdown

Fixes #172

Changes

  • Added [^\x00-\x7F]+ regex patterns to the defaults to ignore non-ASCII words.
  • Added unit test to verify these patterns.

Copilot AI review requested due to automatic review settings April 27, 2026 02:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates Codebook’s default skip regex list to ignore non-ASCII tokens so they don’t get flagged as spelling errors (fixes #172).

Changes:

  • Added a new default skip pattern intended to match non-ASCII characters.
  • Added a unit test validating the new default pattern’s matching behavior.

Comment on lines +25 to +26
// Non-Ascii characters
Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The skip regex [^�-�]+ only matches the non-ASCII substring(s). In the current skip implementation (parser::find_skip_ranges + is_within_skip_range), a word is skipped only if its full byte range is contained within a skip-range; for mixed ASCII/non-ASCII words like "München" this pattern will create a skip-range only around "ü", so the full word will still be spell-checked. If the intention is to ignore any word containing non-ASCII characters, consider changing the pattern to match the entire alphabetic token containing at least one non-ASCII character (e.g., using Unicode properties like \p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*), or adjust the skip logic to skip on overlap instead of full containment.

Suggested change
// Non-Ascii characters
Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
// Words containing non-ASCII alphabetic characters
Regex::new(r"\p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*").expect("Valid non-ASCII regex"),

Copilot uses AI. Check for mistakes.
#[test]
fn test_non_ascii_pattern() {
let patterns = get_default_skip_patterns();
let non_ascii_pattern = &patterns[9];
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test indexes patterns[9], which tightly couples it to the exact ordering/length of DEFAULT_SKIP_PATTERNS. Since the pattern list is likely to grow over time, prefer selecting the new pattern in a way that won’t break when another default is inserted earlier (e.g., patterns.last() plus an assertion about what it should match, or asserting the list length before indexing).

Suggested change
let non_ascii_pattern = &patterns[9];
let non_ascii_pattern = patterns.last().expect("DEFAULT_SKIP_PATTERNS should include a non-ASCII pattern");

Copilot uses AI. Check for mistakes.
Comment on lines +88 to +100
#[test]
fn test_non_ascii_pattern() {
let patterns = get_default_skip_patterns();
let non_ascii_pattern = &patterns[9];

assert!(non_ascii_pattern.is_match("你好世界")); // Chinese
assert!(non_ascii_pattern.is_match("こんにちは")); // Japanese - Hiragana
assert!(non_ascii_pattern.is_match("日本語")); // Japanese - Kanji
assert!(non_ascii_pattern.is_match("안녕하세요")); // Korean - Hangul
assert!(non_ascii_pattern.is_match("München")); // German Umlaut
assert!(non_ascii_pattern.is_match("étudiant")); // French accent
assert!(non_ascii_pattern.is_match("cañón")); // Spanish tilde
assert!(!non_ascii_pattern.is_match("Hello World"));
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new unit test verifies Regex::is_match, but it doesn’t validate the behavior that matters for #172: that these tokens are actually excluded from extraction/spell-checking. Consider adding a test that runs parser::extract_all_words (or Codebook::spell_check) with the default skip patterns and asserts that non-ASCII words like "简体中文" are not returned as candidates.

Copilot uses AI. Check for mistakes.
Regex::new(r"\b[0-9a-fA-F]{7,40}\b").expect("Valid git hash regex"),
// Markdown/HTML links (URL part must not contain spaces)
Regex::new(r"\[([^\]]+)\]\([^\s)]+\)").expect("Valid markdown link regex"),
// Non-Ascii characters
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment capitalization: "Non-Ascii" should be "Non-ASCII" to match the acronym’s standard capitalization (and the error string already uses "non-ASCII").

Suggested change
// Non-Ascii characters
// Non-ASCII characters

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Some non-ASCII characters are incorrectly flagged as spelling errors

2 participants