Update default regex to filter out non-ASCII tokens #252
kunlinglio wants to merge 1 commit into blopker:main from
Conversation
Pull request overview
Updates Codebook’s default skip-regex list to ignore non-ASCII tokens so they are not flagged as spelling errors (fixes #172).
Changes:
- Added a new default skip pattern intended to match non-ASCII characters.
- Added a unit test validating the new default pattern’s matching behavior.
```rust
// Non-Ascii characters
Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
```
The skip regex `[^\x00-\x7F]+` only matches the non-ASCII substring(s). In the current skip implementation (`parser::find_skip_ranges` + `is_within_skip_range`), a word is skipped only if its full byte range is contained within a skip range; for mixed ASCII/non-ASCII words like "München", this pattern creates a skip range only around "ü", so the full word will still be spell-checked. If the intention is to ignore any word containing non-ASCII characters, consider changing the pattern to match the entire alphabetic token containing at least one non-ASCII character (e.g., using Unicode properties like `\p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*`), or adjust the skip logic to skip on overlap instead of full containment.
```diff
- // Non-Ascii characters
- Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
+ // Words containing non-ASCII alphabetic characters
+ Regex::new(r"\p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*").expect("Valid non-ASCII regex"),
```
```rust
#[test]
fn test_non_ascii_pattern() {
    let patterns = get_default_skip_patterns();
    let non_ascii_pattern = &patterns[9];
```
This test indexes `patterns[9]`, which tightly couples it to the exact ordering and length of `DEFAULT_SKIP_PATTERNS`. Since the pattern list is likely to grow over time, prefer selecting the new pattern in a way that won't break when another default is inserted earlier (e.g., `patterns.last()` plus an assertion about what it should match, or asserting the list length before indexing).
```diff
- let non_ascii_pattern = &patterns[9];
+ let non_ascii_pattern = patterns.last().expect("DEFAULT_SKIP_PATTERNS should include a non-ASCII pattern");
```
```rust
#[test]
fn test_non_ascii_pattern() {
    let patterns = get_default_skip_patterns();
    let non_ascii_pattern = &patterns[9];

    assert!(non_ascii_pattern.is_match("你好世界")); // Chinese
    assert!(non_ascii_pattern.is_match("こんにちは")); // Japanese - Hiragana
    assert!(non_ascii_pattern.is_match("日本語")); // Japanese - Kanji
    assert!(non_ascii_pattern.is_match("안녕하세요")); // Korean - Hangul
    assert!(non_ascii_pattern.is_match("München")); // German umlaut
    assert!(non_ascii_pattern.is_match("étudiant")); // French accent
    assert!(non_ascii_pattern.is_match("cañón")); // Spanish tilde
    assert!(!non_ascii_pattern.is_match("Hello World"));
}
```
The new unit test verifies `Regex::is_match`, but it doesn't validate the behavior that matters for #172: that these tokens are actually excluded from extraction and spell-checking. Consider adding a test that runs `parser::extract_all_words` (or `Codebook::spell_check`) with the default skip patterns and asserts that non-ASCII words like "简体中文" are not returned as candidates.
```rust
Regex::new(r"\b[0-9a-fA-F]{7,40}\b").expect("Valid git hash regex"),
// Markdown/HTML links (URL part must not contain spaces)
Regex::new(r"\[([^\]]+)\]\([^\s)]+\)").expect("Valid markdown link regex"),
// Non-Ascii characters
```
Comment capitalization: "Non-Ascii" should be "Non-ASCII" to match the acronym's standard capitalization (the error string already uses "non-ASCII").
```diff
- // Non-Ascii characters
+ // Non-ASCII characters
```
Fixes #172
Changes
- Added the `[^\x00-\x7F]+` regex pattern to the default skip patterns to ignore non-ASCII words.