Update default regex to filter out non-ASCII tokens #252
kunlinglio wants to merge 1 commit into blopker:main from
Conversation
Pull request overview
Updates Codebook’s default skip-regex list to ignore non-ASCII tokens so they are not flagged as spelling errors (fixes #172).
Changes:
- Added a new default skip pattern intended to match non-ASCII characters.
- Added a unit test validating the new default pattern’s matching behavior.
```rust
// Non-Ascii characters
Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
```
The skip regex `[^\x00-\x7F]+` only matches the non-ASCII substring(s). In the current skip implementation (`parser::find_skip_ranges` + `is_within_skip_range`), a word is skipped only if its full byte range is contained within a skip range; for mixed ASCII/non-ASCII words like "München", this pattern creates a skip range only around "ü", so the full word will still be spell-checked. If the intention is to ignore any word containing non-ASCII characters, consider changing the pattern to match the entire alphabetic token containing at least one non-ASCII character (e.g., using Unicode properties like `\p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*`), or adjust the skip logic to skip on overlap instead of full containment.
```diff
- // Non-Ascii characters
- Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
+ // Words containing non-ASCII alphabetic characters
+ Regex::new(r"\p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*").expect("Valid non-ASCII regex"),
```
```rust
#[test]
fn test_non_ascii_pattern() {
    let patterns = get_default_skip_patterns();
    let non_ascii_pattern = &patterns[9];
```
This test indexes `patterns[9]`, which tightly couples it to the exact ordering and length of `DEFAULT_SKIP_PATTERNS`. Since the pattern list is likely to grow over time, prefer selecting the new pattern in a way that won't break when another default is inserted earlier (e.g., `patterns.last()` plus an assertion about what it should match, or asserting the list length before indexing).
```diff
- let non_ascii_pattern = &patterns[9];
+ let non_ascii_pattern = patterns.last().expect("DEFAULT_SKIP_PATTERNS should include a non-ASCII pattern");
```
```rust
#[test]
fn test_non_ascii_pattern() {
    let patterns = get_default_skip_patterns();
    let non_ascii_pattern = &patterns[9];

    assert!(non_ascii_pattern.is_match("你好世界")); // Chinese
    assert!(non_ascii_pattern.is_match("こんにちは")); // Japanese - Hiragana
    assert!(non_ascii_pattern.is_match("日本語")); // Japanese - Kanji
    assert!(non_ascii_pattern.is_match("안녕하세요")); // Korean - Hangul
    assert!(non_ascii_pattern.is_match("München")); // German umlaut
    assert!(non_ascii_pattern.is_match("étudiant")); // French accent
    assert!(non_ascii_pattern.is_match("cañón")); // Spanish tilde
    assert!(!non_ascii_pattern.is_match("Hello World"));
}
```
The new unit test verifies `Regex::is_match`, but it doesn't validate the behavior that matters for #172: that these tokens are actually excluded from extraction and spell-checking. Consider adding a test that runs `parser::extract_all_words` (or `Codebook::spell_check`) with the default skip patterns and asserts that non-ASCII words like "简体中文" are not returned as candidates.
```rust
Regex::new(r"\b[0-9a-fA-F]{7,40}\b").expect("Valid git hash regex"),
// Markdown/HTML links (URL part must not contain spaces)
Regex::new(r"\[([^\]]+)\]\([^\s)]+\)").expect("Valid markdown link regex"),
// Non-Ascii characters
```
Comment capitalization: "Non-Ascii" should be "Non-ASCII" to match the acronym's standard capitalization (the error string already uses "non-ASCII").
```diff
- // Non-Ascii characters
+ // Non-ASCII characters
```
Fixes #172
Changes
- Added the `[^\x00-\x7F]+` regex pattern to the default skip patterns to ignore non-ASCII words.