Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 17 additions & 0 deletions crates/codebook/src/regexes.rs
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@ lazy_static! {
Regex::new(r"\b[0-9a-fA-F]{7,40}\b").expect("Valid git hash regex"),
// Markdown/HTML links (URL part must not contain spaces)
Regex::new(r"\[([^\]]+)\]\([^\s)]+\)").expect("Valid markdown link regex"),
// Non-Ascii characters
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment capitalization: "Non-Ascii" should be "Non-ASCII" to match the acronym’s standard capitalization (and the error string already uses "non-ASCII").

Suggested change
// Non-Ascii characters
// Non-ASCII characters

Copilot uses AI. Check for mistakes.
Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
Comment on lines +25 to +26
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The skip regex [^�-�]+ only matches the non-ASCII substring(s). In the current skip implementation (parser::find_skip_ranges + is_within_skip_range), a word is skipped only if its full byte range is contained within a skip-range; for mixed ASCII/non-ASCII words like "München" this pattern will create a skip-range only around "ü", so the full word will still be spell-checked. If the intention is to ignore any word containing non-ASCII characters, consider changing the pattern to match the entire alphabetic token containing at least one non-ASCII character (e.g., using Unicode properties like \p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*), or adjust the skip logic to skip on overlap instead of full containment.

Suggested change
// Non-Ascii characters
Regex::new(r"[^\x00-\x7F]+").expect("Valid non-ASCII regex"),
// Words containing non-ASCII alphabetic characters
Regex::new(r"\p{Alphabetic}*[^\x00-\x7F]+\p{Alphabetic}*").expect("Valid non-ASCII regex"),

Copilot uses AI. Check for mistakes.
];
}

Expand Down Expand Up @@ -83,4 +85,19 @@ mod tests {
assert!(email_pattern.is_match("test.email+tag@domain.co.uk"));
assert!(!email_pattern.is_match("not an email"));
}
#[test]
fn test_non_ascii_pattern() {
let patterns = get_default_skip_patterns();
let non_ascii_pattern = &patterns[9];
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test indexes patterns[9], which tightly couples it to the exact ordering/length of DEFAULT_SKIP_PATTERNS. Since the pattern list is likely to grow over time, prefer selecting the new pattern in a way that won’t break when another default is inserted earlier (e.g., patterns.last() plus an assertion about what it should match, or asserting the list length before indexing).

Suggested change
let non_ascii_pattern = &patterns[9];
let non_ascii_pattern = patterns.last().expect("DEFAULT_SKIP_PATTERNS should include a non-ASCII pattern");

Copilot uses AI. Check for mistakes.

assert!(non_ascii_pattern.is_match("你好世界")); // Chinese
assert!(non_ascii_pattern.is_match("こんにちは")); // Japanese - Hiragana
assert!(non_ascii_pattern.is_match("日本語")); // Japanese - Kanji
assert!(non_ascii_pattern.is_match("안녕하세요")); // Korean - Hangul
assert!(non_ascii_pattern.is_match("München")); // German Umlaut
assert!(non_ascii_pattern.is_match("étudiant")); // French accent
assert!(non_ascii_pattern.is_match("cañón")); // Spanish tilde
assert!(!non_ascii_pattern.is_match("Hello World"));
Comment on lines +88 to +100
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new unit test verifies Regex::is_match, but it doesn’t validate the behavior that matters for #172: that these tokens are actually excluded from extraction/spell-checking. Consider adding a test that runs parser::extract_all_words (or Codebook::spell_check) with the default skip patterns and asserts that non-ASCII words like "简体中文" are not returned as candidates.

Copilot uses AI. Check for mistakes.
assert!(!non_ascii_pattern.is_match("1234567890!@#$%^&*()"));
}
}
Loading