@dgisser dgisser commented Dec 22, 2025

Restore Japanese Kana With Dakuten

  • Added a regression test showing that Japanese kana like パピプペポ / ぱぴぷぺぽ currently get normalized to their unvoiced forms (ハヒフヘホ / はひふへほ) by the diacritic stripping pass.
  • Updated the normalization filter (src/censor.rs:159-175) so that the combining dakuten (\u{3099}) and handakuten (\u{309A}) marks are preserved, while other nonspacing marks and banned characters are still removed. This keeps Japanese kana intact without loosening any other filtering behavior.
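
For readers without the diff at hand, the idea in the second bullet can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in src/censor.rs: the names `strip_marks_keep_kana_voicing` and `is_mark_simplified` are hypothetical, the input is assumed to be already decomposed (base kana followed by a combining mark), and the mark check only covers the Combining Diacritical Marks block (U+0300 to U+036F) as a stand-in for a full nonspacing-mark test.

```rust
/// Combining dakuten (voiced sound mark), as named in the PR description.
const COMBINING_DAKUTEN: char = '\u{3099}';
/// Combining handakuten (semi-voiced sound mark), as named in the PR description.
const COMBINING_HANDAKUTEN: char = '\u{309A}';

/// Hypothetical, simplified mark check. ASSUMPTION: only the Combining
/// Diacritical Marks block and the two kana voicing marks count as marks here;
/// a real filter would consult the full Unicode nonspacing-mark category.
fn is_mark_simplified(c: char) -> bool {
    matches!(c, '\u{0300}'..='\u{036F}' | COMBINING_DAKUTEN | COMBINING_HANDAKUTEN)
}

/// Hypothetical filter: drops marks from already-decomposed text, except the
/// two kana voicing marks, which are kept so the kana stay voiced.
fn strip_marks_keep_kana_voicing(decomposed: &str) -> String {
    decomposed
        .chars()
        .filter(|&c| {
            let is_kana_voicing = c == COMBINING_DAKUTEN || c == COMBINING_HANDAKUTEN;
            is_kana_voicing || !is_mark_simplified(c)
        })
        .collect()
}

fn main() {
    // Decomposed form of "pa": base kana "ha" followed by the combining handakuten.
    let pa = "ハ\u{309A}";
    assert_eq!(strip_marks_keep_kana_voicing(pa), pa);

    // A Latin letter with a combining acute accent still loses its accent.
    let e_acute = "e\u{0301}";
    assert_eq!(strip_marks_keep_kana_voicing(e_acute), "e");
}
```

Allow-listing exactly two code points keeps the exception narrow: every other mark the simplified check recognizes is still removed, which matches the "without loosening any other filtering behavior" goal stated above.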

Testing

Before fix:

cargo +nightly test japanese_diacritics_preserved -- --nocapture
...
thread 'censor::tests::japanese_diacritics_preserved' panicked
  left: "パピプペポ"
  right: "ハヒフヘホ"

After fix:

cargo +nightly test japanese_diacritics_preserved -- --nocapture
running 1 test
test censor::tests::japanese_diacritics_preserved ... ok

Full suite:

  • cargo +nightly test censor

@dgisser dgisser force-pushed the main branch 2 times, most recently from ec27d63 to e84841b Compare December 22, 2025 04:31
@finnbear finnbear merged commit 85b2334 into finnbear:main Dec 22, 2025
1 of 2 checks passed

finnbear commented Dec 22, 2025

Looks great, thanks! I did initially suggest detecting Japanese (like Devanagari) but this is a better solution as there are only 2 diacritics and they're hard to abuse. This change will take effect in our games next time I update the filter.

dgisser commented Dec 22, 2025

for sure, let me know when it pushes to prod and I'll test it out!
