The Perils of LLM Translation #131
thecodedrift announced in TIL
I've been experimenting with `llama-4-scout-17b-16e-instruct` as a translation layer. What I uncovered was that most n-gram based language detection systems barely scratch the surface of the beauty in human language. There are over 7,000 known languages, and most LD systems cover 60-100 at most.
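For context, this is roughly how classic character n-gram language detection works: one trained trigram profile per supported language, which is why coverage stalls in the double digits. A minimal sketch in the Cavnar-Trenkle style, with toy two-language profiles standing in for real trained corpora:

```python
from collections import Counter

def trigram_profile(text, top=50):
    """Rank a text's most frequent character trigrams."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(message_profile, language_profile):
    """Sum of rank differences; trigrams absent from the language
    profile pay a fixed penalty. Lower means more similar."""
    rank = {g: i for i, g in enumerate(language_profile)}
    penalty = len(language_profile)
    return sum(abs(rank.get(g, penalty) - i)
               for i, g in enumerate(message_profile))

# Toy profiles built from a few words each (my assumption, purely for
# illustration); a production system trains one profile per supported
# language on a large corpus -- hence 60-100 languages, not 7,000+.
PROFILES = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog near the old barn"),
    "sv": trigram_profile("jag vet inte vad det betyder men det kan vara en svensk mening"),
}

def detect(message):
    """Return the language whose profile is closest to the message's."""
    profile = trigram_profile(message)
    return min(PROFILES, key=lambda lang: out_of_place(profile, PROFILES[lang]))
```

Every language you want to detect needs its own profile, and every language missing from `PROFILES` is simply invisible to the detector.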
Also, 16e-instruct, despite being a fourth-generation model, lacks a large enough training set to accurately translate the many languages of the Philippines.
Moving to `llama-3.3-70b-instruct-fp8-fast` provided significantly better results, although it tends to over-translate.

Some of the chat messages I ran through both `llama-4-scout-17b-16e-instruct` and `llama-3.3-70b-instruct-fp8-fast`:

- `@curlygirlbabs ノ( ゜-゜ノ)`
- `jag vet inte` (Swedish: "I don't know")
- `Cheer42 здравствуйте товарищи` (Russian: "hello, comrades")
- `내 황홀에 취해, you can't look away` (Korean: "intoxicated by my rapture, you can't look away")
- `Blad is blad`
- `galing na curlyg5Wow` (Tagalog, approximately "very good")
- `kumusta na tayo, @ohaiDrifty ?` (Tagalog: "how are we doing, @ohaiDrifty?")
- `f0x64Marbie`

The most interesting translation errors came from the volatility in
`galing na curlyg5Wow`, for which the original writer's intent was approximately "very good". 16e-instruct could not settle on an origin language, and as it moved from language to language, the meaning changed drastically.

Historically, language detection models using Latin character sets need 40+ characters for a string to be unique enough to identify a language. That's pretty unrealistic for chat messages, which are often under 20 characters. That llama-3.3-70b can get approximate translations with low volatility is impressive.
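The translation-layer setup itself can be sketched as a single chat completion that asks the model to name its guessed origin language alongside the translation, which makes detection volatility visible across runs. This is my own sketch, not the author's actual setup: the endpoint URL is hypothetical, and an OpenAI-compatible chat completions API is assumed.

```python
import json
import urllib.request

MODEL = "llama-3.3-70b-instruct-fp8-fast"
API_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical endpoint

def build_translation_request(message):
    """Build a chat completion payload that returns both the detected
    origin language and the English translation as JSON."""
    return {
        "model": MODEL,
        "temperature": 0,  # determinism helps when comparing repeated runs
        "messages": [
            {"role": "system", "content": (
                "You translate short chat messages to English. Reply only "
                'with JSON: {"language": "<detected origin language>", '
                '"translation": "<english>"}. If the message is already '
                "English or untranslatable, echo it unchanged."
            )},
            {"role": "user", "content": message},
        ],
    }

def translate(message):
    """POST the request to the (assumed) OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_translation_request(message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return json.loads(body["choices"][0]["message"]["content"])
```

Logging the `language` field per call is a cheap way to quantify the volatility described above: a stable model should report the same origin language for the same message every time.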