[bugfix] Gemma4 suffix: drop trailing newline #9440
…arn it

The base template's `_add_dynamic_eos` reopens supervision on tokens matching `suffix_tokens_id` after each response. With `suffix = ['<turn|>\n']`, both `<turn|>` (106) and `\n` (107) get supervised, but `\n` belongs to the next user turn's prompt and the official IT model is never trained to predict it.

Keep `chat_sep = ['<turn|>\n']` so the multi-turn wire format is unchanged; only `suffix` is shortened to `['<turn|>']`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
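
To make the effect concrete, here is a minimal sketch of the suffix-matching idea described above. It is not the library's `_add_dynamic_eos`: the function name `reopen_suffix_supervision`, the `-100` ignore value, and the toy token ids other than 106 and 107 are assumptions chosen for illustration.

```python
IGNORE = -100                 # assumed "no loss" label value
TURN_END, NEWLINE = 106, 107  # token ids quoted in this PR


def reopen_suffix_supervision(input_ids, labels, suffix_ids):
    """Hypothetical stand-in for the idea behind `_add_dynamic_eos`.

    Wherever a supervised response is directly followed by a masked span
    that starts with `suffix_ids`, copy those ids into the labels.
    """
    original = list(labels)   # detect boundaries on the untouched labels
    out = list(labels)
    n = len(suffix_ids)
    for i in range(1, len(original)):
        ends_response = original[i - 1] != IGNORE and original[i] == IGNORE
        if ends_response and input_ids[i:i + n] == suffix_ids:
            out[i:i + n] = suffix_ids
    return out


# Toy two-turn sample: prompt, answer, <turn|>, \n, prompt, answer, <turn|>, \n
input_ids = [1, 2, 3, 10, 11, TURN_END, NEWLINE, 4, 5, 12, TURN_END, NEWLINE]
labels = [IGNORE, IGNORE, IGNORE, 10, 11, IGNORE, IGNORE,
          IGNORE, IGNORE, 12, IGNORE, IGNORE]

old = reopen_suffix_supervision(input_ids, labels, [TURN_END, NEWLINE])
new = reopen_suffix_supervision(input_ids, labels, [TURN_END])

print(old)  # both 106 and 107 become supervised after each answer
print(new)  # only 106 becomes supervised after each answer
```

In this toy setup, the shorter suffix removes exactly one supervised position (the `\n`) per response, while `input_ids` is not modified.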
Code Review
This pull request updates the `Gemma4TemplateMeta` class in `swift/template/templates/gemma.py` by removing the trailing newline from the `suffix` field, changing it from `['<turn|>\n']` to `['<turn|>']`. There are no review comments to address, and I have no additional feedback to provide.
This is fine and consistent with the Jinja template.
Yes, the Jinja template adds '<turn|>\n' after every model response. This is why in code, This PR argues that learning the extra '\n' is unnecessary, which is consistent with previous implementations such as Gemma and DeepSeek.
Either implementation is fine; it has no impact on training outcomes.
It does have an impact on training, because the model learns the extra newline, though I agree the impact is minor.
Summary
- `Gemma4TemplateMeta.suffix` was `['<turn|>\n']`, causing the base template's `_add_dynamic_eos` to reopen supervision on both `<turn|>` (106) and `\n` (107) at the end of each response.
- `\n` belongs to the next user turn's prompt; the official Gemma IT model is never trained to predict it (inferred from testing on the IT checkpoint), so supervising it is spurious.
- Shorten `suffix` to `['<turn|>']`.
- `chat_sep` stays `['<turn|>\n']`, so the multi-turn wire format is unchanged; only the set of supervised positions shrinks by one token per response.

Test plan
- `labels` ends with `[..., <answer_token>, 106]` (only `<turn|>` supervised) instead of `[..., <answer_token>, 106, 107]`.
- `<turn|>\n` between turns (verify `input_ids` unchanged).

🤖 Generated with Claude Code
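
The second test-plan item can be illustrated with a toy renderer. This is a sketch under a stated assumption, not the library's encoder: it assumes `chat_sep` is emitted between rounds and `suffix` only after the final response. The function `render_dialogue` and the toy token ids other than 106 and 107 are hypothetical.

```python
TURN_END, NEWLINE = 106, 107  # token ids quoted in this PR


def render_dialogue(turns, chat_sep, suffix):
    """Toy renderer: `chat_sep` between rounds, `suffix` after the last one.

    `turns` is a list of (prompt_ids, response_ids) pairs. The layout is an
    assumption made for illustration only.
    """
    ids = []
    for k, (prompt, response) in enumerate(turns):
        ids += prompt + response
        is_last = k == len(turns) - 1
        ids += suffix if is_last else chat_sep
    return ids


turns = [([1, 2, 3], [10, 11]), ([4, 5], [12])]
chat_sep = [TURN_END, NEWLINE]

before = render_dialogue(turns, chat_sep, suffix=[TURN_END, NEWLINE])
after = render_dialogue(turns, chat_sep, suffix=[TURN_END])

# The separator between the two turns is the same in both renderings.
sep_start = len(turns[0][0]) + len(turns[0][1])
print(before[sep_start:sep_start + 2], after[sep_start:sep_start + 2])
```

Under that assumption, the tokens between turns are identical before and after the change, and any difference is confined to what follows the final response.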