Support add_generation_prompt request parameter for chat completions (#3877) #4331
Open
exzile wants to merge 1 commit into
The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on the /v3/chat/completions request, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted, which is the building block for assistant prefill.

- Parse add_generation_prompt in the request (openai_api_handler.cpp) and store it on the request struct (openai_request.hpp).
- Honor it in all chat-template application sites: the MINJA path (LLM and VLM continuous batching, legacy) and the Python-Jinja path (read from the request body and pass into the template render).
- Add tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct: default renders a trailing "<|im_start|>assistant", add_generation_prompt=false omits it.

Note: true assistant prefill (continue_final_message - continuing from the final assistant message without closing it) is a separate control and is left as a follow-up.

Implements openvinotoolkit#3877

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
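To make the effect of the flag concrete, here is a minimal sketch of a ChatML-style renderer in plain Python. It is an illustration only: `render_chatml` is a hypothetical helper, not the MINJA or Python-Jinja code path touched by this PR, and the layout is a simplification rather than the actual SmolLM2 template (which may, for example, add a default system message).

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in a simplified ChatML-style layout (illustrative only)."""
    parts = []
    for message in messages:
        # Every message, including a final assistant one, is opened and closed.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # The trailing generation prompt opens a new assistant turn for the model.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Name a prime number."},
    {"role": "assistant", "content": "Sure:"},
]

with_prompt = render_chatml(messages)
without_prompt = render_chatml(messages, add_generation_prompt=False)
```

In this sketch the two renderings differ only by the trailing assistant header. With the flag set to false, the final assistant message is preserved but still closed with `<|im_end|>`, which is why continuing that message is treated as a separate control.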
91043be to 6bb8bd4
Summary

Closes #3877.

The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on /v3/chat/completions, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted, which is the building block for assistant prefill.

Changes

- Parse add_generation_prompt in the request handler; store it on the request struct.
- Honor it at every chat-template application site (the MINJA and Python-Jinja paths) instead of the hardcoded add_generation_prompt = true.

Testing

Added tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct:

- default → renders a trailing <|im_start|>assistant
- add_generation_prompt: false → that trailing generation prompt is omitted

Scope note

This implements add_generation_prompt only. Full assistant prefill, that is, continuing from the final assistant message without closing it (continue_final_message in transformers/vLLM), is a separate control that the genai C++ apply_chat_template does not currently expose, and is left as a follow-up.

🤖 Generated with Claude Code
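For reviewers who want to try the new field, the following sketch shows how a client might build the request body and how the documented default (missing field means true) could be resolved. Both helper names are hypothetical, the model name is taken from the verification above, and rejecting non-boolean values is an assumption about validation rather than behavior confirmed by this PR.

```python
import json


def build_chat_request(messages, model, add_generation_prompt=None):
    """Build a JSON body for POST /v3/chat/completions (hypothetical client helper)."""
    body = {"model": model, "messages": messages}
    if add_generation_prompt is not None:
        # Only send the field when the caller sets it, so the server default applies.
        body["add_generation_prompt"] = add_generation_prompt
    return json.dumps(body)


def effective_add_generation_prompt(raw_body):
    """Resolve the flag from a raw request body; a missing field means True."""
    value = json.loads(raw_body).get("add_generation_prompt", True)
    if not isinstance(value, bool):
        # Assumption: non-boolean values are rejected; the PR does not spell this out.
        raise ValueError("add_generation_prompt must be a boolean")
    return value


messages = [{"role": "user", "content": "Hello"}]
default_body = build_chat_request(messages, "HuggingFaceTB/SmolLM2-360M-Instruct")
no_prompt_body = build_chat_request(
    messages, "HuggingFaceTB/SmolLM2-360M-Instruct", add_generation_prompt=False
)
```

Leaving the field out of `default_body` keeps existing clients on the previous behavior, while `no_prompt_body` opts out of the trailing generation prompt explicitly.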