Support add_generation_prompt request parameter for chat completions (#3877) #4331

Open: exzile wants to merge 1 commit into openvinotoolkit:main from exzile:feature/assistant-prefill

exzile commented Jun 26, 2026

Summary

Closes #3877.

The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on /v3/chat/completions, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted — the building block for assistant prefill.
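
As a rough illustration of the request shape, the sketch below builds a JSON body for /v3/chat/completions with the new field set to false. Only the endpoint and the `add_generation_prompt` field come from this PR; the helper `build_chat_request`, the placeholder model name, and the choice to omit the field when it equals the default are assumptions made for the example.

```python
import json


def build_chat_request(messages, add_generation_prompt=True, model="my-model"):
    """Build a chat completions request body (hypothetical helper).

    `model` is a placeholder name. `add_generation_prompt` is the field
    added by this PR; the server treats it as true when it is omitted.
    """
    body = {"model": model, "messages": messages}
    # Only send the field when it differs from the documented default.
    if not add_generation_prompt:
        body["add_generation_prompt"] = False
    return body


messages = [
    {"role": "user", "content": "Name a prime number."},
    {"role": "assistant", "content": "A prime number is"},
]

# Serialized body, ready to POST to /v3/chat/completions.
payload = json.dumps(build_chat_request(messages, add_generation_prompt=False))
```

Leaving the field out when it is true keeps existing clients byte-for-byte compatible, since the default behaviour is unchanged.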

Changes

  • Parse and validate add_generation_prompt in the request handler; store it on the request struct.
  • Honor it at every chat-template application site: the MINJA path (LLM and VLM continuous batching, legacy) and the Python-Jinja path (read from the request body and passed into the template render).
  • Replace the previously hardcoded add_generation_prompt = true with the value from the request.
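
The parse-and-validate step could look roughly like the following. The real handler is C++ and is not reproduced here; this Python sketch only mirrors the described logic (optional field, boolean, default true). The function name, the error message, and the decision to reject non-boolean values rather than coerce them are assumptions.

```python
def parse_add_generation_prompt(body):
    """Return the flag from a decoded request body, defaulting to True.

    Hypothetical helper: `body` is the JSON request decoded into a dict.
    """
    if "add_generation_prompt" not in body:
        # Field omitted: keep the previous behaviour (prompt is added).
        return True
    value = body["add_generation_prompt"]
    # Assumed validation rule: only real booleans are accepted,
    # so strings such as "false" and integers such as 0 are rejected.
    if not isinstance(value, bool):
        raise ValueError("add_generation_prompt must be a boolean")
    return value
```

The returned value would then be stored on the request and passed to each chat-template application site in place of the constant.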

Testing

Added tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct:

  • default → prompt ends with <|im_start|>assistant
  • add_generation_prompt: false → that trailing generation prompt is omitted
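
To make the difference concrete, here is a hand-written renderer for a simplified ChatML-style layout. It is not the actual SmolLM2 chat template (which may, for instance, add a default system message) and does not use MINJA or Jinja; `render_chatml` is a hypothetical function that only reproduces the trailing-prompt behaviour described above.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in a simplified ChatML layout (illustrative only)."""
    parts = []
    for message in messages:
        # Every message, including a final assistant one, is closed
        # with <|im_end|>.
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open a fresh assistant turn for the model to fill in.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
with_prompt = render_chatml(messages)
without_prompt = render_chatml(messages, add_generation_prompt=False)
```

In this sketch `with_prompt` ends with a new `<|im_start|>assistant` turn, while `without_prompt` stops after the existing assistant message, which is still closed by `<|im_end|>`. That closing token is why continuing the final message needs the separate control described in the scope note.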

Scope note

This implements add_generation_prompt only. Full assistant prefill — continuing from the final assistant message without closing it (continue_final_message in transformers/vLLM) — is a separate control that the genai C++ apply_chat_template does not currently expose, and is left as a follow-up.

🤖 Generated with Claude Code

The chat template was always rendered with add_generation_prompt=true,
hardcoded in every servable. This exposes an optional add_generation_prompt
field (bool, default true) on the /v3/chat/completions request, matching
HF transformers and vLLM. When false, the trailing generation prompt is
omitted, which is the building block for assistant prefill.

- Parse add_generation_prompt in the request (openai_api_handler.cpp) and
  store it on the request struct (openai_request.hpp).
- Honor it in all chat-template application sites: the MINJA path (LLM and
  VLM continuous batching, legacy) and the Python-Jinja path (read from the
  request body and pass into the template render).
- Add tests covering default (generation prompt added) and false
  (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct:
default renders a trailing "<|im_start|>assistant", add_generation_prompt=false
omits it.

Note: true assistant prefill (continue_final_message - continuing from the
final assistant message without closing it) is a separate control and is left
as a follow-up.

Implements openvinotoolkit#3877

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
exzile force-pushed the feature/assistant-prefill branch from 91043be to 6bb8bd4 on June 27, 2026 01:04

Development

Successfully merging this pull request may close these issues.

Feature Request: Allow Assistant Messages to Act as Generation Prefix (Prefill)