Support add_generation_prompt request parameter for chat completions (#3877) by exzile · Pull Request #4331 · openvinotoolkit/model_server

exzile · 2026-06-26T20:07:36Z

Summary

Closes #3877.

The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on /v3/chat/completions, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted — the building block for assistant prefill.
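As a rough illustration of the request shape, the sketch below builds a JSON body for /v3/chat/completions with the new field set to false. The helper name `build_chat_request` is hypothetical, and using the Hugging Face model ID as the served model name is an assumption, since that name depends on the deployment's configuration. The body is only constructed and printed, not sent.

```python
import json

ENDPOINT = "/v3/chat/completions"  # path taken from the PR description


def build_chat_request(model, messages, add_generation_prompt=None):
    """Build a chat-completions JSON body (hypothetical helper).

    Leaving add_generation_prompt as None omits the field entirely,
    so the server-side default (true, per this PR) applies.
    """
    body = {"model": model, "messages": messages}
    if add_generation_prompt is not None:
        body["add_generation_prompt"] = add_generation_prompt
    return json.dumps(body)


messages = [
    {"role": "user", "content": "Name one prime number."},
    {"role": "assistant", "content": "Sure, one prime number is"},
]

# Assumption: the model is served under its Hugging Face ID.
payload = build_chat_request(
    "HuggingFaceTB/SmolLM2-360M-Instruct",
    messages,
    add_generation_prompt=False,
)
print(payload)
```

Omitting the field rather than sending `true` keeps existing clients unchanged, which is consistent with the default described above.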

Changes

Parse and validate add_generation_prompt in the request handler; store it on the request struct.
Honor it at every chat-template application site: the MINJA path (LLM and VLM continuous batching, legacy) and the Python-Jinja path (read from the request body and passed into the template render).
Replaces the previously hardcoded add_generation_prompt = true.
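The parse-and-validate step can be sketched in Python as follows. This is only an illustration of the behavior described above (optional bool, default true); the real handler is C++ code in the model server, and the function name and error message here are hypothetical.

```python
def parse_add_generation_prompt(body: dict) -> bool:
    """Return the add_generation_prompt flag from a decoded request body.

    Hypothetical sketch: a missing field falls back to True, and any
    non-boolean value is rejected instead of being coerced.
    """
    if "add_generation_prompt" not in body:
        return True  # default stated in the PR description
    value = body["add_generation_prompt"]
    if not isinstance(value, bool):
        # The error text is an assumption, not the server's actual message.
        raise ValueError("add_generation_prompt must be a boolean")
    return value


print(parse_add_generation_prompt({}))
print(parse_add_generation_prompt({"add_generation_prompt": False}))
```

Rejecting strings such as "false" outright, rather than coercing them, avoids silently rendering the wrong prompt when a client sends a malformed value.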

Testing

Added tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct:

default → prompt ends with <|im_start|>assistant
add_generation_prompt: false → that trailing generation prompt is omitted
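The difference between the two renderings can be reproduced with a simplified, hand-written ChatML-style renderer. This is not the actual SmolLM2 template (which may, for example, inject a default system message); it only assumes the `<|im_start|>` / `<|im_end|>` turn markers seen in the verification above.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in a simplified ChatML-style layout (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Opens a new assistant turn for the model to fill in.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]

with_prompt = render_chatml(messages)
without_prompt = render_chatml(messages, add_generation_prompt=False)

print(repr(with_prompt))
print(repr(without_prompt))
```

With the flag set to false, the rendered prompt ends with the closed assistant message and no new assistant turn is opened.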

Scope note

This implements add_generation_prompt only. Full assistant prefill — continuing from the final assistant message without closing it (continue_final_message in transformers/vLLM) — is a separate control that the genai C++ apply_chat_template does not currently expose, and is left as a follow-up.
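To make the distinction concrete, the sketch below contrasts omitting the generation prompt with leaving the final assistant message open. It is a hand-written ChatML-style approximation, not the transformers or vLLM implementation; the function name `render_chatml_prefill` and the layout are assumptions for illustration.

```python
def render_chatml_prefill(messages, continue_final_message=False):
    """Render ChatML-style turns, optionally leaving the last one open.

    Illustrative only: with continue_final_message=True the final message
    is not closed with <|im_end|>, so generation would continue its text.
    """
    rendered = []
    last_index = len(messages) - 1
    for i, m in enumerate(messages):
        turn = f"<|im_start|>{m['role']}\n{m['content']}"
        if not (continue_final_message and i == last_index):
            turn += "<|im_end|>\n"
        rendered.append(turn)
    return "".join(rendered)


messages = [
    {"role": "user", "content": "Name one prime number."},
    {"role": "assistant", "content": "Sure, one prime number is"},
]

closed = render_chatml_prefill(messages)  # what add_generation_prompt=false alone gives
opened = render_chatml_prefill(messages, continue_final_message=True)

print(repr(closed))
print(repr(opened))
```

In the closed form the assistant turn already ends with `<|im_end|>`, so there is nothing left to continue; this is why add_generation_prompt=false by itself does not amount to full prefill.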

🤖 Generated with Claude Code

The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on the /v3/chat/completions request, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted, which is the building block for assistant prefill.

- Parse add_generation_prompt in the request (openai_api_handler.cpp) and store it on the request struct (openai_request.hpp).
- Honor it in all chat-template application sites: the MINJA path (LLM and VLM continuous batching, legacy) and the Python-Jinja path (read from the request body and pass into the template render).
- Add tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct: default renders a trailing "<|im_start|>assistant", add_generation_prompt=false omits it.

Note: true assistant prefill (continue_final_message - continuing from the final assistant message without closing it) is a separate control and is left as a follow-up.

Implements openvinotoolkit#3877

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

exzile mentioned this pull request Jun 26, 2026

Feature Request: Allow Assistant Messages to Act as Generation Prefix (Prefill) #3877

Open

exzile force-pushed the feature/assistant-prefill branch from 91043be to 6bb8bd4 on June 27, 2026 01:04

exzile wants to merge 1 commit into openvinotoolkit:main from exzile:feature/assistant-prefill
