feat(serve): add --generation-config CLI for server sampling defaults #4708

Open
lvhan028 wants to merge 1 commit into
InternLM:mainfrom
lvhan028:feat/generation-config-cli

@lvhan028 lvhan028 commented Jun 25, 2026

Collaborator

Motivation

LMDeploy's OpenAI-compatible API server previously relied on hard-coded protocol defaults for sampling parameters (e.g. temperature=0.7, top_k=40). This made it difficult to align server behavior with a model's own HuggingFace generation_config.json, which many models ship with recommended decoding settings.

vLLM exposes a similar --generation-config flag to load HF generation defaults at startup. This PR brings comparable behavior to LMDeploy:

  • Load a model's HF generation_config.json as server-side defaults when users omit sampling fields in requests.
  • Allow opting out via --generation-config lmdeploy to keep LMDeploy/GenerationConfig defaults.
  • Support loading from a custom folder path when needed.

To support proper merge semantics, protocol sampling fields are changed to default to None (meaning "not specified by the user") instead of fixed values. Unspecified fields fall back to HF defaults (if loaded) and then to GenerationConfig dataclass defaults.
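The merge order described above (request value, then HF default, then dataclass default) can be sketched as follows. This is a minimal illustration, not the PR's actual code: `SketchGenConfig` and `merge_sampling` are hypothetical names, and only the `temperature` and `top_k` defaults are quoted from this description.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class SketchGenConfig:
    # Hypothetical stand-in for LMDeploy's GenerationConfig.
    temperature: float = 0.8  # default quoted in this PR description
    top_k: int = 50           # default quoted in this PR description
    top_p: float = 1.0        # illustrative value, not taken from the PR


def merge_sampling(
    request_fields: Dict[str, Any],
    hf_defaults: Optional[Dict[str, Any]] = None,
) -> SketchGenConfig:
    """Merge with priority: request > HF config > dataclass defaults."""
    known = {f.name for f in fields(SketchGenConfig)}
    merged: Dict[str, Any] = {}
    # Later sources overwrite earlier ones, so the request is applied last.
    for source in (hf_defaults or {}, request_fields):
        for key, value in source.items():
            # None means "not specified"; unknown keys are ignored.
            if key in known and value is not None:
                merged[key] = value
    return SketchGenConfig(**merged)
```

With this shape, a request that sends only `top_k` keeps the HF `temperature`, and any field that neither source sets falls through to the dataclass default.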

Modification

CLI

  • Add --generation-config to lmdeploy serve api_server (default: auto).
    • auto: load generation_config.json from the model path via HuggingFace GenerationConfig.from_pretrained().
    • lmdeploy: do not load HF config; use LMDeploy defaults only.
    • <path>: load from a custom directory.
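The three accepted values could be resolved to a config directory roughly like this. The sketch uses only `argparse`; `build_parser` and `resolve_generation_config_dir` are hypothetical helpers, and the real CLI wiring in LMDeploy may differ.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real `lmdeploy serve api_server` has many more flags.
    parser = argparse.ArgumentParser(prog="api_server_sketch")
    parser.add_argument("model_path")
    parser.add_argument(
        "--generation-config",
        default="auto",
        help="'auto', 'lmdeploy', or a directory containing generation_config.json",
    )
    return parser


def resolve_generation_config_dir(value: str, model_path: str) -> Optional[str]:
    """Return the directory to load HF defaults from, or None to skip loading."""
    if value == "lmdeploy":
        return None        # keep LMDeploy defaults only
    if value == "auto":
        return model_path  # look next to the model weights
    return value           # treat anything else as a custom directory
```

A caller would pass the returned directory to HuggingFace `GenerationConfig.from_pretrained()` when it is not `None`.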

Core module (lmdeploy/serve/core/generation_config.py)

Introduce a small helper module

API server integration

  • Parse --generation-config once at startup and store the result in VariableInterface.default_gen_config.
  • Route chat completions, completions, generate, Responses API, and Anthropic endpoints through build_generation_config() for consistent merge behavior.
  • Remove redundant same-name pass-through kwargs at call sites; keep only renamed (stop → stop_words, seed → random_seed), computed (logprobs), or raw-json fields (migration_request, with_cache, preserve_cache) in extra_kwargs.

Protocol changes

  • Set sampling-related fields in OpenAI/Responses/Anthropic protocols to None defaults (e.g. temperature, top_p, top_k, repetition_penalty, min_p) so "user did not send" can be distinguished from an explicit value.

Behavior notes

  • When --generation-config lmdeploy and the user omits sampling params, defaults come from GenerationConfig (temperature=0.8, top_k=50, etc.), not the old protocol defaults (0.7 / 40).
  • max_new_tokens is not taken from HF config; it is resolved from max_completion_tokens / max_tokens on the request, with engine-level fallback when unset.
  • do_sample=True is always set for serving requests.
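The last two notes can be illustrated with a small sketch. The helper names are hypothetical, and the precedence of `max_completion_tokens` over `max_tokens` is an assumption; the description only says both are consulted.

```python
from typing import Any, Dict, Optional


def resolve_max_new_tokens(
    max_completion_tokens: Optional[int],
    max_tokens: Optional[int],
) -> Optional[int]:
    # Assumed precedence: max_completion_tokens first, then max_tokens.
    if max_completion_tokens is not None:
        return max_completion_tokens
    # None here means "unset", leaving the limit to the engine-level fallback.
    return max_tokens


def serving_overrides(request: Dict[str, Any]) -> Dict[str, Any]:
    """Fields that come from the request or the server, never from the HF config."""
    return {
        "max_new_tokens": resolve_max_new_tokens(
            request.get("max_completion_tokens"),
            request.get("max_tokens"),
        ),
        "do_sample": True,  # always enabled for serving requests
    }
```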

Copilot AI review requested due to automatic review settings June 25, 2026 08:58

This comment was marked as outdated.

@lvhan028 lvhan028 force-pushed the feat/generation-config-cli branch from 4d9cfa8 to 1cb9465 Compare July 2, 2026 03:29
Load HuggingFace generation_config.json as server-side defaults when
requests omit sampling fields, with merge priority request > HF config >
GenerationConfig defaults. Filter unsupported HF keys before building
GenerationConfig, extract explicit request overrides via exclude_unset,
and align /generate sampling protocol defaults with other endpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Comment thread lmdeploy/serve/core/generation_config.py
    # while leaving plain Pydantic defaults available for server defaults.
    return {
        key: value
        for key, value in request.model_dump(exclude_unset=True).items()

Collaborator

Since ResponsesRequest uses extra='allow' at lmdeploy/serve/openai/responses/protocol.py:36, /v1/responses clients can send GenerationConfig fields that the endpoint does not support, such as return_ppl, with_cache, migration_request, output_logits, or stop_token_ids, and these flow into the engine. I confirmed that return_ppl=True, with_cache=True, and stop_token_ids=[1] become active in GenerationConfig. Please restrict extraction to declared request fields or an endpoint-specific allowlist.
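One possible shape for the endpoint-specific allowlist is sketched below. It treats the request as a plain dict rather than a Pydantic model to stay self-contained; `RESPONSES_SAMPLING_ALLOWLIST` and `extract_request_overrides` are hypothetical names, and the field set is only the sampling fields listed in the PR description.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical allowlist for /v1/responses, limited to the sampling
# fields this PR description names as None-defaulted.
RESPONSES_SAMPLING_ALLOWLIST: FrozenSet[str] = frozenset(
    {"temperature", "top_p", "top_k", "repetition_penalty", "min_p"}
)


def extract_request_overrides(
    payload: Dict[str, Any],
    allowlist: FrozenSet[str] = RESPONSES_SAMPLING_ALLOWLIST,
) -> Dict[str, Any]:
    """Keep only allowlisted fields the client explicitly set to a value."""
    return {
        key: value
        for key, value in payload.items()
        if key in allowlist and value is not None
    }
```

With a filter like this, extra keys such as `return_ppl` or `stop_token_ids` are dropped before anything reaches `GenerationConfig`, regardless of what the request model accepts.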

from agent.
