feat(serve): add --generation-config CLI for server sampling defaults #4708
Open
lvhan028 wants to merge 1 commit into
4d9cfa8 to 1cb9465
Load HuggingFace generation_config.json as server-side defaults when requests omit sampling fields, with merge priority request > HF config > GenerationConfig defaults. Filter unsupported HF keys before building GenerationConfig, extract explicit request overrides via exclude_unset, and align /generate sampling protocol defaults with other endpoints. Co-authored-by: Cursor <cursoragent@cursor.com>
7e46bed to 2ba421b
grimoire reviewed Jul 2, 2026
    # while leaving plain Pydantic defaults available for server defaults.
    return {
        key: value
        for key, value in request.model_dump(exclude_unset=True).items()
Collaborator
Since ResponsesRequest uses extra='allow' at lmdeploy/serve/openai/responses/protocol.py:36, /v1/responses clients can send unsupported GenerationConfig fields like return_ppl, with_cache, migration_request, output_logits, or stop_token_ids, and they flow into the engine. I confirmed return_ppl=True, with_cache=True, stop_token_ids=[1] becomes active in GenerationConfig. Please restrict extraction to declared request fields or an endpoint-specific allowlist.
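One way to read the allowlist suggestion above is sketched below. This is a minimal illustration, not the PR's code: the function name extract_allowed_overrides and the constant ALLOWED_SAMPLING_FIELDS are hypothetical, the allowlist contents are only the sampling fields named in the PR description, and the input dict stands in for the result of request.model_dump(exclude_unset=True).

```python
from typing import Any, Dict, FrozenSet

# Hypothetical per-endpoint allowlist; a real one would list exactly the
# sampling fields that the endpoint's request model declares.
ALLOWED_SAMPLING_FIELDS: FrozenSet[str] = frozenset(
    {"temperature", "top_p", "top_k", "repetition_penalty", "min_p"}
)


def extract_allowed_overrides(
    explicit_fields: Dict[str, Any],
    allowed: FrozenSet[str] = ALLOWED_SAMPLING_FIELDS,
) -> Dict[str, Any]:
    """Keep only explicitly sent fields that are on the allowlist.

    `explicit_fields` stands in for request.model_dump(exclude_unset=True),
    which may also carry undeclared extras when the model allows them.
    """
    return {key: value for key, value in explicit_fields.items() if key in allowed}


# Extras such as return_ppl or stop_token_ids are dropped before they can
# reach the engine-side config.
payload = {"temperature": 0.2, "return_ppl": True, "with_cache": True, "stop_token_ids": [1]}
overrides = extract_allowed_overrides(payload)
```

With this shape, a field sent through extra='allow' is ignored unless the endpoint opts in to it explicitly.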
from agent.
Motivation
LMDeploy's OpenAI-compatible API server previously relied on hard-coded protocol defaults for sampling parameters (e.g. temperature=0.7, top_k=40). This made it difficult to align server behavior with a model's own HuggingFace generation_config.json, which many models ship with recommended decoding settings.

vLLM exposes a similar --generation-config flag to load HF generation defaults at startup. This PR brings comparable behavior to LMDeploy:
- Load generation_config.json as server-side defaults when users omit sampling fields in requests.
- Pass --generation-config lmdeploy to keep LMDeploy/GenerationConfig defaults.

To support proper merge semantics, protocol sampling fields are changed to default to None (meaning "not specified by the user") instead of fixed values. Unspecified fields fall back to HF defaults (if loaded) and then to GenerationConfig dataclass defaults.

Modification
CLI
- Add --generation-config to lmdeploy serve api_server (default: auto).
  - auto: load generation_config.json from the model path via HuggingFace GenerationConfig.from_pretrained().
  - lmdeploy: do not load HF config; use LMDeploy defaults only.
  - <path>: load from a custom directory.

Core module (lmdeploy/serve/core/generation_config.py)
Introduce a small helper module
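The helper module's contents are not shown on this page. As a rough sketch of the merge priority stated in the commit message (request > HF config > GenerationConfig defaults), including the filtering of unsupported HF keys, something like the following could work. All names here (SketchGenerationConfig, filter_supported, merge_sampling_params) are hypothetical stand-ins rather than the PR's API, and the dataclass carries only the two default values quoted in this PR.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class SketchGenerationConfig:
    """Hypothetical stand-in for GenerationConfig (assumed defaults from this PR)."""
    temperature: float = 0.8
    top_k: int = 50


def filter_supported(params: Dict[str, Any]) -> Dict[str, Any]:
    """Drop keys the stand-in config does not declare, and values left as None."""
    supported = {f.name for f in fields(SketchGenerationConfig)}
    return {k: v for k, v in params.items() if k in supported and v is not None}


def merge_sampling_params(
    request_overrides: Dict[str, Any],
    hf_defaults: Optional[Dict[str, Any]] = None,
) -> SketchGenerationConfig:
    """Merge with priority: request > HF config > dataclass defaults."""
    merged = filter_supported(hf_defaults or {})
    # Later updates win, so explicit request values override HF defaults.
    merged.update(filter_supported(request_overrides))
    # Anything still missing falls back to the dataclass defaults.
    return SketchGenerationConfig(**merged)


# HF config with one key the stand-in does not support.
hf = {"temperature": 0.6, "top_k": 20, "transformers_version": "4.40.0"}
# The request sets temperature and leaves top_k unspecified (None).
cfg = merge_sampling_params({"temperature": 0.2, "top_k": None}, hf)
```

Here temperature comes from the request and top_k from the HF config; with no HF config loaded, both would fall back to the dataclass defaults.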
API server integration
- Resolve --generation-config once at startup and store the result in VariableInterface.default_gen_config.
- Use build_generation_config() for consistent merge behavior.
- Pass renamed (stop→stop_words, seed→random_seed), computed (logprobs), or raw-json fields (migration_request, with_cache, preserve_cache) in extra_kwargs.

Protocol changes
- Sampling fields use None defaults (e.g. temperature, top_p, top_k, repetition_penalty, min_p) so "user did not send" can be distinguished from an explicit value.

Behavior notes
- With --generation-config lmdeploy, when the user omits sampling params, defaults come from GenerationConfig (temperature=0.8, top_k=50, etc.), not the old protocol defaults (0.7/40).
- max_new_tokens is not taken from HF config; it is resolved from max_completion_tokens / max_tokens on the request, with engine-level fallback when unset.
- do_sample=True is always set for serving requests.
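The max_new_tokens note can be illustrated with a small sketch. The function name resolve_max_new_tokens is hypothetical, and the precedence of max_completion_tokens over max_tokens is an assumption taken from the order in which the note lists them; returning None is used here only to signal that the engine-level fallback should apply.

```python
from typing import Optional


def resolve_max_new_tokens(
    max_completion_tokens: Optional[int],
    max_tokens: Optional[int],
) -> Optional[int]:
    """Return the first request-level limit that is set.

    Assumption: max_completion_tokens is checked before max_tokens.
    None means neither was sent, so the engine-level fallback applies.
    """
    for limit in (max_completion_tokens, max_tokens):
        if limit is not None:
            return limit
    return None


both_set = resolve_max_new_tokens(256, 1024)
only_legacy = resolve_max_new_tokens(None, 1024)
unset = resolve_max_new_tokens(None, None)
```

Note that the HF config is never consulted in this sketch, matching the statement that max_new_tokens is not taken from it.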