Conversation
Pull request overview
This PR aims to extend stop-word handling to support multiple stop words and multi-token stop sequences across LMDeploy’s configs and runtime (TurboMind, PyTorch, and streaming AsyncEngine).
Changes:
- Normalize/convert stop words into token-id sequences (`list[list[int]]`) and update tokenization helpers accordingly.
- Update TurboMind and PyTorch pipelines to carry multi-stop metadata (packed stop/bad words for TM; `stop_word_lens` for PyTorch).
- Add streaming-time multi-stop detection/holdback logic in `AsyncEngine.generate()`.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| lmdeploy/utils.py | Change _stop_words to return token-id sequences instead of a packed numpy structure. |
| lmdeploy/turbomind/turbomind.py | Pack stop/bad word sequences into TurboMind’s (tokens, offsets) format; adjust eos/stop/bad wiring. |
| lmdeploy/tokenizer.py | Make indexes_containing_token() return empty for multi-token encodings so callers can fall back to encode(). |
| lmdeploy/serve/core/async_engine.py | Implement streaming multi-stop matching with tail holdback; refactor stop-id handling. |
| lmdeploy/pytorch/strategies/dllm/sampling.py | Propagate renamed sampling input attribute from stop_mask to stop_word_lens. |
| lmdeploy/pytorch/strategies/ar/sampling.py | Build batched stop-word tensors for multi-token sequences plus per-sequence lengths. |
| lmdeploy/pytorch/strategies/ar/model_agent.py | Add vectorized multi-token stop detection spanning decode steps via a maintained tail. |
| lmdeploy/pytorch/messages.py | Update PyTorch SamplingParam.stop_words to be sequence-based and adjust ignore_eos behavior. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Pass output_token_ids into stopping criteria and thread through stop_word_lens. |
| lmdeploy/pytorch/engine/logits_process.py | Update SamplingInputs to use stop_word_lens and only mask single-token stops during ignore-eos. |
| lmdeploy/messages.py | Change GenerationConfig.stop_token_ids to support nested sequences; add normalization and updated conversion logic. |
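The streaming "tail holdback" change in `async_engine.py` can be sketched roughly as follows. This is a minimal illustration, not the actual implementation; the function name and return shape are assumptions. The idea: hold back the longest suffix of the decoded buffer that is still a proper prefix of some stop sequence, so a multi-token stop never leaks partially to the client.

```python
def split_emittable(tokens: list[int], stop_seqs: list[list[int]]) -> tuple[list[int], list[int]]:
    """Split a decoded token buffer into (emit, holdback).

    Hold back the longest suffix of `tokens` that is a proper prefix of any
    stop sequence; everything before it is safe to stream immediately.
    Single-token stops are never held back (a full match means "stop").
    """
    max_hold = 0
    for seq in stop_seqs:
        # Only proper prefixes (k < len(seq)); a full match is handled by
        # the stop check itself, not by holdback.
        for k in range(1, min(len(seq) - 1, len(tokens)) + 1):
            if tokens[-k:] == seq[:k]:
                max_hold = max(max_hold, k)
    if max_hold == 0:
        return tokens, []
    return tokens[:-max_hold], tokens[-max_hold:]
```

For example, with stop sequence `[1, 2, 3]` and buffer `[9, 1, 2]`, only `[9]` is emitted while `[1, 2]` is held until the next token either completes or breaks the match.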
Comments suppressed due to low confidence (1)
lmdeploy/serve/core/async_engine.py:383
`generate()` now calls `_determine_gen_config(session, input_ids, ...)` before `input_ids` is populated from `messages` via `prompt_processor.get_prompt_input()`. When `messages` is provided (the common path), `input_ids` is still `None` here and `_determine_gen_config` will raise when it does `len(input_ids)` and other processing. Move this call back to after `input_ids` is available, or change `_determine_gen_config` to accept `None` and defer `max_new_tokens` computation until after tokenization.
```python
gen_config = self._determine_gen_config(session, input_ids, gen_config=gen_config)
if messages:
    prompt = messages
    self.request_logger.log_prompt(session, prompt=prompt)
    prompt_input = await self.prompt_processor.get_prompt_input(prompt=prompt,
                                                                do_preprocess=do_preprocess,
```
```python
# This assumes the engine will stop when stop token is hit
if output_len and outputs.token_ids[-1] in stop_ids:
    # print(f'outputs.token_ids: {outputs.token_ids}')
```
Please remove the commented-out debug statement (# print(f'outputs.token_ids: ...')). Leaving debug prints in production code adds noise and can confuse future readers when troubleshooting stop-word logic.
```diff
-    # print(f'outputs.token_ids: {outputs.token_ids}')
```
```python
# stopping criteria
# Use output_token_ids (all tokens accepted this step) so that multi-token
# stop sequences whose last token is not the final spec-decoded token are
# detected correctly. For non-spec AR, output_token_ids == next_token_ids.
stopped, stop_pos, stopping_criteria = stopping_criteria.step(
    next_token_ids,
    output_token_ids,
    sampling_inputs.stop_words,
    inputs=inputs,
    extra_inputs=extra_inputs,
    stop_word_lens=sampling_inputs.stop_word_lens,
)
```
stopping_criteria.step(...) is now called with the new keyword argument stop_word_lens=..., but most StoppingCriteria implementations (e.g. ARSpecStoppingCriteria.step, DLLMStoppingCriteria.step, and the abstract base signature) do not accept this parameter. This will raise TypeError: got an unexpected keyword argument 'stop_word_lens' when using non-AR strategies. Either update the base StoppingCriteria.step contract (and all implementations) to accept stop_word_lens (or **kwargs), or only pass stop_word_lens when the concrete stopping criteria supports it.
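One way to avoid this `TypeError`, sketched below with hypothetical class names (not the actual LMDeploy classes), is to make the base contract absorb strategy-specific keyword arguments via `**kwargs`, so implementations that don't use `stop_word_lens` simply ignore it:

```python
from abc import ABC, abstractmethod

class StoppingCriteriaBase(ABC):
    """Hypothetical base: subclasses take **kwargs and ignore unknown ones."""

    @abstractmethod
    def step(self, next_token_ids, stop_words, **kwargs):
        ...

class SpecCriteria(StoppingCriteriaBase):
    def step(self, next_token_ids, stop_words, **kwargs):
        # This strategy ignores stop_word_lens entirely.
        return bool(stop_words) and next_token_ids[-1] in stop_words

crit = SpecCriteria()
# Passing the new kwarg no longer raises TypeError for strategies that ignore it.
stopped = crit.step([1, 2, 3], [3], stop_word_lens=[1])
```

The alternative of passing `stop_word_lens` only when the concrete criteria supports it keeps signatures strict but moves the dispatch burden to every call site.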
```python
def _stop_words(stop_words: list[int | str], tokenizer: object) -> list[list[int]] | None:
    """Convert chat-template stop words to List[List[int]].

    Each element is a token ID sequence representing one stop word. Single-token matches from vocab scan produce
    length-1 lists. Multi-token words that require encoding produce longer lists.
    """
    if stop_words is None:
        return None
    assert isinstance(stop_words, list) and \
```
The _stop_words parameter type is declared as list[int | str], but the function explicitly supports None (if stop_words is None: return None) and is called with self.chat_template.stop_words which can be None. Please update the signature to reflect this (e.g., stop_words: list[int | str] | None) so callers and type-checkers match runtime behavior.
```python
if key not in seen:
    seen.add(key)
    deduped.append(seq)
self.stop_token_ids = deduped or None
```
GenerationConfig.convert_stop_bad_words_to_ids() now mutates self.stop_token_ids to a nested list[list[int]]. Since GenerationConfig is a public config object (and previously exposed stop_token_ids as List[int]), this is a backward-incompatible behavioral change for downstream code that inspects or serializes stop_token_ids. Consider either (a) keeping stop_token_ids externally as list[int] when only single-token stops are present, (b) introducing a new field for multi-token stop sequences, or (c) clearly documenting this change and providing a helper accessor that returns the legacy flattened single-token view.
```diff
-        self.stop_token_ids = deduped or None
+        # Preserve backward-compatible shape: use flat list[int] when only
+        # single-token stop sequences are present; otherwise keep nested list[list[int]].
+        if deduped and all(len(seq) == 1 for seq in deduped):
+            self.stop_token_ids = [seq[0] for seq in deduped]
+        else:
+            self.stop_token_ids = deduped or None
```
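Option (c) from the comment above could look like the following sketch. The helper name is an assumption for illustration; it exposes the legacy flat `list[int]` view without changing the nested storage:

```python
def legacy_stop_token_ids(stop_token_ids):
    """Flatten nested stop sequences to the legacy list[int] view.

    Multi-token sequences have no legacy equivalent and are skipped, so
    callers relying on the old shape only see single-token stops.
    """
    if stop_token_ids is None:
        return None
    flat = [seq[0] for seq in stop_token_ids if len(seq) == 1]
    return flat or None
```

For example, `legacy_stop_token_ids([[2], [5, 6], [7]])` yields `[2, 7]`: the multi-token `[5, 6]` has no flat representation and is dropped from the legacy view.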
```python
# During min_new_tokens period suppress single-token stops as bad words.
bw = bw + [s[0] for s in sw if len(s) == 1]
bad_words[idx] = bw
stop_words[idx] = sw
```
During the min_new_tokens period, the code only suppresses single-token stop sequences by adding them to bad_words. Multi-token stop sequences are not suppressed, so the model can still generate them and StoppingCriteria.step() can stop early, violating the min_new_tokens guarantee. Consider disabling stop-word checks entirely until seq.num_new_tokens >= min_new_tokens (or implementing a multi-token-aware suppression strategy) so min_new_tokens is honored for all stop sequences.
```diff
-            # During min_new_tokens period suppress single-token stops as bad words.
-            bw = bw + [s[0] for s in sw if len(s) == 1]
-            bad_words[idx] = bw
-            stop_words[idx] = sw
+            # During min_new_tokens period suppress single-token stops as bad words
+            # and disable stop-word based early stopping entirely.
+            bw = bw + [s[0] for s in sw if len(s) == 1]
+            effective_sw = []
+        else:
+            effective_sw = sw
+        bad_words[idx] = bw
+        stop_words[idx] = effective_sw
```
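The gating idea in the suggestion can be reduced to a tiny sketch (names are hypothetical, not LMDeploy's API): return no stop sequences at all while the sequence is still under `min_new_tokens`, so neither single-token nor multi-token stops can end generation early.

```python
def effective_stop_words(stop_words: list[list[int]], num_new_tokens: int, min_new_tokens: int) -> list[list[int]]:
    """Return the stop sequences the stopping criteria should check this step.

    While under min_new_tokens, return no stop sequences; single-token stops
    would additionally be suppressed as bad words during sampling.
    """
    if num_new_tokens < min_new_tokens:
        return []
    return stop_words
```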
```diff
     n: int = 1
-    max_new_tokens: int = 512
+    max_new_tokens: int = None
```
GenerationConfig.max_new_tokens now defaults to None, but other code paths construct GenerationConfig() and immediately treat max_new_tokens as an int. For example, lmdeploy/pytorch/engine/engine_instance.py creates gen_config = gen_config or GenerationConfig() and SamplingParam.from_gen_config() then does numeric comparisons (max_new_tokens < 0, etc.), which will raise a TypeError when max_new_tokens is None. Either keep the default as an int (e.g., 512) or update all consumers (not just AsyncEngine._determine_gen_config) to handle None safely.
```diff
-    max_new_tokens: int = None
+    max_new_tokens: int | None = 512
```
```diff
     dist_init_addr: str | None = None
     devices: list[int] = None
     session_len: int | None = None
     max_batch_size: int = None
     cache_max_entry_count: float = 0.8
     cache_chunk_size: int = -1
     cache_block_seq_len: int = 64
     enable_prefix_caching: bool = False
     quant_policy: int = 0
     rope_scaling_factor: float = 0.0
     use_logn_attn: bool = False
-    download_dir: Optional[str] = None
-    revision: Optional[str] = None
+    download_dir: str | None = None
+    revision: str | None = None
     max_prefill_token_num: int = 8192
     num_tokens_per_iter: int = 0
     max_prefill_iters: int = 1
     async_: int = 1
-    devices: Optional[List[int]] = None
+    devices: list[int] | None = None
     empty_init: bool = False
```
TurbomindEngineConfig defines devices twice (once at line 315 and again at line 331). In a dataclass, the later field overwrites the earlier one, which is error-prone and can lead to confusing defaults/type hints. Remove the duplicate and keep a single devices definition with the intended type/default.
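A small standalone demo of why the duplicate field is error-prone (field names here are illustrative, not the real config): in a dataclass body, a re-annotated name produces a single field whose default comes from the later definition, while keeping the earlier position in field order, so the first definition is silently shadowed.

```python
from dataclasses import dataclass, fields

@dataclass
class DuplicatedField:
    devices: tuple = ()        # first definition
    session_len: int = 2048
    devices: tuple = (0,)      # later definition silently wins

# Only one `devices` field survives, and its default is the later one.
names = [f.name for f in fields(DuplicatedField)]
default_devices = DuplicatedField().devices
```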
```python
if tail_len <= 0:
    return None
if self.stop_tail is None:
    return torch.zeros(batch_size, tail_len, dtype=torch.long, device=device)
```
When stop_tail is None, _get_prev_tail initializes the history tail with zeros. This can cause false positive multi-token stop matches involving token id 0 before any real tokens exist (especially in the first few decode steps). Use a sentinel that cannot match real tokens (e.g., -1) for the initial tail, consistent with later padding logic that uses value=-1.
```diff
-            return torch.zeros(batch_size, tail_len, dtype=torch.long, device=device)
+            # Initialize with sentinel -1 so no real token id (>=0) can match,
+            # consistent with later padding logic that uses value=-1.
+            return torch.full((batch_size, tail_len), -1, dtype=torch.long, device=device)
```
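The false positive is easy to reproduce in plain Python (no tensors needed; `tail_matches` is an illustrative stand-in, not the real detection code): a zero-filled history tail is indistinguishable from real token id 0, while a -1 sentinel can never match a valid vocabulary id.

```python
def tail_matches(tail: list[int], new_tokens: list[int], stop_seq: list[int]) -> bool:
    """Check whether tail + new_tokens ends with stop_seq."""
    window = (tail + new_tokens)[-len(stop_seq):]
    return window == stop_seq

stop_seq = [0, 42]            # a stop sequence that begins with token id 0
zero_tail = [0, 0]            # zero-initialized tail: zeros look like real tokens
sentinel_tail = [-1, -1]      # -1 cannot equal any real token id (ids are >= 0)

false_positive = tail_matches(zero_tail, [42], stop_seq)   # spurious stop
correct = tail_matches(sentinel_tail, [42], stop_seq)      # no match
```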
```diff
+    @staticmethod
+    def _normalize_stop_token_ids(ids: list[int] | list[list[int]] | None) -> list[list[int]]:
+        """Normalize stop_token_ids to list[list[int]]."""
+        if ids is None:
+            return []
+        out: list[list[int]] = []
+        for item in ids:
+            if isinstance(item, int):
+                out.append([item])
+            else:
+                out.append(list(item))
+        return out
+
     def convert_stop_bad_words_to_ids(self, tokenizer: Tokenizer):
-        """Convert stop_words/bad_sords to ids and append the ids to
+        """Convert stop_words/bad_words to ids and append the ids to
         stop_token_ids/bad_token_ids."""

-        def special_word_token_ids(words):
-            if words is not None:
-                assert isinstance(words, List) and \
-                    all(isinstance(elem, str) for elem in words), \
-                    f'stop_words must be a list of str but got {type(words)}'
-                indexes = []
-                for word in words:
-                    indexes += tokenizer.indexes_containing_token(word)
-                return indexes
-            return None
-
-        stop_token_ids = special_word_token_ids(self.stop_words) or []
-        bad_token_ids = special_word_token_ids(self.bad_words) or []
-        stop_token_ids.extend(self.stop_token_ids or [])
-        bad_token_ids.extend(self.bad_token_ids or [])
-        self.stop_token_ids = list(set(stop_token_ids)) or None
-        self.bad_token_ids = list(set(bad_token_ids)) or None
+        def words_to_token_seqs(words: list[str]) -> list[list[int]]:
+            assert isinstance(words, list) and \
+                all(isinstance(elem, str) for elem in words), \
+                f'stop_words must be a list of str but got {type(words)}'
+            seqs: list[list[int]] = []
+            for word in words:
+                single_matches = tokenizer.indexes_containing_token(word)
+                if single_matches:
+                    for idx in single_matches:
+                        seqs.append([idx])
+                else:
+                    encoded = tokenizer.encode(word, add_bos=False)
+                    if encoded:
+                        seqs.append(encoded)
+            return seqs
+
+        stop_seqs = words_to_token_seqs(self.stop_words) if self.stop_words else []
+        bad_seqs = words_to_token_seqs(self.bad_words) if self.bad_words else []
+
+        stop_seqs.extend(self._normalize_stop_token_ids(self.stop_token_ids))
+        bad_seqs.extend([[i] for i in (self.bad_token_ids or [])])
+
+        # deduplicate stop_token_ids and bad_token_ids
+        seen = set()
+        deduped: list[list[int]] = []
+        for seq in stop_seqs:
+            key = tuple(seq)
+            if key not in seen:
+                seen.add(key)
+                deduped.append(seq)
+        self.stop_token_ids = deduped or None
```
This PR changes GenerationConfig.stop_token_ids to be normalized/stored as list[list[int]] (to support multi-token stop sequences). There are existing unit tests (e.g. tests/test_lmdeploy/test_messages.py::test_engine_generation_config) that assert stop_token_ids is List[int], so the current test suite will fail and there are no new tests covering multi-token stop sequences. Please update the existing tests and add coverage for: (1) mixed stop_token_ids input ([int] and [[...]]), (2) multi-token stop_words encoding path, and (3) streaming stop behavior (holdback) if applicable.
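A sketch of the requested coverage for case (1), mixed `stop_token_ids` input. The helper here is a hypothetical stand-in mirroring a normalize step like `_normalize_stop_token_ids`, not the actual LMDeploy code, so the real tests would exercise `GenerationConfig` directly:

```python
def normalize_stop_token_ids(ids):
    """Normalize mixed [int] / [[int, ...]] input to list[list[int]]."""
    if ids is None:
        return []
    return [[item] if isinstance(item, int) else list(item) for item in ids]

def test_mixed_stop_token_ids():
    # bare ints and nested sequences may be freely mixed
    assert normalize_stop_token_ids([1, [2, 3], 4]) == [[1], [2, 3], [4]]

def test_none_and_empty():
    assert normalize_stop_token_ids(None) == []
    assert normalize_stop_token_ids([]) == []
```

Cases (2) and (3) would additionally need a tokenizer fixture and a streaming engine stub, respectively.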
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and seek help from the maintainers.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist