Conversation
Pull request overview
This PR aims to extend stop-word handling to support multiple stop words and multi-token stop sequences across LMDeploy’s configs and runtime (TurboMind, PyTorch, and streaming AsyncEngine).
Changes:
- Normalize/convert stop words into token-id sequences (`list[list[int]]`) and update tokenization helpers accordingly.
- Update TurboMind and PyTorch pipelines to carry multi-stop metadata (packed stop/bad words for TM; `stop_word_lens` for PyTorch).
- Add streaming-time multi-stop detection/holdback logic in `AsyncEngine.generate()`.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| lmdeploy/utils.py | Change _stop_words to return token-id sequences instead of a packed numpy structure. |
| lmdeploy/turbomind/turbomind.py | Pack stop/bad word sequences into TurboMind’s (tokens, offsets) format; adjust eos/stop/bad wiring. |
| lmdeploy/tokenizer.py | Make indexes_containing_token() return empty for multi-token encodings so callers can fall back to encode(). |
| lmdeploy/serve/core/async_engine.py | Implement streaming multi-stop matching with tail holdback; refactor stop-id handling. |
| lmdeploy/pytorch/strategies/dllm/sampling.py | Propagate renamed sampling input attribute from stop_mask to stop_word_lens. |
| lmdeploy/pytorch/strategies/ar/sampling.py | Build batched stop-word tensors for multi-token sequences plus per-sequence lengths. |
| lmdeploy/pytorch/strategies/ar/model_agent.py | Add vectorized multi-token stop detection spanning decode steps via a maintained tail. |
| lmdeploy/pytorch/messages.py | Update PyTorch SamplingParam.stop_words to be sequence-based and adjust ignore_eos behavior. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Pass output_token_ids into stopping criteria and thread through stop_word_lens. |
| lmdeploy/pytorch/engine/logits_process.py | Update SamplingInputs to use stop_word_lens and only mask single-token stops during ignore-eos. |
| lmdeploy/messages.py | Change GenerationConfig.stop_token_ids to support nested sequences; add normalization and updated conversion logic. |
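The streaming "tail holdback" change in `async_engine.py` can be sketched roughly as follows. This is a minimal illustration, not the actual implementation; the function name and return shape are assumptions. The idea: hold back the longest suffix of the decoded buffer that is still a proper prefix of some stop sequence, so a multi-token stop never leaks partially to the client.

```python
def split_emittable(tokens: list[int], stop_seqs: list[list[int]]) -> tuple[list[int], list[int]]:
    """Split a decoded token buffer into (emit, holdback).

    Hold back the longest suffix of `tokens` that is a proper prefix of any
    stop sequence; everything before it is safe to stream immediately.
    Single-token stops are never held back (a full match means "stop").
    """
    max_hold = 0
    for seq in stop_seqs:
        # Only proper prefixes (k < len(seq)); a full match is handled by
        # the stop check itself, not by holdback.
        for k in range(1, min(len(seq) - 1, len(tokens)) + 1):
            if tokens[-k:] == seq[:k]:
                max_hold = max(max_hold, k)
    if max_hold == 0:
        return tokens, []
    return tokens[:-max_hold], tokens[-max_hold:]
```

For example, with stop sequence `[1, 2, 3]` and buffer `[9, 1, 2]`, only `[9]` is emitted while `[1, 2]` is held until the next token either completes or breaks the match.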
Comments suppressed due to low confidence (1)
lmdeploy/serve/core/async_engine.py:383
`generate()` now calls `_determine_gen_config(session, input_ids, ...)` before `input_ids` is populated from `messages` via `prompt_processor.get_prompt_input()`. When `messages` is provided (the common path), `input_ids` is still `None` here and `_determine_gen_config` will raise when it does `len(input_ids)` and other processing. Move this call back to after `input_ids` is available, or change `_determine_gen_config` to accept `None` and defer `max_new_tokens` computation until after tokenization.
```python
gen_config = self._determine_gen_config(session, input_ids, gen_config=gen_config)
if messages:
    prompt = messages
    self.request_logger.log_prompt(session, prompt=prompt)
    prompt_input = await self.prompt_processor.get_prompt_input(prompt=prompt,
                                                                do_preprocess=do_preprocess,
```
```python
# This assumes the engine will stop when stop token is hit
if output_len and outputs.token_ids[-1] in stop_ids:
    # print(f'outputs.token_ids: {outputs.token_ids}')
```
Please remove the commented-out debug statement (# print(f'outputs.token_ids: ...')). Leaving debug prints in production code adds noise and can confuse future readers when troubleshooting stop-word logic.
```diff
-    # print(f'outputs.token_ids: {outputs.token_ids}')
```
```python
# stopping criteria
# Use output_token_ids (all tokens accepted this step) so that multi-token
# stop sequences whose last token is not the final spec-decoded token are
# detected correctly. For non-spec AR, output_token_ids == next_token_ids.
stopped, stop_pos, stopping_criteria = stopping_criteria.step(
    next_token_ids,
    output_token_ids,
    sampling_inputs.stop_words,
    inputs=inputs,
    extra_inputs=extra_inputs,
    stop_word_lens=sampling_inputs.stop_word_lens,
)
```
stopping_criteria.step(...) is now called with the new keyword argument stop_word_lens=..., but most StoppingCriteria implementations (e.g. ARSpecStoppingCriteria.step, DLLMStoppingCriteria.step, and the abstract base signature) do not accept this parameter. This will raise TypeError: got an unexpected keyword argument 'stop_word_lens' when using non-AR strategies. Either update the base StoppingCriteria.step contract (and all implementations) to accept stop_word_lens (or **kwargs), or only pass stop_word_lens when the concrete stopping criteria supports it.
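One way to avoid this `TypeError`, sketched below with hypothetical class names (not the actual LMDeploy classes), is to make the base contract absorb strategy-specific keyword arguments via `**kwargs`, so implementations that don't use `stop_word_lens` simply ignore it:

```python
from abc import ABC, abstractmethod

class StoppingCriteriaBase(ABC):
    """Hypothetical base: subclasses take **kwargs and ignore unknown ones."""

    @abstractmethod
    def step(self, next_token_ids, stop_words, **kwargs):
        ...

class SpecCriteria(StoppingCriteriaBase):
    def step(self, next_token_ids, stop_words, **kwargs):
        # This strategy ignores stop_word_lens entirely.
        return bool(stop_words) and next_token_ids[-1] in stop_words

crit = SpecCriteria()
# Passing the new kwarg no longer raises TypeError for strategies that ignore it.
stopped = crit.step([1, 2, 3], [3], stop_word_lens=[1])
```

The alternative of passing `stop_word_lens` only when the concrete criteria supports it keeps signatures strict but moves the dispatch burden to every call site.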
```python
def _stop_words(stop_words: list[int | str], tokenizer: object) -> list[list[int]] | None:
    """Convert chat-template stop words to List[List[int]].

    Each element is a token ID sequence representing one stop word. Single-token matches from vocab scan produce
    length-1 lists. Multi-token words that require encoding produce longer lists.
    """
    if stop_words is None:
        return None
    assert isinstance(stop_words, list) and \
```
The _stop_words parameter type is declared as list[int | str], but the function explicitly supports None (if stop_words is None: return None) and is called with self.chat_template.stop_words which can be None. Please update the signature to reflect this (e.g., stop_words: list[int | str] | None) so callers and type-checkers match runtime behavior.
```python
if key not in seen:
    seen.add(key)
    deduped.append(seq)
self.stop_token_ids = deduped or None
```
GenerationConfig.convert_stop_bad_words_to_ids() now mutates self.stop_token_ids to a nested list[list[int]]. Since GenerationConfig is a public config object (and previously exposed stop_token_ids as List[int]), this is a backward-incompatible behavioral change for downstream code that inspects or serializes stop_token_ids. Consider either (a) keeping stop_token_ids externally as list[int] when only single-token stops are present, (b) introducing a new field for multi-token stop sequences, or (c) clearly documenting this change and providing a helper accessor that returns the legacy flattened single-token view.
```diff
-        self.stop_token_ids = deduped or None
+        # Preserve backward-compatible shape: use flat list[int] when only
+        # single-token stop sequences are present; otherwise keep nested list[list[int]].
+        if deduped and all(len(seq) == 1 for seq in deduped):
+            self.stop_token_ids = [seq[0] for seq in deduped]
+        else:
+            self.stop_token_ids = deduped or None
```
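Option (c) from the comment above could look like the following sketch. The helper name is an assumption for illustration; it exposes the legacy flat `list[int]` view without changing the nested storage:

```python
def legacy_stop_token_ids(stop_token_ids):
    """Flatten nested stop sequences to the legacy list[int] view.

    Multi-token sequences have no legacy equivalent and are skipped, so
    callers relying on the old shape only see single-token stops.
    """
    if stop_token_ids is None:
        return None
    flat = [seq[0] for seq in stop_token_ids if len(seq) == 1]
    return flat or None
```

For example, `legacy_stop_token_ids([[2], [5, 6], [7]])` yields `[2, 7]`: the multi-token `[5, 6]` has no flat representation and is dropped from the legacy view.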
```python
# During min_new_tokens period suppress single-token stops as bad words.
bw = bw + [s[0] for s in sw if len(s) == 1]
bad_words[idx] = bw
stop_words[idx] = sw
```
During the min_new_tokens period, the code only suppresses single-token stop sequences by adding them to bad_words. Multi-token stop sequences are not suppressed, so the model can still generate them and StoppingCriteria.step() can stop early, violating the min_new_tokens guarantee. Consider disabling stop-word checks entirely until seq.num_new_tokens >= min_new_tokens (or implementing a multi-token-aware suppression strategy) so min_new_tokens is honored for all stop sequences.
```diff
-            # During min_new_tokens period suppress single-token stops as bad words.
-            bw = bw + [s[0] for s in sw if len(s) == 1]
-            bad_words[idx] = bw
-            stop_words[idx] = sw
+            # During min_new_tokens period suppress single-token stops as bad words
+            # and disable stop-word based early stopping entirely.
+            bw = bw + [s[0] for s in sw if len(s) == 1]
+            effective_sw = []
+        else:
+            effective_sw = sw
+        bad_words[idx] = bw
+        stop_words[idx] = effective_sw
```
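The gating idea in the suggestion can be reduced to a tiny sketch (names are hypothetical, not LMDeploy's API): return no stop sequences at all while the sequence is still under `min_new_tokens`, so neither single-token nor multi-token stops can end generation early.

```python
def effective_stop_words(stop_words: list[list[int]], num_new_tokens: int, min_new_tokens: int) -> list[list[int]]:
    """Return the stop sequences the stopping criteria should check this step.

    While under min_new_tokens, return no stop sequences; single-token stops
    would additionally be suppressed as bad words during sampling.
    """
    if num_new_tokens < min_new_tokens:
        return []
    return stop_words
```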
```diff
     n: int = 1
-    max_new_tokens: int = 512
+    max_new_tokens: int = None
```
GenerationConfig.max_new_tokens now defaults to None, but other code paths construct GenerationConfig() and immediately treat max_new_tokens as an int. For example, lmdeploy/pytorch/engine/engine_instance.py creates gen_config = gen_config or GenerationConfig() and SamplingParam.from_gen_config() then does numeric comparisons (max_new_tokens < 0, etc.), which will raise a TypeError when max_new_tokens is None. Either keep the default as an int (e.g., 512) or update all consumers (not just AsyncEngine._determine_gen_config) to handle None safely.
```diff
-    max_new_tokens: int = None
+    max_new_tokens: int | None = 512
```
```diff
     dist_init_addr: str | None = None
     devices: list[int] = None
     session_len: int | None = None
     max_batch_size: int = None
     cache_max_entry_count: float = 0.8
     cache_chunk_size: int = -1
     cache_block_seq_len: int = 64
     enable_prefix_caching: bool = False
     quant_policy: int = 0
     rope_scaling_factor: float = 0.0
     use_logn_attn: bool = False
-    download_dir: Optional[str] = None
-    revision: Optional[str] = None
+    download_dir: str | None = None
+    revision: str | None = None
     max_prefill_token_num: int = 8192
     num_tokens_per_iter: int = 0
     max_prefill_iters: int = 1
     async_: int = 1
-    devices: Optional[List[int]] = None
+    devices: list[int] | None = None
     empty_init: bool = False
```
TurbomindEngineConfig defines devices twice (once at line 315 and again at line 331). In a dataclass, the later field overwrites the earlier one, which is error-prone and can lead to confusing defaults/type hints. Remove the duplicate and keep a single devices definition with the intended type/default.
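A small standalone demo of why the duplicate field is error-prone (field names here are illustrative, not the real config): in a dataclass body, a re-annotated name produces a single field whose default comes from the later definition, while keeping the earlier position in field order, so the first definition is silently shadowed.

```python
from dataclasses import dataclass, fields

@dataclass
class DuplicatedField:
    devices: tuple = ()        # first definition
    session_len: int = 2048
    devices: tuple = (0,)      # later definition silently wins

# Only one `devices` field survives, and its default is the later one.
names = [f.name for f in fields(DuplicatedField)]
default_devices = DuplicatedField().devices
```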
```python
if tail_len <= 0:
    return None
if self.stop_tail is None:
    return torch.zeros(batch_size, tail_len, dtype=torch.long, device=device)
```
When stop_tail is None, _get_prev_tail initializes the history tail with zeros. This can cause false positive multi-token stop matches involving token id 0 before any real tokens exist (especially in the first few decode steps). Use a sentinel that cannot match real tokens (e.g., -1) for the initial tail, consistent with later padding logic that uses value=-1.
```diff
-            return torch.zeros(batch_size, tail_len, dtype=torch.long, device=device)
+            # Initialize with sentinel -1 so no real token id (>=0) can match,
+            # consistent with later padding logic that uses value=-1.
+            return torch.full((batch_size, tail_len), -1, dtype=torch.long, device=device)
```
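The false positive is easy to reproduce in plain Python (no tensors needed; `tail_matches` is an illustrative stand-in, not the real detection code): a zero-filled history tail is indistinguishable from real token id 0, while a -1 sentinel can never match a valid vocabulary id.

```python
def tail_matches(tail: list[int], new_tokens: list[int], stop_seq: list[int]) -> bool:
    """Check whether tail + new_tokens ends with stop_seq."""
    window = (tail + new_tokens)[-len(stop_seq):]
    return window == stop_seq

stop_seq = [0, 42]            # a stop sequence that begins with token id 0
zero_tail = [0, 0]            # zero-initialized tail: zeros look like real tokens
sentinel_tail = [-1, -1]      # -1 cannot equal any real token id (ids are >= 0)

false_positive = tail_matches(zero_tail, [42], stop_seq)   # spurious stop
correct = tail_matches(sentinel_tail, [42], stop_seq)      # no match
```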
```diff
+    @staticmethod
+    def _normalize_stop_token_ids(ids: list[int] | list[list[int]] | None) -> list[list[int]]:
+        """Normalize stop_token_ids to list[list[int]]."""
+        if ids is None:
+            return []
+        out: list[list[int]] = []
+        for item in ids:
+            if isinstance(item, int):
+                out.append([item])
+            else:
+                out.append(list(item))
+        return out
+
     def convert_stop_bad_words_to_ids(self, tokenizer: Tokenizer):
-        """Convert stop_words/bad_sords to ids and append the ids to
+        """Convert stop_words/bad_words to ids and append the ids to
         stop_token_ids/bad_token_ids."""

-        def special_word_token_ids(words):
-            if words is not None:
-                assert isinstance(words, List) and \
-                    all(isinstance(elem, str) for elem in words), \
-                    f'stop_words must be a list of str but got {type(words)}'
-                indexes = []
-                for word in words:
-                    indexes += tokenizer.indexes_containing_token(word)
-                return indexes
-            return None
-
-        stop_token_ids = special_word_token_ids(self.stop_words) or []
-        bad_token_ids = special_word_token_ids(self.bad_words) or []
-        stop_token_ids.extend(self.stop_token_ids or [])
-        bad_token_ids.extend(self.bad_token_ids or [])
-        self.stop_token_ids = list(set(stop_token_ids)) or None
-        self.bad_token_ids = list(set(bad_token_ids)) or None
+        def words_to_token_seqs(words: list[str]) -> list[list[int]]:
+            assert isinstance(words, list) and \
+                all(isinstance(elem, str) for elem in words), \
+                f'stop_words must be a list of str but got {type(words)}'
+            seqs: list[list[int]] = []
+            for word in words:
+                single_matches = tokenizer.indexes_containing_token(word)
+                if single_matches:
+                    for idx in single_matches:
+                        seqs.append([idx])
+                else:
+                    encoded = tokenizer.encode(word, add_bos=False)
+                    if encoded:
+                        seqs.append(encoded)
+            return seqs
+
+        stop_seqs = words_to_token_seqs(self.stop_words) if self.stop_words else []
+        bad_seqs = words_to_token_seqs(self.bad_words) if self.bad_words else []
+
+        stop_seqs.extend(self._normalize_stop_token_ids(self.stop_token_ids))
+        bad_seqs.extend([[i] for i in (self.bad_token_ids or [])])
+
+        # deduplicate stop_token_ids and bad_token_ids
+        seen = set()
+        deduped: list[list[int]] = []
+        for seq in stop_seqs:
+            key = tuple(seq)
+            if key not in seen:
+                seen.add(key)
+                deduped.append(seq)
+        self.stop_token_ids = deduped or None
```
This PR changes GenerationConfig.stop_token_ids to be normalized/stored as list[list[int]] (to support multi-token stop sequences). There are existing unit tests (e.g. tests/test_lmdeploy/test_messages.py::test_engine_generation_config) that assert stop_token_ids is List[int], so the current test suite will fail and there are no new tests covering multi-token stop sequences. Please update the existing tests and add coverage for: (1) mixed stop_token_ids input ([int] and [[...]]), (2) multi-token stop_words encoding path, and (3) streaming stop behavior (holdback) if applicable.
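A sketch of the requested coverage for case (1), mixed `stop_token_ids` input. The helper here is a hypothetical stand-in mirroring a normalize step like `_normalize_stop_token_ids`, not the actual LMDeploy code, so the real tests would exercise `GenerationConfig` directly:

```python
def normalize_stop_token_ids(ids):
    """Normalize mixed [int] / [[int, ...]] input to list[list[int]]."""
    if ids is None:
        return []
    return [[item] if isinstance(item, int) else list(item) for item in ids]

def test_mixed_stop_token_ids():
    # bare ints and nested sequences may be freely mixed
    assert normalize_stop_token_ids([1, [2, 3], 4]) == [[1], [2, 3], [4]]

def test_none_and_empty():
    assert normalize_stop_token_ids(None) == []
    assert normalize_stop_token_ids([]) == []
```

Cases (2) and (3) would additionally need a tokenizer fixture and a streaming engine stub, respectively.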
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and seek help from the maintainers.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist