Support multi stop words#4454

Open
lvhan028 wants to merge 7 commits into InternLM:main from lvhan028:support-multi-stop-words

Conversation

@lvhan028
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and seek help from maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

Copilot AI review requested due to automatic review settings March 24, 2026 06:37
Contributor

Copilot AI left a comment


Pull request overview

This PR aims to extend stop-word handling to support multiple stop words and multi-token stop sequences across LMDeploy’s configs and runtime (TurboMind, PyTorch, and streaming AsyncEngine).

Changes:

  • Normalize/convert stop words into token-id sequences (list[list[int]]) and update tokenization helpers accordingly.
  • Update TurboMind and PyTorch pipelines to carry multi-stop metadata (packed stop/bad words for TM; stop_word_lens for PyTorch).
  • Add streaming-time multi-stop detection/holdback logic in AsyncEngine.generate().
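The streaming holdback idea can be sketched as follows. This is a minimal standalone illustration (the helper name `split_for_holdback` is hypothetical and not from the PR; the actual AsyncEngine.generate() logic may differ): before streaming newly decoded tokens, any suffix that is a proper prefix of a configured stop sequence is withheld until the match either completes (stop) or fails (emit).

```python
def split_for_holdback(token_ids: list[int], stop_seqs: list[list[int]]):
    """Split tokens into (emit, held): `held` is the longest suffix of
    token_ids that is a proper prefix of some stop sequence, so it must
    not be streamed to the client yet."""
    # At most len(longest stop sequence) - 1 tokens ever need holding back.
    max_hold = max((len(s) - 1 for s in stop_seqs), default=0)
    held = 0
    for k in range(min(max_hold, len(token_ids)), 0, -1):
        tail = token_ids[-k:]
        if any(len(s) > k and s[:k] == tail for s in stop_seqs):
            held = k
            break
    cut = len(token_ids) - held
    return token_ids[:cut], token_ids[cut:]
```

For example, with a stop sequence `[5, 6, 7]`, the tokens `[1, 2, 5, 6]` would be split into emitted `[1, 2]` and held `[5, 6]`, since the next token might complete the stop sequence.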

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
lmdeploy/utils.py Change _stop_words to return token-id sequences instead of a packed numpy structure.
lmdeploy/turbomind/turbomind.py Pack stop/bad word sequences into TurboMind’s (tokens, offsets) format; adjust eos/stop/bad wiring.
lmdeploy/tokenizer.py Make indexes_containing_token() return empty for multi-token encodings so callers can fall back to encode().
lmdeploy/serve/core/async_engine.py Implement streaming multi-stop matching with tail holdback; refactor stop-id handling.
lmdeploy/pytorch/strategies/dllm/sampling.py Propagate renamed sampling input attribute from stop_mask to stop_word_lens.
lmdeploy/pytorch/strategies/ar/sampling.py Build batched stop-word tensors for multi-token sequences plus per-sequence lengths.
lmdeploy/pytorch/strategies/ar/model_agent.py Add vectorized multi-token stop detection spanning decode steps via a maintained tail.
lmdeploy/pytorch/messages.py Update PyTorch SamplingParam.stop_words to be sequence-based and adjust ignore_eos behavior.
lmdeploy/pytorch/engine/model_agent/agent.py Pass output_token_ids into stopping criteria and thread through stop_word_lens.
lmdeploy/pytorch/engine/logits_process.py Update SamplingInputs to use stop_word_lens and only mask single-token stops during ignore-eos.
lmdeploy/messages.py Change GenerationConfig.stop_token_ids to support nested sequences; add normalization and updated conversion logic.
Comments suppressed due to low confidence (1)

lmdeploy/serve/core/async_engine.py:383

  • generate() now calls _determine_gen_config(session, input_ids, ...) before input_ids is populated from messages via prompt_processor.get_prompt_input(). When messages is provided (the common path), input_ids is still None at this point, and _determine_gen_config will raise when it calls len(input_ids) or performs other processing on it. Move this call back to after input_ids is available, or change _determine_gen_config to accept None and defer the max_new_tokens computation until after tokenization.
        gen_config = self._determine_gen_config(session, input_ids, gen_config=gen_config)
        if messages:
            prompt = messages
            self.request_logger.log_prompt(session, prompt=prompt)
            prompt_input = await self.prompt_processor.get_prompt_input(prompt=prompt,
                                                                        do_preprocess=do_preprocess,



# This assumes the engine will stop when stop token is hit
if output_len and outputs.token_ids[-1] in stop_ids:
# print(f'outputs.token_ids: {outputs.token_ids}')

Copilot AI Mar 24, 2026


Please remove the commented-out debug statement (# print(f'outputs.token_ids: ...')). Leaving debug prints in production code adds noise and can confuse future readers when troubleshooting stop-word logic.

Suggested change
# print(f'outputs.token_ids: {outputs.token_ids}')

Comment on lines 697 to 707
# stopping criteria
# Use output_token_ids (all tokens accepted this step) so that multi-token
# stop sequences whose last token is not the final spec-decoded token are
# detected correctly. For non-spec AR, output_token_ids == next_token_ids.
stopped, stop_pos, stopping_criteria = stopping_criteria.step(
next_token_ids,
output_token_ids,
sampling_inputs.stop_words,
inputs=inputs,
extra_inputs=extra_inputs,
stop_word_lens=sampling_inputs.stop_word_lens,
)

Copilot AI Mar 24, 2026


stopping_criteria.step(...) is now called with the new keyword argument stop_word_lens=..., but most StoppingCriteria implementations (e.g. ARSpecStoppingCriteria.step, DLLMStoppingCriteria.step, and the abstract base signature) do not accept this parameter. This will raise TypeError: got an unexpected keyword argument 'stop_word_lens' when using non-AR strategies. Either update the base StoppingCriteria.step contract (and all implementations) to accept stop_word_lens (or **kwargs), or only pass stop_word_lens when the concrete stopping criteria supports it.
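The `**kwargs` option from this comment can be sketched as follows. This is a simplified illustration, not LMDeploy's actual class hierarchy (the return shape and argument names are assumptions based on the call site quoted above):

```python
from abc import ABC, abstractmethod

class StoppingCriteria(ABC):
    """Sketch of a base contract whose step() accepts **kwargs so callers
    can pass strategy-specific arguments such as stop_word_lens."""

    @abstractmethod
    def step(self, next_token_ids, output_token_ids, stop_words, **kwargs):
        ...

class DLLMStoppingCriteria(StoppingCriteria):
    def step(self, next_token_ids, output_token_ids, stop_words, **kwargs):
        # Strategies that do not use stop_word_lens simply ignore it
        # instead of raising TypeError on the unexpected keyword.
        return False, None, self
```

With this contract, the AR agent can always pass `stop_word_lens=...` and non-AR strategies silently ignore it.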

Comment on lines +198 to 206
def _stop_words(stop_words: list[int | str], tokenizer: object) -> list[list[int]] | None:
"""Convert chat-template stop words to List[List[int]].

Each element is a token ID sequence representing one stop word. Single-token matches from vocab scan produce
length-1 lists. Multi-token words that require encoding produce longer lists.
"""
if stop_words is None:
return None
assert isinstance(stop_words, list) and \

Copilot AI Mar 24, 2026


The _stop_words parameter type is declared as list[int | str], but the function explicitly supports None (if stop_words is None: return None) and is called with self.chat_template.stop_words which can be None. Please update the signature to reflect this (e.g., stop_words: list[int | str] | None) so callers and type-checkers match runtime behavior.

if key not in seen:
seen.add(key)
deduped.append(seq)
self.stop_token_ids = deduped or None

Copilot AI Mar 24, 2026


GenerationConfig.convert_stop_bad_words_to_ids() now mutates self.stop_token_ids to a nested list[list[int]]. Since GenerationConfig is a public config object (and previously exposed stop_token_ids as List[int]), this is a backward-incompatible behavioral change for downstream code that inspects or serializes stop_token_ids. Consider either (a) keeping stop_token_ids externally as list[int] when only single-token stops are present, (b) introducing a new field for multi-token stop sequences, or (c) clearly documenting this change and providing a helper accessor that returns the legacy flattened single-token view.

Suggested change
self.stop_token_ids = deduped or None
# Preserve backward-compatible shape: use flat list[int] when only
# single-token stop sequences are present; otherwise keep nested list[list[int]].
if deduped and all(len(seq) == 1 for seq in deduped):
self.stop_token_ids = [seq[0] for seq in deduped] or None
else:
self.stop_token_ids = deduped or None

Comment on lines +100 to 103
# During min_new_tokens period suppress single-token stops as bad words.
bw = bw + [s[0] for s in sw if len(s) == 1]
bad_words[idx] = bw
stop_words[idx] = sw

Copilot AI Mar 24, 2026


During the min_new_tokens period, the code only suppresses single-token stop sequences by adding them to bad_words. Multi-token stop sequences are not suppressed, so the model can still generate them and StoppingCriteria.step() can stop early, violating the min_new_tokens guarantee. Consider disabling stop-word checks entirely until seq.num_new_tokens >= min_new_tokens (or implementing a multi-token-aware suppression strategy) so min_new_tokens is honored for all stop sequences.

Suggested change
# During min_new_tokens period suppress single-token stops as bad words.
bw = bw + [s[0] for s in sw if len(s) == 1]
bad_words[idx] = bw
stop_words[idx] = sw
# During min_new_tokens period suppress single-token stops as bad words
# and disable stop-word based early stopping entirely.
bw = bw + [s[0] for s in sw if len(s) == 1]
effective_sw = []
else:
effective_sw = sw
bad_words[idx] = bw
stop_words[idx] = effective_sw


n: int = 1
max_new_tokens: int = 512
max_new_tokens: int = None

Copilot AI Mar 24, 2026


GenerationConfig.max_new_tokens now defaults to None, but other code paths construct GenerationConfig() and immediately treat max_new_tokens as an int. For example, lmdeploy/pytorch/engine/engine_instance.py creates gen_config = gen_config or GenerationConfig() and SamplingParam.from_gen_config() then does numeric comparisons (max_new_tokens < 0, etc.), which will raise a TypeError when max_new_tokens is None. Either keep the default as an int (e.g., 512) or update all consumers (not just AsyncEngine._determine_gen_config) to handle None safely.

Suggested change
max_new_tokens: int = None
max_new_tokens: int | None = 512

Comment on lines +314 to 332
dist_init_addr: str | None = None
devices: list[int] = None
session_len: int | None = None
max_batch_size: int = None
cache_max_entry_count: float = 0.8
cache_chunk_size: int = -1
cache_block_seq_len: int = 64
enable_prefix_caching: bool = False
quant_policy: int = 0
rope_scaling_factor: float = 0.0
use_logn_attn: bool = False
download_dir: Optional[str] = None
revision: Optional[str] = None
download_dir: str | None = None
revision: str | None = None
max_prefill_token_num: int = 8192
num_tokens_per_iter: int = 0
max_prefill_iters: int = 1
async_: int = 1
devices: Optional[List[int]] = None
devices: list[int] | None = None
empty_init: bool = False

Copilot AI Mar 24, 2026


TurbomindEngineConfig defines devices twice (once at line 315 and again at line 331). In a dataclass, the later field overwrites the earlier one, which is error-prone and can lead to confusing defaults/type hints. Remove the duplicate and keep a single devices definition with the intended type/default.

if tail_len <= 0:
return None
if self.stop_tail is None:
return torch.zeros(batch_size, tail_len, dtype=torch.long, device=device)

Copilot AI Mar 24, 2026


When stop_tail is None, _get_prev_tail initializes the history tail with zeros. This can cause false positive multi-token stop matches involving token id 0 before any real tokens exist (especially in the first few decode steps). Use a sentinel that cannot match real tokens (e.g., -1) for the initial tail, consistent with later padding logic that uses value=-1.

Suggested change
return torch.zeros(batch_size, tail_len, dtype=torch.long, device=device)
# Initialize with sentinel -1 so no real token id (>=0) can match,
# consistent with later padding logic that uses value=-1.
return torch.full((batch_size, tail_len), -1, dtype=torch.long, device=device)

Comment on lines +138 to +186
@staticmethod
def _normalize_stop_token_ids(ids: list[int] | list[list[int]] | None) -> list[list[int]]:
"""Normalize stop_token_ids to list[list[int]]."""
if ids is None:
return []
out: list[list[int]] = []
for item in ids:
if isinstance(item, int):
out.append([item])
else:
out.append(list(item))
return out

def convert_stop_bad_words_to_ids(self, tokenizer: Tokenizer):
"""Convert stop_words/bad_sords to ids and append the ids to
"""Convert stop_words/bad_words to ids and append the ids to
stop_token_ids/bad_token_ids."""

def special_word_token_ids(words):
if words is not None:
assert isinstance(words, List) and \
all(isinstance(elem, str) for elem in words), \
f'stop_words must be a list of str but got {type(words)}'
indexes = []
for word in words:
indexes += tokenizer.indexes_containing_token(word)
return indexes
return None

stop_token_ids = special_word_token_ids(self.stop_words) or []
bad_token_ids = special_word_token_ids(self.bad_words) or []
stop_token_ids.extend(self.stop_token_ids or [])
bad_token_ids.extend(self.bad_token_ids or [])
self.stop_token_ids = list(set(stop_token_ids)) or None
self.bad_token_ids = list(set(bad_token_ids)) or None
def words_to_token_seqs(words: list[str]) -> list[list[int]]:
assert isinstance(words, list) and \
all(isinstance(elem, str) for elem in words), \
f'stop_words must be a list of str but got {type(words)}'
seqs: list[list[int]] = []
for word in words:
single_matches = tokenizer.indexes_containing_token(word)
if single_matches:
for idx in single_matches:
seqs.append([idx])
else:
encoded = tokenizer.encode(word, add_bos=False)
if encoded:
seqs.append(encoded)
return seqs

stop_seqs = words_to_token_seqs(self.stop_words) if self.stop_words else []
bad_seqs = words_to_token_seqs(self.bad_words) if self.bad_words else []

stop_seqs.extend(self._normalize_stop_token_ids(self.stop_token_ids))
bad_seqs.extend([[i] for i in (self.bad_token_ids or [])])

# deduplicate stop_token_ids and bad_token_ids
seen = set()
deduped: list[list[int]] = []
for seq in stop_seqs:
key = tuple(seq)
if key not in seen:
seen.add(key)
deduped.append(seq)
self.stop_token_ids = deduped or None


Copilot AI Mar 24, 2026


This PR changes GenerationConfig.stop_token_ids to be normalized/stored as list[list[int]] (to support multi-token stop sequences). There are existing unit tests (e.g. tests/test_lmdeploy/test_messages.py::test_engine_generation_config) that assert stop_token_ids is List[int], so the current test suite will fail and there are no new tests covering multi-token stop sequences. Please update the existing tests and add coverage for: (1) mixed stop_token_ids input ([int] and [[...]]), (2) multi-token stop_words encoding path, and (3) streaming stop behavior (holdback) if applicable.
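A test for case (1) could look like the sketch below. The function is a standalone mirror of the `_normalize_stop_token_ids` static method shown in the diff above, reproduced here so the test is self-contained (in the real test suite one would import GenerationConfig instead):

```python
def normalize_stop_token_ids(ids):
    """Standalone mirror of GenerationConfig._normalize_stop_token_ids
    from this PR, for illustration only."""
    if ids is None:
        return []
    return [[item] if isinstance(item, int) else list(item) for item in ids]

def test_normalize_stop_token_ids():
    # (1) mixed input: bare ints and nested sequences both normalize
    # to list[list[int]]
    assert normalize_stop_token_ids([2, [5, 6]]) == [[2], [5, 6]]
    # None input yields an empty list rather than raising
    assert normalize_stop_token_ids(None) == []
    # already-nested input passes through unchanged
    assert normalize_stop_token_ids([[7], [8, 9]]) == [[7], [8, 9]]
```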
