[Speculative Decoding] Refine ngram kernel signature and adapt ngram proposer by NKNaN · Pull Request #7774 · PaddlePaddle/FastDeploy

NKNaN · 2026-05-11T08:10:53Z

Motivation

End-to-end validation of the ngram method for speculative decoding.

Modifications

Test script (runs successfully in an AI Studio single-GPU A800 environment):

# test.py
from fastdeploy import LLM, SamplingParams

# Scenario 1: code generation. Variable names, keywords, and structure repeat heavily, so the ngram hit rate is high.
msg1 = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": (
        "用 Python 写一个 Student 类，包含以下方法：\n"
        "1. __init__(self, name, age, score)\n"
        "2. get_name(self) 返回 self.name\n"
        "3. get_age(self) 返回 self.age\n"
        "4. get_score(self) 返回 self.score\n"
        "5. set_name(self, name) 设置 self.name\n"
        "6. set_age(self, age) 设置 self.age\n"
        "7. set_score(self, score) 设置 self.score\n"
        "8. __repr__(self) 返回 f'Student(name={self.name}, age={self.age}, score={self.score})'\n"
        "请完整实现所有方法。"
    )},
]

# Scenario 2: structured list. Every entry has the same format, so prefix n-grams repeat heavily during generation.
msg2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": (
        "请列出20个中国城市，每条格式为：\n"
        "城市名：xxx，省份：xxx，人口：约xxx万，著名景点：xxx\n"
        "请严格按照这个格式输出全部20条，不要省略。"
    )},
]

messages = [msg1, msg2]

# Sampling parameters
sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)

# Load the model
llm = LLM(
    model="baidu/ERNIE-4.5-0.3B-Paddle",
    tensor_parallel_size=1,
    max_model_len=8192,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,   # at most 5 draft tokens per step, range [1, 5]
        "max_ngram_size": 5,           # maximum n-gram window, default 5
    },
    # enable_overlap_schedule=True,
)

outputs = llm.chat(messages, sampling_params)

# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print(prompt)
    print(generated_text)

Modify the ngram kernel interface:
Because input_ids and pre_ids are now both merged into token_ids_all, input_ids is removed from the original interface. Prompt tokens and predicted tokens are recorded entirely in token_ids_all.
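To make the merged layout concrete, here is a minimal pure-Python sketch of a single token_ids_all row. It is an illustration only: the helper names (`make_row`, `append_generated`, `context_of`) and the use of plain lists instead of Paddle tensors are assumptions, not FastDeploy code, and the sketch starts `step_idx` at 0, which may differ from the real indexing convention. The -1 filler and the `prompt_len + step_idx` context bound mirror the snippets quoted later in this description.

```python
# Hypothetical illustration of one token_ids_all row (not FastDeploy code).
FILL = -1  # placeholder for positions that hold no generated token yet


def make_row(prompt_token_ids, max_model_len):
    """Lay out a row: prompt tokens first, then -1 placeholders."""
    if len(prompt_token_ids) > max_model_len:
        raise ValueError("prompt longer than max_model_len")
    return list(prompt_token_ids) + [FILL] * (max_model_len - len(prompt_token_ids))


def append_generated(row, prompt_len, step_idx, token_id):
    """Write one generated token right after the tokens already recorded."""
    row[prompt_len + step_idx] = token_id
    return step_idx + 1


def context_of(row, prompt_len, step_idx):
    """Prompt plus generated tokens so far: the only region a matcher should read."""
    return row[: prompt_len + step_idx]


row = make_row([11, 12, 13], max_model_len=8)
step = 0
step = append_generated(row, prompt_len=3, step_idx=step, token_id=21)
step = append_generated(row, prompt_len=3, step_idx=step, token_id=22)
print(row)                       # [11, 12, 13, 21, 22, -1, -1, -1]
print(context_of(row, 3, step))  # [11, 12, 13, 21, 22]
```

With a single buffer, a separate input_ids argument carries no extra information: the prompt is the first `prompt_len` entries of the row.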

Confirm that the modified ngram match kernel runs correctly end to end:

Initialization of token_ids_all and input_ids_cpu:

# fastdeploy\worker\input_batch.py: 114-115
self.token_ids_all = paddle.full(
    [max_num_seqs, self.model_config.max_model_len], ...
)

# fastdeploy\worker\input_batch.py: 280-281
self.input_ids_cpu = paddle.full(
    shape=[max_num_seqs, self.model_config.max_model_len], ...
)

Verify that the prompt portion of token_ids_all has the same content when written (in gpu_model_runner) and when read (in NgramProposer._run_impl), by inspecting the printed logs:

# fastdeploy\worker\gpu_model_runner.py: 916-919
# prompt_tokens
async_set_value(self.share_inputs["token_ids_all"][idx : idx + 1, :prompt_len], prompt_token_ids)
# generated_token_ids fill -1
self.share_inputs["token_ids_all"][idx : idx + 1, prompt_len:] = -1

## Log token_ids_all[idx, 0:20] and token_ids_all[idx, prompt_len-3:prompt_len+3] here
logger.info(f"[NGRAM][VERIFY-WRITE] idx={idx} prompt_len={prompt_len} "
        f"token_ids_all[0:20]={self.share_inputs['token_ids_all'][idx, :20].tolist()} "
        f"token_ids_all[pl-3:pl+3]={self.share_inputs['token_ids_all'][idx, prompt_len-3:prompt_len+3].tolist()}")

# Added at the beginning of _run_impl in ngram.py
def _run_impl(self, share_inputs):
    """
    run
    """
    # Debug only: log the first 3 calls so the output stays short.
    if not hasattr(self, '_debug_call_count'):
        self._debug_call_count = 0
    if self._debug_call_count < 3:
        pl = share_inputs["prompt_lens"]
        tia = share_inputs["token_ids_all"]
        si = share_inputs["step_idx"]
        for bid in range(pl.shape[0]):
            plen = int(pl[bid].item())
            if plen > 0:
                logger.info(f"[NGRAM][VERIFY-READ] call={self._debug_call_count} bid={bid} "
                            f"step_idx={int(si[bid].item())} prompt_len={plen} "
                            f"token_ids_all[0:20]={tia[bid, :20].tolist()} "
                            f"token_ids_all[pl-3:pl]={tia[bid, plen-3:plen].tolist()}"
                            f"seq_lens_dec={int(share_inputs['seq_lens_decoder'][bid].item())} ")
        self._debug_call_count += 1

    ngram_match(...)

# Inspect the log
(base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM\]' log/paddle/workerlog.0
INFO     2026-05-10 13:31:12,504 684126 gpu_model_runner.py[line:920] [NGRAM][VERIFY-WRITE] idx=0 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl+3]=[92267, 93963, 93919, -1, -1, -1]
INFO     2026-05-10 13:31:12,514 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=0 step_idx=1 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=168 
INFO     2026-05-10 13:31:12,516 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=1 step_idx=13 prompt_len=4096 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,516 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,518 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,519 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,522 684126 gpu_model_runner.py[line:920] [NGRAM][VERIFY-WRITE] idx=1 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl+3]=[92267, 93963, 93919, -1, -1, -1]
INFO     2026-05-10 13:31:12,533 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=0 step_idx=2 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=169 
INFO     2026-05-10 13:31:12,533 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=1 step_idx=1 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=54 
INFO     2026-05-10 13:31:12,535 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,535 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,541 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=0 step_idx=3 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=170 
INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=1 step_idx=2 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=55 
INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0

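Rows whose token_ids_all entries are all 5 belong to the dummy batch, which has seq_lens_decoder=0. For every other row, the prompt portion of token_ids_all is identical between the write and the read.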
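The comparison above was done by eye. As a rough sketch of how it could be automated, the snippet below parses one VERIFY-WRITE line and one VERIFY-READ line and compares the logged prompt slices. The regular expression and the helper names (`parse`, `prompt_matches`) are assumptions written against the log format shown above; the two sample lines are the idx=0 / bid=0 entries from the log, trimmed to start at the [NGRAM] tag.

```python
import re

# Fields printed by the VERIFY-WRITE / VERIFY-READ log statements above.
PATTERN = re.compile(
    r"\[NGRAM\]\[VERIFY-(?P<kind>WRITE|READ)\].*?"
    r"\b(?:idx|bid)=(?P<slot>\d+) .*?prompt_len=(?P<plen>\d+) "
    r"token_ids_all\[0:20\]=\[(?P<head>[^\]]*)\] "
    r"token_ids_all\[pl-3:pl(?:\+3)?\]=\[(?P<tail>[^\]]*)\]"
)

HEAD = ("[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, "
        "93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553]")
WRITE_LINE = ("[NGRAM][VERIFY-WRITE] idx=0 prompt_len=168 "
              f"token_ids_all[0:20]={HEAD} "
              "token_ids_all[pl-3:pl+3]=[92267, 93963, 93919, -1, -1, -1]")
READ_LINE = ("[NGRAM][VERIFY-READ] call=0 bid=0 step_idx=1 prompt_len=168 "
             f"token_ids_all[0:20]={HEAD} "
             "token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=168 ")


def parse(line):
    """Extract slot index, prompt length, and the two logged slices."""
    m = PATTERN.search(line)
    if m is None:
        return None
    to_ints = lambda text: [int(tok) for tok in text.split(",")]
    return {
        "kind": m.group("kind"),
        "slot": int(m.group("slot")),
        "prompt_len": int(m.group("plen")),
        "head": to_ints(m.group("head")),
        "tail": to_ints(m.group("tail")),
    }


def prompt_matches(write_line, read_line):
    """True when both lines describe the same slot with identical prompt slices."""
    w, r = parse(write_line), parse(read_line)
    if w is None or r is None:
        raise ValueError("line does not look like a VERIFY-WRITE/READ entry")
    return (
        w["slot"] == r["slot"]
        and w["prompt_len"] == r["prompt_len"]
        and w["head"] == r["head"]
        # The write logs pl-3:pl+3, the read logs pl-3:pl; compare the shared part.
        and w["tail"][:3] == r["tail"]
    )


print(prompt_matches(WRITE_LINE, READ_LINE))  # True
```

The last three entries of the write-side tail are the -1 placeholders that follow the prompt, so they are deliberately left out of the comparison.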

Verify that, after ngram_match is called in ngram.py, whenever an n-gram is matched the resulting draft_token[i, 1:proposed_length] is contained in token_ids_all[:prompt_len+step_idx[i]]:

ngram_match(...)

# Added at the end of _run_impl in ngram.py
if not hasattr(self, '_debug_call_count'):
    self._debug_call_count = 0
if self._debug_call_count < 50:
    tia = share_inputs["token_ids_all"]
    pl  = share_inputs["prompt_lens"] 
    si  = share_inputs["step_idx"]
    dt  = share_inputs["draft_tokens"]
    slt = share_inputs["seq_lens_this_time"]
    print(f"[NGRAM-DEBUG] call={self._debug_call_count} "
        f"slt={slt.tolist()} "
        f"step_idx={si.tolist()} "
        f"prompt_lens={pl.tolist()} "
        f"draft_token_num={share_inputs['actual_draft_token_num'].tolist()} "
        f"seq_dec={share_inputs['seq_lens_decoder'].tolist()}")
    for bid in range(slt.shape[0]):
        n_proposed = int(slt[bid].item()) - 1
        if n_proposed <= 0:
            continue
        step = int(si[bid].item())
        plen = int(pl[bid].item())
        context = tia[bid, :plen + step].tolist()
        proposed = dt[bid, 1:1 + n_proposed].tolist()

        # Search for the proposed sequence as a contiguous run in context
        found = any(
            context[i:i + n_proposed] == proposed
            for i in range(len(context) - n_proposed + 1)
        )
        logger.info(f"[NGRAM][E2E] call={self._debug_call_count} bid={bid} step_idx={step}"
                    f"proposed={proposed} found_in_context={found}")
    self._debug_call_count += 1
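For readers unfamiliar with why this containment check is the right invariant, the following is a minimal pure-Python sketch of suffix-based n-gram proposal. It is not the FastDeploy kernel: `propose_ngram` and `is_contiguous_sublist` are hypothetical names, and the choices made here (longest suffix first, leftmost earlier occurrence) are assumptions for illustration. The point is only that drafts are copied from the context, so they should always be found in it.

```python
def propose_ngram(context, max_ngram_size=5, num_draft=5):
    """Copy draft tokens from an earlier occurrence of the context's suffix.

    Tries the longest suffix first and returns [] when nothing matches.
    """
    for n in range(min(max_ngram_size, len(context) - 1), 0, -1):
        suffix = context[-n:]
        # Scan earlier start positions only, so the suffix never matches itself.
        for start in range(len(context) - n):
            if context[start:start + n] == suffix:
                follow = context[start + n:start + n + num_draft]
                if follow:
                    return follow
    return []


def is_contiguous_sublist(needle, haystack):
    """Same containment test as the debug snippet above."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


context = [1, 2, 3, 4, 5, 9, 1, 2, 3]
draft = propose_ngram(context, max_ngram_size=5, num_draft=3)
print(draft)                                  # [4, 5, 9]
print(is_contiguous_sublist(draft, context))  # True
print(propose_ngram([1, 2, 3, 4]))            # [] (no repeated n-gram)
```

In the example, the suffix [1, 2, 3] also appears at the start of the context, so the three tokens that followed it there become the draft.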

# Inspect [NGRAM-DEBUG]
(base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM-DEBUG\]' log/paddle/workerlog.0
[NGRAM-DEBUG] call=0 slt=[1] step_idx=[[1], [13], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [4096], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[168, 0, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=1 slt=[1, 1] step_idx=[[2], [1], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[169, 54, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=2 slt=[1, 1] step_idx=[[3], [2], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[170, 55, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=3 slt=[6, 1] step_idx=[[4], [3], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[171, 56, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=4 slt=[1, 6] step_idx=[[5], [4], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[172, 57, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=5 slt=[6, 1] step_idx=[[6], [5], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[173, 58, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=6 slt=[6, 6] step_idx=[[7], [6], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[174, 59, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=7 slt=[6, 1] step_idx=[[8], [7], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[175, 60, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=8 slt=[6, 6] step_idx=[[10], [8], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[177, 61, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=9 slt=[1, 6] step_idx=[[11], [9], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[178, 62, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=10 slt=[1, 6] step_idx=[[12], [10], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[179, 63, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=11 slt=[6, 6] step_idx=[[13], [11], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[180, 64, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=12 slt=[6, 6] step_idx=[[15], [13], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[182, 66, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=13 slt=[1, 6] step_idx=[[16], [14], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[183, 67, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=14 slt=[1, 1] step_idx=[[17], [16], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[184, 69, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=15 slt=[6, 6] step_idx=[[18], [17], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[185, 70, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=16 slt=[1, 1] step_idx=[[19], [18], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[186, 71, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=17 slt=[6, 1] step_idx=[[20], [19], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[187, 72, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=18 slt=[6, 6] step_idx=[[21], [20], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[188, 73, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=19 slt=[6, 1] step_idx=[[22], [21], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[189, 74, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=20 slt=[1, 1] step_idx=[[23], [22], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[190, 75, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=21 slt=[1, 6] step_idx=[[24], [23], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[191, 76, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=22 slt=[6, 6] step_idx=[[25], [24], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[192, 77, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=23 slt=[6, 6] step_idx=[[31], [25], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[198, 78, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=24 slt=[1, 1] step_idx=[[35], [26], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[202, 79, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=25 slt=[6, 6] step_idx=[[36], [27], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[203, 80, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=26 slt=[1, 1] step_idx=[[37], [28], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[204, 81, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=27 slt=[1, 6] step_idx=[[38], [29], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[205, 82, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=28 slt=[6, 1] step_idx=[[39], [30], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[206, 83, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=29 slt=[4, 6] step_idx=[[40], [31], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[207, 84, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=30 slt=[1, 6] step_idx=[[41], [32], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[208, 85, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=31 slt=[1, 1] step_idx=[[42], [33], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[209, 86, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=32 slt=[1, 1] step_idx=[[43], [34], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[210, 87, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=33 slt=[6, 6] step_idx=[[44], [35], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[211, 88, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=34 slt=[6, 6] step_idx=[[45], [36], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[212, 89, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=35 slt=[6, 1] step_idx=[[46], [38], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[213, 91, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=36 slt=[1, 1] step_idx=[[47], [39], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[214, 92, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=37 slt=[6, 6] step_idx=[[48], [40], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[215, 93, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=38 slt=[6, 1] step_idx=[[49], [41], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[216, 94, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=39 slt=[1, 1] step_idx=[[50], [42], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[217, 95, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=40 slt=[1, 6] step_idx=[[51], [43], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[218, 96, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=41 slt=[6, 4] step_idx=[[52], [44], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[219, 97, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=42 slt=[6, 1] step_idx=[[53], [46], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[220, 99, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=43 slt=[6, 6] step_idx=[[54], [47], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[221, 100, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=44 slt=[6, 1] step_idx=[[56], [48], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[223, 101, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=45 slt=[6, 6] step_idx=[[57], [49], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[224, 102, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=46 slt=[6, 1] step_idx=[[58], [50], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[225, 103, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=47 slt=[1, 1] step_idx=[[59], [51], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[226, 104, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=48 slt=[6, 6] step_idx=[[60], [52], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[227, 105, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=49 slt=[6, 3] step_idx=[[61], [53], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[228, 106, 0, 0, 0, 0, 0, 0]

# Inspect [NGRAM]
(base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM\]' log/paddle/workerlog.0
INFO     2026-05-10 15:48:31,030 726574 ngram.py[line:84] [NGRAM][E2E] call=3 bid=0 step_idx=4proposed=[93949, 695, 7858, 804, 93937] found_in_context=True
INFO     2026-05-10 15:48:31,033 726574 ngram.py[line:84] [NGRAM][E2E] call=4 bid=1 step_idx=4proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
INFO     2026-05-10 15:48:31,037 726574 ngram.py[line:84] [NGRAM][E2E] call=5 bid=0 step_idx=6proposed=[93956, 10553, 4923, 1919, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,041 726574 ngram.py[line:84] [NGRAM][E2E] call=6 bid=0 step_idx=7proposed=[4162, 1919, 93977, 23, 92267] found_in_context=True
INFO     2026-05-10 15:48:31,043 726574 ngram.py[line:84] [NGRAM][E2E] call=6 bid=1 step_idx=6proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,046 726574 ngram.py[line:84] [NGRAM][E2E] call=7 bid=0 step_idx=8proposed=[10553, 4923, 1919, 94035, 23] found_in_context=True
INFO     2026-05-10 15:48:31,050 726574 ngram.py[line:84] [NGRAM][E2E] call=8 bid=0 step_idx=10proposed=[1919, 93977, 23, 92267, 93963] found_in_context=True
INFO     2026-05-10 15:48:31,050 726574 ngram.py[line:84] [NGRAM][E2E] call=8 bid=1 step_idx=8proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,053 726574 ngram.py[line:84] [NGRAM][E2E] call=9 bid=1 step_idx=9proposed=[14045, 94466, 93956, 17340, 33015] found_in_context=True
INFO     2026-05-10 15:48:31,057 726574 ngram.py[line:84] [NGRAM][E2E] call=10 bid=1 step_idx=10proposed=[93937, 42854, 94035, 3991, 93956] found_in_context=True
INFO     2026-05-10 15:48:31,060 726574 ngram.py[line:84] [NGRAM][E2E] call=11 bid=0 step_idx=13proposed=[23, 4, 93937, 1377, 1472] found_in_context=True
INFO     2026-05-10 15:48:31,060 726574 ngram.py[line:84] [NGRAM][E2E] call=11 bid=1 step_idx=11proposed=[3, 94016, 1358, 3671, 93956] found_in_context=True
INFO     2026-05-10 15:48:31,064 726574 ngram.py[line:84] [NGRAM][E2E] call=12 bid=0 step_idx=15proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
INFO     2026-05-10 15:48:31,064 726574 ngram.py[line:84] [NGRAM][E2E] call=12 bid=1 step_idx=13proposed=[94016, 1358, 3671, 93956, 94405] found_in_context=True
INFO     2026-05-10 15:48:31,067 726574 ngram.py[line:84] [NGRAM][E2E] call=13 bid=1 step_idx=14proposed=[93956, 17340, 33015, 94035, 14045] found_in_context=True
INFO     2026-05-10 15:48:31,074 726574 ngram.py[line:84] [NGRAM][E2E] call=15 bid=0 step_idx=18proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
INFO     2026-05-10 15:48:31,074 726574 ngram.py[line:84] [NGRAM][E2E] call=15 bid=1 step_idx=17proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,081 726574 ngram.py[line:84] [NGRAM][E2E] call=17 bid=0 step_idx=20proposed=[69716, 93956, 10553, 4923, 1919] found_in_context=True
INFO     2026-05-10 15:48:31,085 726574 ngram.py[line:84] [NGRAM][E2E] call=18 bid=0 step_idx=21proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
INFO     2026-05-10 15:48:31,085 726574 ngram.py[line:84] [NGRAM][E2E] call=18 bid=1 step_idx=20proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,090 726574 ngram.py[line:84] [NGRAM][E2E] call=19 bid=0 step_idx=22proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
INFO     2026-05-10 15:48:31,098 726574 ngram.py[line:84] [NGRAM][E2E] call=21 bid=1 step_idx=23proposed=[3671, 94312, 94035, 14045, 93956] found_in_context=True
INFO     2026-05-10 15:48:31,102 726574 ngram.py[line:84] [NGRAM][E2E] call=22 bid=0 step_idx=25proposed=[1472, 6946, 804, 93938, 853] found_in_context=True
INFO     2026-05-10 15:48:31,102 726574 ngram.py[line:84] [NGRAM][E2E] call=22 bid=1 step_idx=24proposed=[3, 94016, 1358, 3671, 93956] found_in_context=True
INFO     2026-05-10 15:48:31,106 726574 ngram.py[line:84] [NGRAM][E2E] call=23 bid=0 step_idx=31proposed=[4816, 93938, 10714, 93948, 23] found_in_context=True
INFO     2026-05-10 15:48:31,106 726574 ngram.py[line:84] [NGRAM][E2E] call=23 bid=1 step_idx=25proposed=[42854, 94035, 3991, 93956, 20932] found_in_context=True
INFO     2026-05-10 15:48:31,112 726574 ngram.py[line:84] [NGRAM][E2E] call=25 bid=0 step_idx=36proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
INFO     2026-05-10 15:48:31,113 726574 ngram.py[line:84] [NGRAM][E2E] call=25 bid=1 step_idx=27proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
INFO     2026-05-10 15:48:31,120 726574 ngram.py[line:84] [NGRAM][E2E] call=27 bid=1 step_idx=29proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,123 726574 ngram.py[line:84] [NGRAM][E2E] call=28 bid=0 step_idx=39proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
INFO     2026-05-10 15:48:31,127 726574 ngram.py[line:84] [NGRAM][E2E] call=29 bid=0 step_idx=40proposed=[3099, 23, 283] found_in_context=True
INFO     2026-05-10 15:48:31,127 726574 ngram.py[line:84] [NGRAM][E2E] call=29 bid=1 step_idx=31proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,132 726574 ngram.py[line:84] [NGRAM][E2E] call=30 bid=1 step_idx=32proposed=[4, 5, 3, 3, 94466] found_in_context=True
INFO     2026-05-10 15:48:31,142 726574 ngram.py[line:84] [NGRAM][E2E] call=33 bid=0 step_idx=44proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
INFO     2026-05-10 15:48:31,142 726574 ngram.py[line:84] [NGRAM][E2E] call=33 bid=1 step_idx=35proposed=[94016, 1358, 3671, 93956, 94405] found_in_context=True
INFO     2026-05-10 15:48:31,146 726574 ngram.py[line:84] [NGRAM][E2E] call=34 bid=0 step_idx=45proposed=[3099, 23, 283, 44055, 934] found_in_context=True
INFO     2026-05-10 15:48:31,147 726574 ngram.py[line:84] [NGRAM][E2E] call=34 bid=1 step_idx=36proposed=[93956, 73776, 93956, 94112, 96674] found_in_context=True
INFO     2026-05-10 15:48:31,154 726574 ngram.py[line:84] [NGRAM][E2E] call=35 bid=0 step_idx=46proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
INFO     2026-05-10 15:48:31,160 726574 ngram.py[line:84] [NGRAM][E2E] call=37 bid=0 step_idx=48proposed=[93938, 4816, 93938, 10714, 93948] found_in_context=True
INFO     2026-05-10 15:48:31,161 726574 ngram.py[line:84] [NGRAM][E2E] call=37 bid=1 step_idx=40proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,166 726574 ngram.py[line:84] [NGRAM][E2E] call=38 bid=0 step_idx=49proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
INFO     2026-05-10 15:48:31,173 726574 ngram.py[line:84] [NGRAM][E2E] call=40 bid=1 step_idx=43proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
INFO     2026-05-10 15:48:31,176 726574 ngram.py[line:84] [NGRAM][E2E] call=41 bid=0 step_idx=52proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
INFO     2026-05-10 15:48:31,176 726574 ngram.py[line:84] [NGRAM][E2E] call=41 bid=1 step_idx=44proposed=[94822, 93956, 97249] found_in_context=True
INFO     2026-05-10 15:48:31,179 726574 ngram.py[line:84] [NGRAM][E2E] call=42 bid=0 step_idx=53proposed=[3099, 23, 283, 44055, 934] found_in_context=True
INFO     2026-05-10 15:48:31,183 726574 ngram.py[line:84] [NGRAM][E2E] call=43 bid=0 step_idx=54proposed=[920, 853, 93963, 37993, 28685] found_in_context=True
INFO     2026-05-10 15:48:31,183 726574 ngram.py[line:84] [NGRAM][E2E] call=43 bid=1 step_idx=47proposed=[3671, 94312, 94035, 14045, 93956] found_in_context=True
INFO     2026-05-10 15:48:31,186 726574 ngram.py[line:84] [NGRAM][E2E] call=44 bid=0 step_idx=56proposed=[93938, 10714, 93948, 23, 5] found_in_context=True
INFO     2026-05-10 15:48:31,189 726574 ngram.py[line:84] [NGRAM][E2E] call=45 bid=0 step_idx=57proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
INFO     2026-05-10 15:48:31,190 726574 ngram.py[line:84] [NGRAM][E2E] call=45 bid=1 step_idx=49proposed=[42854, 94035, 3991, 93956, 20932] found_in_context=True
INFO     2026-05-10 15:48:31,193 726574 ngram.py[line:84] [NGRAM][E2E] call=46 bid=0 step_idx=58proposed=[28685, 23, 283, 93963, 920] found_in_context=True
INFO     2026-05-10 15:48:31,200 726574 ngram.py[line:84] [NGRAM][E2E] call=48 bid=0 step_idx=60proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
INFO     2026-05-10 15:48:31,200 726574 ngram.py[line:84] [NGRAM][E2E] call=48 bid=1 step_idx=52proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
INFO     2026-05-10 15:48:31,204 726574 ngram.py[line:84] [NGRAM][E2E] call=49 bid=0 step_idx=61proposed=[3099, 23, 283, 44055, 934] found_in_context=True
INFO     2026-05-10 15:48:31,204 726574 ngram.py[line:84] [NGRAM][E2E] call=49 bid=1 step_idx=53proposed=[94035, 10985] found_in_context=True

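After the ngram address-calculation bug in the kernel was fixed, the logs show that matches are found, and that in some cases multiple tokens are accepted in a single decode step after verification, for example: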
[NGRAM-DEBUG] call=22 slt=[6, 6] step_idx=[[25], [24], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[192, 77, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=23 slt=[6, 6] step_idx=[[31], [25], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[198, 78, 0, 0, 0, 0, 0, 0]
Between two consecutive proposer.run() calls, step_idx[0] increases from 25 to 31 and seq_len_decoder[0] (seq_dec in the log) increases from 192 to 198.
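
The jump in the two `[NGRAM-DEBUG]` lines can be read off mechanically. A minimal sketch, assuming `step_idx` counts generated tokens per request as the PR describes; the helper name `accepted_per_step` is hypothetical and not part of FastDeploy:

```python
def accepted_per_step(prev_values, curr_values):
    """Return how much each request's counter grew between two proposer.run() calls."""
    return [curr - prev for prev, curr in zip(prev_values, curr_values)]


# Values for the two active requests, copied from call=22 and call=23.
step_idx_call22, step_idx_call23 = [25, 24], [31, 25]
seq_dec_call22, seq_dec_call23 = [192, 77], [198, 78]

print(accepted_per_step(step_idx_call22, step_idx_call23))  # [6, 1]
print(accepted_per_step(seq_dec_call22, seq_dec_call23))    # [6, 1]
```

With `num_speculative_tokens=5`, a gain of 6 for request 0 would be consistent with all five draft tokens being accepted plus the one token the target model produces itself, although the log does not state that breakdown explicitly.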

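CUDAGraph adaptation
1. proposer.run() is executed in gpu_runner._postprocess(), which is not captured by CUDAGraph.
2. Verifying draft tokens requires feeding multiple tokens at once, which changes the expected_decode_len and batch_size captured for decode. The shapes that are expected to change therefore have to be captured in advance during the gpu worker's warmup phase, which requires modifying gpu_runner.capture_model() and the corresponding part of FDConfig.
3. The FDConfig used by the test script already enables CUDAGraph by default.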
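
A rough sketch of why the captured shapes change, assuming each decode request feeds `num_speculative_tokens + 1` tokens during verification. The function and its parameters are illustrative only and do not reproduce the actual `capture_model()` or FDConfig logic:

```python
def capture_token_counts(batch_sizes, num_speculative_tokens):
    """Map each batch size to the token count one verify step would feed the model."""
    # Assumption: every request carries its draft tokens plus one already-accepted token.
    tokens_per_request = num_speculative_tokens + 1
    return {bs: bs * tokens_per_request for bs in batch_sizes}


print(capture_token_counts([1, 2, 4, 8], num_speculative_tokens=5))
# {1: 6, 2: 12, 4: 24, 8: 48}
```

Under this assumption, a graph recorded for plain decode (one token per request) would not cover the verify step, which is why the warmup has to record the larger shapes ahead of time.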
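Overlap Schedule adaptation
1. input_ids_cpu is initialized in input_batch.py without pin_memory set, so it does not take part in overlap.
2. With enable_overlap_schedule=True in the test script, the log still shows correct matches and cases where the previous decode step accepted multiple tokens.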

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-11T08:11:02Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-11T08:55:42Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 22:47:36

The CI report is generated from the following code (updated every 30 minutes):

PR commit: da51855
Merge base: d70f33d (branch: develop)
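View full diff
CI details

1 Task overview

There is 1 failed Required task (Approval, pending review); it must be resolved before the PR can be merged.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped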
19(0)	19	12	2	3	2	0

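2 Task status summary

2.1 Required tasks: 1/2 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	9s	PR issue: the PR modifies the spec_decode directory and lacks FastDeploy RD approval	Ask freeliuzc or Deleter-D to approve this PR	Job	-
✅	The other 1 required task passed	-	-	-	-	-

2.2 Optional tasks: 11/17 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Check PR Template`	13s	Job	-
⏳	`xpu_build_test / xpu-build-test`	-	Job	-
⏳	`FD-Build-Linux / fd-build`	-	Job	-
⏳	`Trigger Jenkins for PR`	-	Job	-
⏸️	`Run iluvatar Tests / run_iluvatar_cases`	-	-	-
⏸️	`CI_HPU`	-	-	-
✅	The other 11 optional tasks passed	-	-	-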

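3 Failure details (required only)

Approval: code approval (confidence: high)

Approval

Status: ❌ Failed
Error type: code approval
Confidence: high
Root cause summary: the PR modifies the spec_decode directory and lacks approval from a FastDeploy RD member
Analyzer: generic analysis (fallback)

Root cause details:
The PR modifies the fastdeploy/spec_decode and custom_ops/gpu_ops/speculate_decoding directories. Under the FastDeploy code approval rules, approval from at least one FastDeploy RD member (freeliuzc(liuzichang01) or Deleter-D(wangyanpeng04)) is required to pass. 1 approval error is currently detected, exit code 6.

Key log: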

0. You must have one FastDeploy RD (freeliuzc(liuzichang01), Deleter-D(wangyanpeng04)) approval for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

Ask freeliuzc(liuzichang01) or Deleter-D(wangyanpeng04) to review and approve this PR

Suggested fix summary: ask freeliuzc or Deleter-D to approve this PR

Link: View logs

codecov-commenter · 2026-05-11T09:44:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@d70f33d). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7774   +/-   ##
==========================================
  Coverage           ?   71.53%           
==========================================
  Files              ?      396           
  Lines              ?    55822           
  Branches           ?     8724           
==========================================
  Hits               ?    39935           
  Misses             ?    13136           
  Partials           ?     2751

Flag	Coverage Δ
GPU	`71.53% <100.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-11 22:49:23

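📋 Review summary

PR overview: simplifies the ngram match kernel interface (merging input_ids/input_ids_len into token_ids_all), fixes the ngram pointer offset bug, and completes end-to-end verification
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/spec_decode/ngram.py, fastdeploy/config.py, fastdeploy/worker/gpu_model_runner.py, test files
Affected tags: [Speculative Decoding] [OP] [FDConfig]

📝 PR convention check

The title contains the official tag [Speculative Decoding] ✓; however, the PR body lacks the two required sections ## Usage or Command and ## Accuracy Tests, so its structure does not meet the description template.

Suggested PR description (can be copied directly):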

## Motivation
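Simplify the ngram match kernel interface by merging the previously separate `input_ids`/`input_ids_len` parameters into `token_ids_all` (with `prompt_lens` marking the boundary between prompt and generated tokens), and fix the ngram pointer offset bug (the semantics of `step_idx` are unified from a 0-based last-position index to a token count). Complete end-to-end verification on a single A800 GPU, confirming the end-to-end correctness of the ngram speculative decoding method.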

## Modifications
1. **`custom_ops/gpu_ops/speculate_decoding/ngram_match.cu` / `cpp_extensions.cc`**: remove the `input_ids`, `input_ids_len`, and `input_ids_stride` parameters; both the GPU kernel and the CPU fallback now read the prompt (the search domain) directly from `token_ids_all[:, :prompt_len]` and the pre_ids (the ngram source) from `token_ids_all[:, prompt_len:]`; fix the ngram pointer offset bug by changing `cur_step_idx + 1 - ngram_size` to `cur_step_idx - ngram_size`.
2. **`fastdeploy/spec_decode/ngram.py`**: remove the `input_ids_len`-related tensors and the `update()` method; align the `_run_impl` call signature with the new kernel interface.
3. **`fastdeploy/config.py`**: add `SpecMethod.NGRAM` to the expected_decode_len calculation for CUDAGraph capture.
4. **`fastdeploy/worker/gpu_model_runner.py`**: add a warmup path for the NGRAM method in `capture_model()`, consistent with MTP/SUFFIX.
5. **Tests**: update `tests/operators/test_ngram_match.py`, `tests/spec_decode/test_benchmark_ngram_kernel.py`, and `tests/spec_decode/test_ngram_gpu_kernel.py`; add `tests/spec_decode/test_ngram_proposer.py`.

## Usage or Command
N/A

## Accuracy Tests
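End-to-end verification (single A800 GPU): printed read/write logs of `token_ids_all` confirm that the prompt is written and read back consistently; all `draft_tokens` are verified to be found within `token_ids_all[:prompt_len + step_idx]`; the `step_idx` jump across steps (e.g. 25→31) confirms that a single decode step successfully accepted multiple speculative tokens.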

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

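Issues

Severity	File	Summary
❓ Question	`fastdeploy/spec_decode/ngram.py:38`	`update()` has been removed; confirm that no residual calls remain in `gpu_model_runner.py`

Overall assessment

The interface simplification is clearly reasoned, the bug fix (ngram pointer offset) is thoroughly verified by extensive e2e logs, and test coverage is complete. The description structure still needs the ## Usage or Command and ## Accuracy Tests sections.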

PaddlePaddle-bot · 2026-05-11T14:50:34Z

-        self.input_ids_len[bid] = seq_len
-        self.input_ids_len_gpu[bid] = seq_len

    def _run_impl(self, share_inputs):


❓ Question: the update() method has been removed in this PR. Please confirm that fastdeploy/worker/gpu_model_runner.py (e.g. in _postprocess) contains no residual proposer.update(bid, seq_len) calls; otherwise an AttributeError will be raised in NGRAM mode.

NKNaN (May 13, 2026): Confirmed that this method is not needed in ngram mode.

Copilot

Pull request overview

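This PR focuses on the ngram path of speculative decoding: it simplifies the ngram_match custom op interface (removing the separate input_ids/input_ids_len and reading prompt / pre_ids uniformly from token_ids_all + prompt_lens), adapts NgramProposer, the CUDA Graph warmup logic, and the related unit tests/benchmark accordingly, and completes end-to-end usability verification.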

Changes:

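Adjusted the ngram_match GPU custom op signature and internal addressing logic: the prompt is read from the front segment of token_ids_all, and the ngram is read from pre_ids[step_idx-ngram_size:step_idx].
Adapted the proposer call in fastdeploy/spec_decode/ngram.py and included SpecMethod.NGRAM in the warmup branch of CUDA Graph capture.
Added/updated tests and benchmark data construction for the ngram proposer and kernel, unifying the step_idx semantics (in the ngram_match scenario, "number of tokens generated so far").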
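
The addressing described in the first item can be illustrated with a small CPU-side sketch. It is a simplified reference under stated assumptions, not the kernel itself: it works on one request stored as a plain Python list, searches only the prompt, tries the longest ngram first, and uses the hypothetical name `propose_ngram`. The only parts taken from the PR are the layout (`token_ids_all[:prompt_len]` is the prompt, the rest is pre_ids) and the slice `pre_ids[step_idx - ngram_size:step_idx]`.

```python
def propose_ngram(token_ids_all, prompt_len, step_idx, max_ngram_size=5, max_draft=5):
    """Propose draft tokens by matching the trailing ngram of generated tokens in the prompt."""
    prompt = token_ids_all[:prompt_len]
    # step_idx is treated as the number of tokens generated so far (count semantics).
    pre_ids = token_ids_all[prompt_len:prompt_len + step_idx]

    # Assumption: prefer longer ngrams, falling back to shorter ones.
    for n in range(min(max_ngram_size, step_idx), 0, -1):
        ngram = pre_ids[step_idx - n:step_idx]  # the last n generated tokens
        for start in range(len(prompt) - n + 1):
            if prompt[start:start + n] == ngram:
                follow = prompt[start + n:start + n + max_draft]
                if follow:
                    return follow  # tokens that followed the match become the draft
    return []


# Prompt is [1..7]; the two generated tokens [2, 3] reappear inside the prompt.
row = [1, 2, 3, 4, 5, 6, 7, 2, 3]
print(propose_ngram(row, prompt_len=7, step_idx=2))  # [4, 5, 6, 7]
```

With count semantics the slice ends exactly at `step_idx`, so the start offset is `step_idx - ngram_size`; adding 1 to it would shift the window one position past the last generated token.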

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
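tests/spec_decode/test_ngram_proposer.py	Adds CUDA unit test coverage for NgramProposer (no proposal / match found / max_dec_len truncation).
tests/spec_decode/test_ngram_gpu_kernel.py	Updates the CPU reference implementation and test data construction to match the new ngram slice and step_idx semantics, and adapts to the new op signature.
tests/spec_decode/test_benchmark_ngram_kernel.py	Adapts benchmark data construction and invocation to the new op signature (based on token_ids_all/prompt_lens).
tests/operators/test_ngram_match.py	Adapts the operator-level unit tests to the new signature and token_ids_all layout.
fastdeploy/worker/gpu_model_runner.py	Includes NGRAM in the CUDA Graph capture warmup branch (same shape coverage as MTP/SUFFIX).
fastdeploy/spec_decode/ngram.py	NgramProposer now calls ngram_match directly with the new signature (no longer depends on input_ids_cpu/input_ids_len).
fastdeploy/config.py	Includes SpecMethod.NGRAM in the speculative shape derivation of the cudagraph size initialization logic.
custom_ops/gpu_ops/speculate_decoding/ngram_match.cu	Updates the ngram_match kernel/CPU path and static OP registration signature: removes input_ids/input_ids_len in favor of token_ids_all+prompt_lens.
custom_ops/gpu_ops/cpp_extensions.cc	Updates the NgramMatch extension declaration to match the new interface.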

refine ngram kernel signature and adapt ngram proposer logic

3aa6474

NKNaN had a problem deploying to Metax_ci May 11, 2026 08:10 — with GitHub Actions Failure

paddle-bot Bot added the contributor External developers label May 11, 2026

update old unittest

da51855

NKNaN had a problem deploying to Metax_ci May 11, 2026 14:39 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 11, 2026

freeliuzc requested a review from Copilot May 12, 2026 09:29

Copilot started reviewing on behalf of freeliuzc May 12, 2026 09:29

Copilot AI reviewed May 12, 2026

NKNaN wants to merge 2 commits into PaddlePaddle:develop from NKNaN:spec-ngram
