
Qwen3-Next fine-tuned with ms-swift fails to deploy with vLLM when MTP speculative decoding is enabled; deployment works without speculative decoding #6221

Description

@aoboxia

Deployment command:

```
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve /xxx/qwen3-next/20251017/hf-full-202510171312 --served-model-name qwen3-next --port 9001 --trust-remote-code --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --enable-prefix-caching --max-model-len 4096 --reasoning-parser deepseek_r1 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```

Deploying the original Qwen3-Next model with these same arguments works without error.

Error log:

```
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]   File "/usr/local/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]   File "/usr/local/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]   File "/usr/local/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]   File "/usr/local/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2642, in load_model
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]     self.drafter.load_model(self.model)
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]   File "/usr/local/lib/python3.11/site-packages/vllm/v1/spec_decode/eagle.py", line 821, in load_model
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]     self.model = get_model(vllm_config=self.vllm_config,
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]   File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 119, in get_model
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]     return loader.load_model(vllm_config=vllm_config,
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]   File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]     self.load_weights(model, model_config)
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]   File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/model_loader/default_loader.py", line 276, in load_weights
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597]     raise ValueError("Following weights were not initialized from "
(Worker_TP0 pid=1766) ERROR 10-20 16:59:32 [multiproc_executor.py:597] ValueError: Following weights were not initialized from checkpoint: {'model.layers.0.self_attn.q_norm.weight', 'model.pre_fc_norm_hidden.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.0.mlp.shared_expert.gate_up_proj.weight', 'model.layers.0.self_attn.qkv_proj.weight', 'model.layers.0.mlp.shared_expert_gate.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.0.mlp.experts.w2_weight', 'model.layers.0.mlp.experts.w13_weight', 'model.fc.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.norm.weight', 'model.layers.0.self_attn.k_norm.weight', 'model.layers.0.mlp.shared_expert.down_proj.weight', 'model.pre_fc_norm_embedding.weight', 'model.layers.0.mlp.gate.weight'}
```
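The tensors the loader reports as uninitialized (`model.fc.weight`, `model.pre_fc_norm_embedding.weight`, `model.pre_fc_norm_hidden.weight`, and the layer-0 stack) belong to the MTP draft head that the drafter tries to load from the same checkpoint. One quick way to check whether the ms-swift export still carries those tensors is to scan the safetensors weight index; a hedged sketch, where the `mtp.` prefix is an assumption about how the base Qwen3-Next checkpoint names its MTP tensors and should be verified against the original model directory:

```python
import json
import os


def mtp_keys(ckpt_dir, prefix="mtp."):
    """List checkpoint tensor names under the given prefix.

    Assumption: the base Qwen3-Next checkpoint stores the MTP draft head
    under an ``mtp.`` prefix. Run this on both the original model directory
    and the fine-tuned export and compare the outputs.
    """
    index_path = os.path.join(ckpt_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    return sorted(k for k in weight_map if k.startswith(prefix))
```

If the fine-tuned export returns an empty list while the base model does not, the MTP head was dropped during training or export, so the speculative-decoding path has nothing to load.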


config.json:

```
{
  "architectures": [
    "Qwen3NextForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "decoder_sparse_step": 1,
  "eos_token_id": 151645,
  "full_attention_interval": 4,
  "head_dim": 256,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5120,
  "linear_conv_kernel_dim": 4,
  "linear_key_head_dim": 128,
  "linear_num_key_heads": 16,
  "linear_num_value_heads": 32,
  "linear_value_head_dim": 128,
  "max_position_embeddings": 262144,
  "mlp_only_layers": [],
  "model_type": "qwen3_next",
  "moe_intermediate_size": 512,
  "norm_topk_prob": true,
  "num_attention_heads": 16,
  "num_experts": 512,
  "num_experts_per_tok": 10,
  "num_hidden_layers": 48,
  "num_key_value_heads": 2,
  "output_router_logits": false,
  "partial_rotary_factor": 0.25,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000000,
  "router_aux_loss_coef": 0.001,
  "shared_expert_intermediate_size": 512,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.57.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
```
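Since the same serve command works on the original model, another quick check is to diff the fine-tuned `config.json` against the base model's, to see whether the ms-swift export dropped or rewrote any fields. A hedged sketch; both paths are placeholders:

```python
import json


def diff_configs(base_path, tuned_path):
    """Report config.json keys that the fine-tune export dropped or changed.

    Returns (dropped, changed): keys present in the base config but missing
    from the tuned one, and keys present in both but with different values.
    """
    with open(base_path) as f:
        base = json.load(f)
    with open(tuned_path) as f:
        tuned = json.load(f)
    dropped = sorted(set(base) - set(tuned))
    changed = sorted(k for k in set(base) & set(tuned) if base[k] != tuned[k])
    return dropped, changed
```

Any dropped or changed field is a candidate for why vLLM's drafter cannot reconstruct and load the MTP head from the fine-tuned checkpoint.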

