
@JeffLee1874 JeffLee1874 commented Nov 29, 2025

What this PR does / why we need it?

Add support for the PanguUltraMoE model.

Does this PR introduce any user-facing change?

How was this patch tested?

Test result

Start serving with the W8A8 quantized model and the ACL graph:

Master node:

vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

Other nodes:

vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --headless \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-start-rank 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

Request & Response:

  • Request
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
      {"role": "system", "content": ""},
      {"role": "user", "content": "你是谁?"}
    ],
        "max_tokens": "64",
        "top_p": "0.95",
        "top_k": "50",
        "temperature": "0.6",
        "add_special_tokens" : true
    }'
  • Response
[unused16] Okay, the user is asking who I am, and I need to answer according to the earlier setup. First, my role is Pangu, developed by Huawei, a reasoning model. I should emphasize that my main function is answering questions and providing information support, especially handling complex tasks through logical reasoning and data analysis. The answer needs to stay concise, be in Chinese, and match the user's

Start serving with the W8A8 quantized model and the Torchair graph:

Master node:

vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":true}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

Other nodes:

vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --headless \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-start-rank 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":true}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

Request & Response:
The request and response are the same as with the ACL graph.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the PanguUltraMoE model, introducing new model implementation files for Torchair and updating various configurations. The changes are extensive. I have identified several critical issues concerning incorrect weight loading and model registration which would prevent the model from functioning as expected. Additionally, there are some areas for improvement in terms of code maintainability, such as dead code and confusing variable assignments. My review includes specific suggestions to address these points.

Comment on lines 1037 to 1041
stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]

critical

The stacked_params_mapping is incomplete. It's missing the mappings for q_a_proj and kv_a_proj_with_mqa. This will cause incorrect weight loading for the attention layers, which is a critical bug. Please add the missing mappings.

Suggested change
stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
    ("kv_a_proj_with_mqa", "kv_a_proj_with_mqa", 0),
]
if hasattr(self.config, "q_lora_rank") and self.config.q_lora_rank:
    stacked_params_mapping.append(("q_a_proj", "q_a_proj", 0))
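
As background on why a missing entry matters: in a typical vLLM-style load_weights loop, stacked_params_mapping rewrites checkpoint shard names onto the fused parameter and dispatches them with a shard_id, while any name that is not listed falls through to a plain whole-tensor copy. The sketch below is a self-contained toy illustration of that routing, not the code in this PR; route_checkpoint_name is a made-up helper.

stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]

def route_checkpoint_name(ckpt_name: str):
    """Return (target parameter name, shard_id) for a checkpoint tensor name."""
    for param_name, shard_name, shard_id in stacked_params_mapping:
        if shard_name in ckpt_name:
            # Stacked shard: rename onto the fused parameter, load by shard_id.
            return ckpt_name.replace(shard_name, param_name), shard_id
    # Unmapped names fall through to a plain, non-stacked copy.
    return ckpt_name, None

print(route_checkpoint_name("model.layers.0.mlp.up_proj.weight"))
# ('model.layers.0.mlp.gate_up_proj.weight', 1)
print(route_checkpoint_name("model.layers.0.self_attn.q_a_proj.weight"))
# ('model.layers.0.self_attn.q_a_proj.weight', None)  <- never reaches the fused loader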

Comment on lines 195 to 199
packed_modules_mapping = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "experts":
    ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
}

critical

The packed_modules_mapping is incomplete. It's missing the mapping for fused_qkv_a_proj, which is used by TorchairOpenPanguMLAAttention within TorchairOpenPanguDecoderLayer. This will lead to incorrect weight loading and is a critical issue.

Suggested change
packed_modules_mapping = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "experts":
    ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
}
packed_modules_mapping = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "experts":
    ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"],
}

Comment on lines 164 to 167
ModelRegistry.register_model(
    "OpenPanguMTPModel",
    "vllm_ascend.torchair.models.torchair_openpangu_mtp:TorchairOpenPanguModel"
)

critical

The model registered for OpenPanguMTPModel seems incorrect. It points to TorchairOpenPanguModel in torchair_openpangu_mtp.py, but that class is not defined in that file. It should probably be TorchairOpenPanguMTP to ensure the correct model is registered.

Suggested change
ModelRegistry.register_model(
    "OpenPanguMTPModel",
    "vllm_ascend.torchair.models.torchair_openpangu_mtp:TorchairOpenPanguModel"
)
ModelRegistry.register_model(
    "OpenPanguMTPModel",
    "vllm_ascend.torchair.models.torchair_openpangu_mtp:TorchairOpenPanguMTP"
)
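
A quick, hedged sanity check after the rename (assumes the vllm_ascend plugin is installed and its registration hook has run): registration is lazy, so a wrong class name in the "module:ClassName" string only surfaces when the class is first resolved at model load time, not when register_model is called.

from vllm import ModelRegistry

# The architecture string should be listed once the plugin has registered it;
# the class-name typo itself would only show up when the model is loaded.
print("OpenPanguMTPModel" in ModelRegistry.get_supported_archs())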

Comment on lines 80 to 129
class Indexer(nn.Module):

    def __init__(self,
                 config,
                 dim: int = 7168,
                 n_heads: int = 64,
                 head_dim: int = 128,
                 index_topk: int = 2048,
                 q_lora_rank: int = 1536,
                 rope_head_dim: int = 64,
                 quant_config: Optional[QuantizationConfig] = None,
                 prefix: Optional[str] = ""):
        super().__init__()

        self.dim: int = dim  # 7168
        self.n_heads: int = n_heads  # 64
        self.head_dim: int = head_dim  # 128
        self.rope_head_dim: int = rope_head_dim  # 64
        self.index_topk: int = index_topk  # 2048
        self.q_lora_rank: int = q_lora_rank  # 1536
        self.wq_b = ReplicatedLinear(
            self.q_lora_rank,
            self.n_heads * self.head_dim,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.wq_b",
            return_bias=False,
        )
        self.wk = ReplicatedLinear(
            self.dim,
            self.head_dim,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.wk",
            return_bias=False,
        )
        self.weights_proj = ReplicatedLinear(
            self.dim,
            self.n_heads,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.weights_proj",
            return_bias=False,
        )
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.softmax_scale = self.head_dim**-0.5

    def forward(self):
        return


high

The Indexer class appears to be unused. Its forward method is a placeholder, and it's not instantiated or used anywhere in this file. This is dead code and should be removed to improve maintainability.

self.tp_rank = get_tp_group().rank_in_group
self.ep_group = get_ep_group()

self.params_dtype = torch.get_default_dtype()

high

This line reassigns self.params_dtype, which nullifies the conditional assignment on line 382 based on self.enable_super_kernel. This is confusing and could lead to bugs if self.params_dtype is used later. Please remove this line.
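
Concretely, the pattern being flagged reduces to the following made-up minimal example (the dtypes and the enable_super_kernel flag are placeholders, not the PR's actual values): the unconditional assignment always wins, so the branch above it has no effect.

import torch

enable_super_kernel = True  # stand-in for self.enable_super_kernel
# Stand-in for the conditional dtype choice the reviewer refers to.
params_dtype = torch.bfloat16 if enable_super_kernel else torch.get_default_dtype()
# The flagged line: unconditionally overwrites the choice above, making it dead.
params_dtype = torch.get_default_dtype()
print(params_dtype)  # the default dtype, regardless of enable_super_kernel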

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
