
@JeffLee1874 JeffLee1874 commented Nov 29, 2025

What this PR does / why we need it?

Add support for the PanguUltraMoE model.

Does this PR introduce any user-facing change?

How was this patch tested?

Test result

Start serving with the W8A8 quantized model and the ACL graph:

Master node:

vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

Other nodes:

vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --headless \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-start-rank 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

Request & Response:

  • Request
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
      {"role": "system", "content": ""},
      {"role": "user", "content": "你是谁?"}
    ],
        "max_tokens": "64",
        "top_p": "0.95",
        "top_k": "50",
        "temperature": "0.6",
        "add_special_tokens" : true
    }'
  • Response
[unused16] Okay, the user is asking who I am, and I need to answer according to the earlier setup. First, my role is Pangu, developed by Huawei, a reasoning model. I should emphasize that my main function is answering questions and providing information support, especially handling complex tasks through logical reasoning and data analysis. The answer needs to stay concise, be in Chinese, and match the user's

Start serving with the W8A8 quantized model and the Torchair graph:

Master node:

vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":true}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

Other nodes:

vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --headless \
        --data-parallel-size 2 \
        --data-parallel-size-local 1 \
        --data-parallel-start-rank 1 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13389 \
        --tensor-parallel-size 16 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-batched-tokens 256 \
        --max-num-seqs 18 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        --quantization ascend \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":true}}' \
        --speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

Request & Response:
The request and response are the same as with the ACL graph.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the PanguUltraMoE model, introducing new model implementation files for Torchair and updating various configurations. The changes are extensive. I have identified several critical issues concerning incorrect weight loading and model registration which would prevent the model from functioning as expected. Additionally, there are some areas for improvement in terms of code maintainability, such as dead code and confusing variable assignments. My review includes specific suggestions to address these points.

Comment on lines 1037 to 1041
stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]

critical

The stacked_params_mapping is incomplete. It's missing the mappings for q_a_proj and kv_a_proj_with_mqa. This will cause incorrect weight loading for the attention layers, which is a critical bug. Please add the missing mappings.

Suggested change
stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
    ("kv_a_proj_with_mqa", "kv_a_proj_with_mqa", 0),
]
if hasattr(self.config, "q_lora_rank") and self.config.q_lora_rank:
    stacked_params_mapping.append(("q_a_proj", "q_a_proj", 0))
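
As background on why a missing entry matters: in a typical vLLM-style load_weights loop, stacked_params_mapping rewrites checkpoint shard names onto the fused parameter and dispatches them with a shard_id, while any name that is not listed falls through to a plain whole-tensor copy. The sketch below is a self-contained toy illustration of that routing, not the code in this PR; route_checkpoint_name is a made-up helper.

stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]

def route_checkpoint_name(ckpt_name: str):
    """Return (target parameter name, shard_id) for a checkpoint tensor name."""
    for param_name, shard_name, shard_id in stacked_params_mapping:
        if shard_name in ckpt_name:
            # Stacked shard: rename onto the fused parameter, load by shard_id.
            return ckpt_name.replace(shard_name, param_name), shard_id
    # Unmapped names fall through to a plain, non-stacked copy.
    return ckpt_name, None

print(route_checkpoint_name("model.layers.0.mlp.up_proj.weight"))
# ('model.layers.0.mlp.gate_up_proj.weight', 1)
print(route_checkpoint_name("model.layers.0.self_attn.q_a_proj.weight"))
# ('model.layers.0.self_attn.q_a_proj.weight', None)  <- never reaches the fused loader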

Comment on lines 195 to 199
packed_modules_mapping = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "experts":
    ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
}

critical

The packed_modules_mapping is incomplete. It's missing the mapping for fused_qkv_a_proj, which is used by TorchairOpenPanguMLAAttention within TorchairOpenPanguDecoderLayer. This will lead to incorrect weight loading and is a critical issue.

Suggested change
packed_modules_mapping = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "experts":
    ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
}
packed_modules_mapping = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "experts":
    ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"],
}

Comment on lines 164 to 167
ModelRegistry.register_model(
    "OpenPanguMTPModel",
    "vllm_ascend.torchair.models.torchair_openpangu_mtp:TorchairOpenPanguModel"
)

critical

The model registered for OpenPanguMTPModel seems incorrect. It points to TorchairOpenPanguModel in torchair_openpangu_mtp.py, but that class is not defined in that file. It should probably be TorchairOpenPanguMTP to ensure the correct model is registered.

Suggested change
ModelRegistry.register_model(
    "OpenPanguMTPModel",
    "vllm_ascend.torchair.models.torchair_openpangu_mtp:TorchairOpenPanguModel"
)
ModelRegistry.register_model(
    "OpenPanguMTPModel",
    "vllm_ascend.torchair.models.torchair_openpangu_mtp:TorchairOpenPanguMTP"
)
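
A quick, hedged sanity check after the rename (assumes the vllm_ascend plugin is installed and its registration hook has run): registration is lazy, so a wrong class name in the "module:ClassName" string only surfaces when the class is first resolved at model load time, not when register_model is called.

from vllm import ModelRegistry

# The architecture string should be listed once the plugin has registered it;
# the class-name typo itself would only show up when the model is loaded.
print("OpenPanguMTPModel" in ModelRegistry.get_supported_archs())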

Comment on lines 80 to 129
class Indexer(nn.Module):

    def __init__(self,
                 config,
                 dim: int = 7168,
                 n_heads: int = 64,
                 head_dim: int = 128,
                 index_topk: int = 2048,
                 q_lora_rank: int = 1536,
                 rope_head_dim: int = 64,
                 quant_config: Optional[QuantizationConfig] = None,
                 prefix: Optional[str] = ""):
        super().__init__()

        self.dim: int = dim  # 7168
        self.n_heads: int = n_heads  # 64
        self.head_dim: int = head_dim  # 128
        self.rope_head_dim: int = rope_head_dim  # 64
        self.index_topk: int = index_topk  # 2048
        self.q_lora_rank: int = q_lora_rank  # 1536
        self.wq_b = ReplicatedLinear(
            self.q_lora_rank,
            self.n_heads * self.head_dim,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.wq_b",
            return_bias=False,
        )
        self.wk = ReplicatedLinear(
            self.dim,
            self.head_dim,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.wk",
            return_bias=False,
        )
        self.weights_proj = ReplicatedLinear(
            self.dim,
            self.n_heads,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.weights_proj",
            return_bias=False,
        )
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.softmax_scale = self.head_dim**-0.5

    def forward(self):
        return


high

The Indexer class appears to be unused. Its forward method is a placeholder, and it's not instantiated or used anywhere in this file. This is dead code and should be removed to improve maintainability.

self.tp_rank = get_tp_group().rank_in_group
self.ep_group = get_ep_group()

self.params_dtype = torch.get_default_dtype()

high

This line reassigns self.params_dtype, which nullifies the conditional assignment on line 382 based on self.enable_super_kernel. This is confusing and could lead to bugs if self.params_dtype is used later. Please remove this line.
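
Concretely, the pattern being flagged reduces to the following made-up minimal example (the dtypes and the enable_super_kernel flag are placeholders, not the PR's actual values): the unconditional assignment always wins, so the branch above it has no effect.

import torch

enable_super_kernel = True  # stand-in for self.enable_super_kernel
# Stand-in for the conditional dtype choice the reviewer refers to.
params_dtype = torch.bfloat16 if enable_super_kernel else torch.get_default_dtype()
# The flagged line: unconditionally overwrites the choice above, making it dead.
params_dtype = torch.get_default_dtype()
print(params_dtype)  # the default dtype, regardless of enable_super_kernel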

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
