[model] Support PanguUltraMoE #4561
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds support for the PanguUltraMoE model, introducing new model implementation files for Torchair and updating various configurations. The changes are extensive. I have identified several critical issues concerning incorrect weight loading and model registration which would prevent the model from functioning as expected. Additionally, there are some areas for improvement in terms of code maintainability, such as dead code and confusing variable assignments. My review includes specific suggestions to address these points.
```python
stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
```
The stacked_params_mapping is incomplete. It's missing the mappings for q_a_proj and kv_a_proj_with_mqa. This will cause incorrect weight loading for the attention layers, which is a critical bug. Please add the missing mappings.
Suggested change:

```python
stacked_params_mapping = [
    # (param_name, shard_name, shard_id)
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
    ("kv_a_proj_with_mqa", "kv_a_proj_with_mqa", 0),
]
if hasattr(self.config, "q_lora_rank") and self.config.q_lora_rank:
    stacked_params_mapping.append(("q_a_proj", "q_a_proj", 0))
```
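For context, here is a minimal sketch of how a vLLM-style `load_weights` loop typically consumes this mapping; the names `params_dict` and the `weight_loader` attribute follow the common pattern in vLLM model files, not necessarily this PR's exact code. It shows why a missing entry silently routes a shard down the wrong path:

```python
def load_weights_sketch(params_dict, weights, stacked_params_mapping):
    # Illustrative only: each checkpoint tensor name is matched against
    # the mapping; on a hit, the shard is loaded into the stacked
    # parameter at the given shard_id. A tensor with no matching entry
    # falls through to the generic path and is loaded as a standalone
    # weight, which is wrong for shards of a fused parameter.
    for name, loaded_weight in weights:
        for param_name, shard_name, shard_id in stacked_params_mapping:
            if shard_name not in name:
                continue
            name = name.replace(shard_name, param_name)
            param = params_dict[name]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            param = params_dict[name]
            param.weight_loader(param, loaded_weight)
```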
```python
packed_modules_mapping = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "experts":
    ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
}
```
The packed_modules_mapping is incomplete. It's missing the mapping for fused_qkv_a_proj, which is used by TorchairOpenPanguMLAAttention within TorchairOpenPanguDecoderLayer. This will lead to incorrect weight loading and is a critical issue.
Suggested change:

```python
packed_modules_mapping = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "experts":
    ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"],
    "fused_qkv_a_proj": ["q_a_proj", "kv_a_proj_with_mqa"],
}
```
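To illustrate why the entry matters, here is a hypothetical helper (`unpack_module_names` is not part of vLLM or this PR) showing the kind of lookup that quantization and LoRA code performs: a fused runtime module name must be expandable back to the checkpoint modules it packs.

```python
def unpack_module_names(fused_name: str,
                        packed_modules_mapping: dict) -> list:
    # Hypothetical sketch: map a fused runtime module name back to the
    # per-checkpoint module names it packs. Without the
    # "fused_qkv_a_proj" entry, callers cannot recover q_a_proj /
    # kv_a_proj_with_mqa and the fused weights are mishandled.
    for packed, originals in packed_modules_mapping.items():
        if fused_name.endswith(packed):
            prefix = fused_name[:-len(packed)]
            return [prefix + orig for orig in originals]
    return [fused_name]

# unpack_module_names("layers.0.self_attn.fused_qkv_a_proj", mapping)
# -> ["layers.0.self_attn.q_a_proj",
#     "layers.0.self_attn.kv_a_proj_with_mqa"]
```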
vllm_ascend/torchair/utils.py
```python
ModelRegistry.register_model(
    "OpenPanguMTPModel",
    "vllm_ascend.torchair.models.torchair_openpangu_mtp:TorchairOpenPanguModel"
)
```
The model registered for OpenPanguMTPModel seems incorrect. It points to TorchairOpenPanguModel in torchair_openpangu_mtp.py, but that class is not defined in that file. It should probably be TorchairOpenPanguMTP to ensure the correct model is registered.
Suggested change:

```python
ModelRegistry.register_model(
    "OpenPanguMTPModel",
    "vllm_ascend.torchair.models.torchair_openpangu_mtp:TorchairOpenPanguMTP"
)
```
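A quick way to catch this class of mistake, sketched against vLLM's public `ModelRegistry` API: `register_model` only records the lazy `"module:Class"` string, so a wrong class name does not fail at registration time and only surfaces when the class is resolved.

```python
from vllm import ModelRegistry

# The architecture name is visible immediately after registration...
assert "OpenPanguMTPModel" in ModelRegistry.get_supported_archs()

# ...but the class is only imported on resolution; this is the call that
# would raise for a class missing from the module (treat as a sketch of
# the API, not part of this PR).
model_cls, arch = ModelRegistry.resolve_model_cls(["OpenPanguMTPModel"])
```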
```python
class Indexer(nn.Module):

    def __init__(self,
                 config,
                 dim: int = 7168,
                 n_heads: int = 64,
                 head_dim: int = 128,
                 index_topk: int = 2048,
                 q_lora_rank: int = 1536,
                 rope_head_dim: int = 64,
                 quant_config: Optional[QuantizationConfig] = None,
                 prefix: Optional[str] = ""):
        super().__init__()

        self.dim: int = dim  # 7168
        self.n_heads: int = n_heads  # 64
        self.head_dim: int = head_dim  # 128
        self.rope_head_dim: int = rope_head_dim  # 64
        self.index_topk: int = index_topk  # 2048
        self.q_lora_rank: int = q_lora_rank  # 1536
        self.wq_b = ReplicatedLinear(
            self.q_lora_rank,
            self.n_heads * self.head_dim,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.wq_b",
            return_bias=False,
        )
        self.wk = ReplicatedLinear(
            self.dim,
            self.head_dim,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.wk",
            return_bias=False,
        )
        self.weights_proj = ReplicatedLinear(
            self.dim,
            self.n_heads,
            bias=False,
            quant_config=quant_config,
            prefix=f"{prefix}.weights_proj",
            return_bias=False,
        )
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.softmax_scale = self.head_dim**-0.5

    def forward(self):
        # NOTE: stub that returns nothing -- this appears to be the dead
        # code flagged in the review summary above.
        return
```
```python
self.tp_rank = get_tp_group().rank_in_group
self.ep_group = get_ep_group()

self.params_dtype = torch.get_default_dtype()
```
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
To support the PanguUltraMoE model.
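For context (not part of the PR description): once served, the model is reachable through vLLM's standard OpenAI-compatible API. The served-model name below is hypothetical.

```python
import requests

# vLLM exposes an OpenAI-compatible endpoint, by default at port 8000;
# "PanguUltraMoE" stands in for whatever --served-model-name is used.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "PanguUltraMoE",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```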
Does this PR introduce any user-facing change?
How was this patch tested?
Test result
Start serving using the W8A8 quantized model and the ACL graph:
Master node:
Other nodes:
Request & Response:
Start serving using the W8A8 quantized model and the Torchair graph:
Master node:
Other nodes:
Request & Response:
Request and response are the same as with the ACL graph.