fix: Npu Group MatMul op patchs only in EP by 0hujun · Pull Request #205 · modelscope/twinkle

0hujun · 2026-05-28T07:22:23Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

When Expert Parallelism (EP) is not enabled, each rank holds all expert weights.
In that case weight.transpose(-2, -1) produces a large non-contiguous view, which
npu_grouped_matmul forces through .contiguous() (a copy of ~12.88 GB per MoE layer).
That copy is a bandwidth bottleneck that makes the NPU GMM patch ~8x slower than
the native per-expert fallback.
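
To illustrate why the transposed weight counts as non-contiguous, here is a minimal pure-Python model of tensor strides. It is only a sketch: the helper names and the tiny 8 x 4 x 6 shape are made up for illustration and are not taken from this PR or from the NPU kernels.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose_last_two(shape, strides):
    """Model transpose(-2, -1): swap the last two dims and strides; no data moves."""
    shape, strides = list(shape), list(strides)
    shape[-2], shape[-1] = shape[-1], shape[-2]
    strides[-2], strides[-1] = strides[-1], strides[-2]
    return tuple(shape), tuple(strides)


def is_contiguous(shape, strides):
    """True if the strides match a dense row-major layout (size-1 dims ignored)."""
    expected = contiguous_strides(shape)
    return all(d == 1 or s == e for d, s, e in zip(shape, strides, expected))


# Hypothetical stacked expert weights: 8 experts, each a 4 x 6 matrix.
shape = (8, 4, 6)
strides = contiguous_strides(shape)                      # (24, 6, 1)
t_shape, t_strides = transpose_last_two(shape, strides)  # (8, 6, 4), (24, 1, 6)

print(is_contiguous(shape, strides))      # original layout is dense
print(is_contiguous(t_shape, t_strides))  # transposed view is not
```

A kernel that requires a dense layout has to copy such a view into fresh memory first, which is the .contiguous() cost described above.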

When EP is enabled, each rank holds only a subset of the expert weights, which is
small and contiguous, so npu_grouped_matmul is efficient.
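
The effect of EP on the size of that copy can be estimated with simple arithmetic. The sketch below assumes experts are split evenly across EP ranks; the expert count, matrix dimensions and dtype width are hypothetical and are not the configuration behind the ~12.88 GB figure.

```python
def copy_bytes_per_rank(num_experts, rows, cols, bytes_per_element, ep_size=1):
    """Bytes a rank would copy to make its stacked expert weights contiguous.

    Assumes experts are divided evenly across `ep_size` ranks.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    local_experts = num_experts // ep_size
    return local_experts * rows * cols * bytes_per_element


# Hypothetical layer: 64 experts, 2048 x 1024 weights, 2 bytes per element.
full = copy_bytes_per_rank(64, 2048, 1024, 2)                # EP disabled
sharded = copy_bytes_per_rank(64, 2048, 1024, 2, ep_size=8)  # EP across 8 ranks

print(full // 2**20, "MiB without EP")
print(sharded // 2**20, "MiB with ep_size=8")
```

Under these assumptions the per-rank copy shrinks in proportion to the EP size.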

Experiment results

TWINKLE_NPU_GMM_PATCH=1

| Step | EP>1 | EP=1 | Ratio (EP=1 / EP>1) |
| --- | --- | --- | --- |
| step 0 (warmup) | 74.8s | 126.6s | 1.7x |
| step 1 (warmup) | 25.8s | 114.1s | 4.4x |
| steps 3-9 avg (steady) | ~14.4s | ~118s | ~8.2x |
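
The Ratio column is consistent with dividing the EP=1 step time by the EP>1 step time. A short check using the timings reported above (the steady-state values are approximate in the source):

```python
# (EP>1 seconds, EP=1 seconds) per row, copied from the results above.
timings = {
    "step 0 (warmup)": (74.8, 126.6),
    "step 1 (warmup)": (25.8, 114.1),
    "steps 3-9 avg (steady)": (14.4, 118.0),
}

# Slowdown of EP=1 relative to EP>1, rounded to one decimal place.
ratios = {name: round(ep1 / ep_n, 1) for name, (ep_n, ep1) in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```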

gemini-code-assist

Code Review

This pull request updates the NPU monkey patching logic to make the HuggingFace MoE Grouped MatMul (GMM) patch Expert Parallelism (EP) aware. It introduces _is_ep_enabled to check if EP is active and modifies the patching logic to skip GMM patching by default or when EP is not enabled, avoiding significant overhead from contiguous copies on transposed weights. The review comments point out two critical issues: a potential AttributeError when accessing model.device_mesh directly, and a logical contradiction where the default value for TWINKLE_NPU_GMM_PATCH is set to True instead of False, which conflicts with the intended behavior of skipping the patch by default when unset.
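
The two review points can be sketched as follows. This is not the code from the PR: the helper names, the `ep_size` attribute on the mesh object, and the set of accepted truthy strings are all assumptions made for illustration. Only the TWINKLE_NPU_GMM_PATCH variable and the model.device_mesh attribute come from the discussion above.

```python
import os
from types import SimpleNamespace


def env_flag(name, default=False):
    """Read a boolean env var; unset means `default` (False, so the patch is opt-in)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def is_ep_enabled(model):
    """Return True only if the model exposes a mesh with more than one EP rank."""
    # getattr with a default avoids AttributeError on models without device_mesh.
    mesh = getattr(model, "device_mesh", None)
    if mesh is None:
        return False
    ep_size = getattr(mesh, "ep_size", None)  # hypothetical attribute name
    return ep_size is not None and ep_size > 1


def should_apply_gmm_patch(model):
    """Apply the GMM patch only when explicitly requested and EP is active."""
    return env_flag("TWINKLE_NPU_GMM_PATCH") and is_ep_enabled(model)


# Stand-in objects for demonstration only.
ep_model = SimpleNamespace(device_mesh=SimpleNamespace(ep_size=8))
no_ep_model = SimpleNamespace(device_mesh=SimpleNamespace(ep_size=1))
no_mesh_model = SimpleNamespace()

for label, model in [("EP=8", ep_model), ("EP=1", no_ep_model), ("no mesh", no_mesh_model)]:
    print(label, should_apply_gmm_patch(model))
```

Defaulting the flag to False keeps the unset case on the native per-expert path, which matches the stated intent of skipping the patch by default.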

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

fix: Npu Group MatMul op patchs only in EP

98e69cd

gemini-code-assist Bot reviewed May 28, 2026

Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated

Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated

0hujun and others added 2 commits May 28, 2026 15:24

Update src/twinkle/kernel/monkey_patch_npu.py

1992ca0

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update src/twinkle/kernel/monkey_patch_npu.py

598c5ab

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

tastelikefeet approved these changes May 28, 2026

0hujun wants to merge 3 commits into modelscope:main from 0hujun:main
