fix: Npu Group MatMul op patchs only in EP #205

Open
0hujun wants to merge 3 commits into modelscope:main from 0hujun:main

@0hujun 0hujun commented May 28, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

When Expert Parallelism (EP) is not enabled, each rank holds all expert weights.
In that case weight.transpose(-2, -1) produces a large non-contiguous view, and
npu_grouped_matmul forces a .contiguous() copy of it (~12.88 GB per MoE layer).
That copy is a bandwidth bottleneck and makes the NPU GMM patch ~8x slower than
the native per-expert fallback.

When EP is enabled, each rank holds only a subset of the expert weights. That
subset is small and contiguous, so npu_grouped_matmul is efficient.
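
To illustrate why the transposed weight forces a copy, here is a minimal pure-Python model of tensor strides. It is a sketch of the general stride mechanics only, not the torch or torch_npu implementation; the helper names and the tiny weight shape are hypothetical and chosen for illustration.

```python
from math import prod

def row_major_strides(shape):
    """Element strides of a densely packed (contiguous) row-major tensor."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))

def transpose_last_two(shape, strides):
    """Swap the last two dims as a view: only metadata changes, no data moves."""
    new_shape = tuple(shape[:-2]) + (shape[-1], shape[-2])
    new_strides = tuple(strides[:-2]) + (strides[-1], strides[-2])
    return new_shape, new_strides

def is_contiguous(shape, strides):
    """True when the strides match the row-major layout (size-1 dims ignored)."""
    expected = row_major_strides(shape)
    return all(s == e for size, s, e in zip(shape, strides, expected) if size != 1)

def copy_bytes(shape, itemsize):
    """Bytes written when a non-contiguous view is materialized densely."""
    return prod(shape) * itemsize

# Hypothetical expert-weight shape: (num_experts, out_features, in_features).
weight_shape = (8, 4, 6)
weight_strides = row_major_strides(weight_shape)
view_shape, view_strides = transpose_last_two(weight_shape, weight_strides)

print(is_contiguous(weight_shape, weight_strides))  # original layout
print(is_contiguous(view_shape, view_strides))      # transposed view
print(copy_bytes(view_shape, itemsize=2))           # cost of a dense copy, 2-byte dtype
```

The copy cost scales with the full element count, which is why it matters far more when one rank holds every expert's weights than when it holds only an EP shard.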

Experiment results

TWINKLE_NPU_GMM_PATCH=1

| Step | EP>1 | EP=1 | Ratio |
| --- | --- | --- | --- |
| step 0 (warmup) | 74.8s | 126.6s | 1.7x |
| step 1 (warmup) | 25.8s | 114.1s | 4.4x |
| steps 3-9 avg (steady) | ~14.4s | ~118s | ~8.2x |
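
The Ratio column can be reproduced by dividing the EP=1 step time by the EP>1 step time. A small sketch of that arithmetic, using only the numbers reported above (the steady-state values are approximate in the source, so their ratio is approximate too):

```python
# Step times in seconds, copied from the results above: (EP>1, EP=1).
# The steady-state entries are the approximate averages reported in the PR.
timings = {
    "step 0 (warmup)": (74.8, 126.6),
    "step 1 (warmup)": (25.8, 114.1),
    "steps 3-9 avg (steady)": (14.4, 118.0),
}

def slowdown(ep_time, no_ep_time):
    """How many times slower the EP=1 run is than the EP>1 run."""
    return no_ep_time / ep_time

ratios = {
    name: round(slowdown(ep, no_ep), 1)
    for name, (ep, no_ep) in timings.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```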


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the NPU monkey patching logic to make the HuggingFace MoE Grouped MatMul (GMM) patch Expert Parallelism (EP) aware. It introduces _is_ep_enabled to check if EP is active and modifies the patching logic to skip GMM patching by default or when EP is not enabled, avoiding significant overhead from contiguous copies on transposed weights. The review comments point out two critical issues: a potential AttributeError when accessing model.device_mesh directly, and a logical contradiction where the default value for TWINKLE_NPU_GMM_PATCH is set to True instead of False, which conflicts with the intended behavior of skipping the patch by default when unset.
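
The two review points can be sketched as follows. This is not the code in src/twinkle/kernel/monkey_patch_npu.py: only `_is_ep_enabled`, `model.device_mesh`, and `TWINKLE_NPU_GMM_PATCH` come from the discussion above, while `_env_flag`, `should_patch_gmm`, and the `ep_size` attribute are hypothetical names assumed for illustration.

```python
import os
from types import SimpleNamespace

def _env_flag(name, default=False):
    """Parse a boolean environment variable; unset falls back to `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

def _is_ep_enabled(model):
    """Return True when the model reports an expert-parallel size above 1.

    ASSUMPTION: the mesh exposes a hypothetical `ep_size` attribute; the real
    device mesh may expose the EP degree differently.
    """
    # getattr with a default avoids AttributeError when device_mesh is absent.
    mesh = getattr(model, "device_mesh", None)
    ep_size = getattr(mesh, "ep_size", None)
    return isinstance(ep_size, int) and ep_size > 1

def should_patch_gmm(model):
    """Apply the GMM patch only when explicitly opted in and EP is active."""
    # Default is False so that an unset variable skips the patch.
    opted_in = _env_flag("TWINKLE_NPU_GMM_PATCH", default=False)
    return opted_in and _is_ep_enabled(model)

# Usage with stand-in objects (no real model or mesh involved).
model_with_ep = SimpleNamespace(device_mesh=SimpleNamespace(ep_size=4))
model_without_mesh = SimpleNamespace()
print(should_patch_gmm(model_with_ep))
print(should_patch_gmm(model_without_mesh))
```

Defaulting the flag to False keeps the unset case aligned with the stated intent of skipping the patch, and the `getattr` fallbacks keep models without a mesh on the native path instead of raising.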

Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated
Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated
0hujun and others added 2 commits May 28, 2026 15:24
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>