[XPU] [Optimization] [EP] EP communication optimization. #5145
base: develop
Thanks for your contribution!
| self.group = group | ||
| self.num_local_experts = num_experts // ep_size | ||
| self.deepep_engine = None | ||
| self.deepep_engine = None # deepep_engine只调用dispatch, combine |
Write all comments in English.
| if_only_decode = self.only_decode() | ||
| if ( | ||
| self.fd_config.scheduler_config.splitwise_role == "mixed" | ||
| ): # Centralized scenario: phase is initialized to prefill by default; at inference time, different batch types can switch phase here | ||
| self.fd_config.model_config.moe_phase.phase = "decode" if if_only_decode else "prefill" | ||
only_decoder=self.forward_meta.len_info_cpu[0]<=0
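The phase switch discussed in this thread can be sketched as follows. This is a minimal illustration, not the code in this PR: `select_moe_phase` is a hypothetical helper, and reading `len_info_cpu[0]` as the amount of prefill work in the batch is an assumption based on the reviewer's suggestion above.

```python
from typing import Sequence


def select_moe_phase(splitwise_role: str, len_info_cpu: Sequence[int], current_phase: str) -> str:
    """Return the MoE phase to use for the next forward pass.

    Assumption (not confirmed by the PR): len_info_cpu[0] counts the prefill
    work in the batch, so a value <= 0 means the batch is decode-only.
    """
    if splitwise_role != "mixed":
        # Outside the centralized ("mixed") role the phase is left unchanged.
        return current_phase
    only_decode = len_info_cpu[0] <= 0
    return "decode" if only_decode else "prefill"


if __name__ == "__main__":
    print(select_moe_phase("mixed", [0, 8], "prefill"))
    print(select_moe_phase("mixed", [3, 5], "prefill"))
```

Keeping the decision in one pure function makes the decode-only condition easy to unit test without building a full batch.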
| permute_input, | ||
| token_nums_per_expert, | ||
| valid_token_num, | ||
| max(1, valid_token_num), # Ensure the value is non-zero even on an empty run |
Support the valid_token_num=0 case inside the moe_expert_ffn operator.
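A rough sketch of what handling the zero-token case inside the operator could look like, so that callers no longer need `max(1, valid_token_num)`. The function `expert_ffn_sketch` is a toy, pure-Python stand-in invented for illustration; it is not the real `moe_expert_ffn` operator and does not reflect its signature.

```python
from typing import List


def expert_ffn_sketch(rows: List[List[float]], valid_token_num: int, scale: float = 2.0) -> List[List[float]]:
    """Toy stand-in for an expert FFN: scales the first valid_token_num rows.

    Hypothetical helper for illustration only; the real operator runs on device.
    """
    if valid_token_num < 0 or valid_token_num > len(rows):
        raise ValueError("valid_token_num must be within [0, len(rows)]")
    if valid_token_num == 0:
        # Empty run: there is nothing to compute, so return an empty result
        # instead of requiring the caller to pass a padded count of 1.
        return []
    return [[scale * x for x in row] for row in rows[:valid_token_num]]


if __name__ == "__main__":
    padded = [[1.0, 2.0], [0.0, 0.0]]
    print(expert_ffn_sketch(padded, 1))
    print(expert_ffn_sketch(padded, 0))
```

With the guard inside the operator, the call site can pass `valid_token_num` through unchanged, and no work is done on padding rows during an empty run.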
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #5145 +/- ##
==========================================
Coverage ? 57.86%
==========================================
Files ? 317
Lines ? 38315
Branches ? 5727
==========================================
Hits ? 22171
Misses ? 14380
Partials ? 1764
Motivation
In centralized inference scenarios, use the low-latency communication operators for decode-only (D) requests and the high-throughput communication operators for prefill (P) requests.
Modifications
Usage or Command
export MOE_FFN_USE_DENSE_INPUT=1
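For reference, a boolean environment switch such as `MOE_FFN_USE_DENSE_INPUT` is typically read along the following lines. This is only a sketch: `env_flag` is a hypothetical helper, and the set of accepted truthy values is an assumption, since the PR does not show how the variable is actually parsed.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean switch.

    Hypothetical helper: the accepted truthy spellings are an assumption.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    # Equivalent to `export MOE_FFN_USE_DENSE_INPUT=1` in the launching shell.
    os.environ["MOE_FFN_USE_DENSE_INPUT"] = "1"
    print(env_flag("MOE_FFN_USE_DENSE_INPUT"))
```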
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
If submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.