[Dev](moe):Refine A2A overlap under CUDA_DEVICE_MAX_CONNECTIONS=1 #2730
base: dev
Conversation
Hi @Wohox. I'd really appreciate any feedback when you have time, especially if there are design or implementation concerns I should address. Thanks a lot!
f_input = f_layer.post_combine.forward(f_input)
f_input = f_layer.mtp_post_process.forward(f_input)

if is_last_layer:
post_forward and post_backward shouldn't be called in the layer schedule; you can wait for the current stream in the chunk schedule instead.
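One possible reading of this suggestion, as a minimal sketch (the names and structure below are illustrative, not the PR's or Megatron-LM's actual code): the chunk-level schedule waits for whatever work the layer schedule has already queued on the current stream and then performs the post-processing once per chunk.

```python
import torch

def chunk_post_process(post_forward, chunk_output):
    # Wait for all work the layer schedule has queued on the current stream,
    # then run the chunk-level post-processing exactly once, instead of
    # calling post_forward from inside the per-layer schedule.
    torch.cuda.current_stream().synchronize()
    return post_forward(chunk_output)
```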
b_grad = b_layer.attn.backward(b_grad)

if is_last_layer:
    if b_schedule_plan is not None and post_backward is not None:
Same as above
is_last_layer_in_bwd (bool):
    Whether the current layer is the last layer in the backward pass.
fine_grained_overlap (bool):
- Can you run experiments to show that split combine is always no less efficient than the original design (I think so even if max conn = 32)? If so, there is no need to distinguish between the two.
Our original intention was to make minimally invasive changes to the original code and provide an incremental optimization. As you pointed out, if our approach can achieve performance that is no worse than the existing Megatron-LM implementation when CUDA_DEVICE_MAX_CONNECTIONS = 32, it would be reasonable to appropriately merge the two to reduce code redundancy. After careful consideration, we have revised the implementation accordingly and provide the following theoretical analysis and experimental results. If there are any edge cases that we have not considered, we would greatly appreciate your feedback.
When CUDA_DEVICE_MAX_CONNECTIONS = 1, our implementation outperforms the existing Megatron-LM implementation; theoretical analysis and experimental evidence have already been provided in the PR description.
When CUDA_DEVICE_MAX_CONNECTIONS = 32:
- Theoretical analysis
From a theoretical perspective, the computation–communication overlap execution timelines of the Megatron-LM implementation and ours are shown below. Although some execution orders differ, the overall effects are equivalent.
- Megatron-LM:
- Ours:
- Experimental results
We conducted experiments on DeepSeek-V3 with different configurations, including EP16 and EP32. The results show that the performance of our approach is almost identical to that of the existing Megatron-LM implementation, and in some cases even slightly better (see the additional observations below).
- Additional observations
Finally, we observed that in certain cases, even with CUDA_DEVICE_MAX_CONNECTIONS = 32, our approach outperforms the Megatron-LM implementation. As shown in the two figures below, because the combine_fwd communication in our approach is launched relatively earlier, it can exhibit better robustness in scenarios where CPU overhead is present.
fine_grained_overlap (bool):
    Enable fine-grained communication / computation overlap
post_forward (callable or None):
Please remove post_forward, post_backward, f_schedule_plan and b_schedule_plan from the function signature.
# release tensor references after use
if shared_expert_output is not None:
    shared_expert_output.untyped_storage().resize_(0)
Wondering what's special about shared_expert_output here; why can't it be deallocated like the other detached tensors?
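For context, a standalone illustration of the storage-release pattern in the snippet above (not the PR's code): calling resize_(0) on the untyped storage frees the CUDA allocation immediately while the Python tensor object stays alive.

```python
import torch

# Temporary GPU tensor whose memory should be returned as soon as it is no
# longer needed.
buf = torch.randn(4096, 4096, device="cuda")

# ... last use of `buf` ...

# Shrinking the underlying storage to zero bytes releases the CUDA allocation
# right away; the tensor object survives, but its data must not be read again.
buf.untyped_storage().resize_(0)
```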
Setting CUDA_DEVICE_MAX_CONNECTIONS to 1 or 32 depends on which one is the most prominent factor when TP comm and EP comm both exist. The chance of TP comm overlapping is reduced but not impossible when setting it to 32 at TP=2, and in many cases EP comm is the most prominent factor, so we think setting the env to 32 should be good as of now. Could you perform some experiments on the DSV3 model to show the necessity of this PR?
@lhb8125 Thank you for your response. Indeed, in most cases, setting CUDA_DEVICE_MAX_CONNECTIONS to 32 can achieve very good overlap for EP communication. However, this PR targets memory-constrained post-training scenarios with long sequences on a relatively small number of GPUs. For example, when training DeepSeek-V3 with long sequences (32K, 64K, etc.) on 128 GPUs using a parallel strategy of TP=8 and EP=16, TP communication accounts for about 5–6% of the runtime, while EP communication accounts for roughly 10%. In such cases, we usually aim to overlap both TP and EP communications to achieve optimal performance. Our approach provides an optimization opportunity for this specific scenario. In addition, we have observed that setting CUDA_DEVICE_MAX_CONNECTIONS to 32 can conflict with certain optimizations, potentially leading to training instability, such as gradient norm becoming NaN. We are preparing the required machines and will supplement this PR with additional experimental results soon to further demonstrate the necessity of this approach.


Problem Description
The current overlap_moe_expert_parallel_comm in Megatron-LM requires CUDA_DEVICE_MAX_CONNECTIONS to be set to a relatively large value (#2180 (comment)).
However, in certain scenarios, ensuring the correctness of communication and computation scheduling requires setting CUDA_DEVICE_MAX_CONNECTIONS=1. This constraint can weaken the parallelism of All-to-All (A2A) overlap in practice, leading to a noticeable gap between the achieved optimization benefits and the theoretical expectations.
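For reference, the variable is read by the CUDA runtime when the device context is created, so whichever value is chosen has to be in the environment before the first CUDA call; a minimal illustration (in practice it is usually exported in the launch script rather than set from Python):

```python
import os

# Must be set before the CUDA context is created in this process.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

import torch

torch.cuda.init()  # the context created here sees the single-connection setting
```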
Root Cause Analysis
The A2A overlap optimization works by splitting modules and overlapping the computation of one micro-batch with the communication of another, thereby hiding EP communication latency. The current scheduling logic can be summarized as follows:
The theoretical execution timeline of the overlap optimization is illustrated below:
However, in memory-constrained scenarios where both TP and CP are greater than 1, we are forced to set CUDA_DEVICE_MAX_CONNECTIONS=1 to ensure the correctness of synchronization. Under this constraint, the effective execution timeline of the overlap becomes as follows:
As can be observed, the original overlap logic is completely disrupted. This issue is also discussed in #2630 (comment). We attribute this behavior to setting CUDA_DEVICE_MAX_CONNECTIONS=1, which enforces a serialized kernel submission model. Under this model, the launch order of computation and communication kernels is forced to be consistent across all devices, effectively eliminating the intended concurrency between computation and communication.
We initially observed that when CUDA_DEVICE_MAX_CONNECTIONS=1 is set, achieving overlap between computation and communication requires launching the communication first, followed by the computation.
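As a minimal sketch of this launch-order observation (illustrative only, assuming a dedicated communication stream; not the PR's code):

```python
import torch

comm_stream = torch.cuda.Stream()

def overlap_step(a2a_comm, compute, comm_input, compute_input):
    # Under CUDA_DEVICE_MAX_CONNECTIONS=1, kernels reach the GPU in launch
    # order even across streams, so the A2A has to be enqueued first ...
    with torch.cuda.stream(comm_stream):
        comm_out = a2a_comm(comm_input)
    # ... and the independent computation only afterwards, so it can run
    # underneath the in-flight communication.
    compute_out = compute(compute_input)
    return comm_out, compute_out
```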
In the current implementation, the launch of mlp_bwd occurs before dispatch_fwd. As a result, under CUDA_DEVICE_MAX_CONNECTIONS=1, it is impossible to overlap mlp_bwd with dispatch_fwd. Although this overlap can be enabled by reordering the launch sequence, doing so degrades the overlap behavior of several subsequent modules. This issue is also discussed in [QUESTION] MoE communication & computation can only overlap partially #2180 (comment).
As illustrated in the figure above, this launch-order constraint causes subsequent A2A communication modules to overlap with the next computation module instead, leading to a misaligned (shifted) overlap pattern.
For combine_fwd and post_attn_bwd → attn_bwd, regardless of which is launched first, overlap cannot be achieved when CUDA_DEVICE_MAX_CONNECTIONS=1. This behavior is also reported in [QUESTION] MoE communication & computation can only overlap partially #2180 (comment).
We believe this is due to inherent data dependencies: the computation of mlp_bda must occur after mlp.combine, which forces combine_fwd, post_attn_bwd → attn_bwd, and PP_fwd to execute serially.
In the current implementation, combine_fwd is launched before post_attn_bwd → attn_bwd. However, because combine_fwd ends with a bda operation, it cannot overlap with post_attn_bwd → attn_bwd.
Furthermore, since the launch of PP_fwd occurs after post_attn_bwd → attn_bwd, PP_fwd cannot be overlapped either.
Our Solution
Our solution can be summarized in two steps. First, we split the original combine module into two separate stages, combine and post_combine, thereby decoupling communication from computation within the combine phase. Second, by further adjusting the scheduling logic, we are able to achieve overlap between all A2A communications and PP communications, even when CUDA_DEVICE_MAX_CONNECTIONS=1 is enforced.
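A hedged sketch of what this split might look like (class names and the simplified bodies below are illustrative and assume a plain torch.distributed All-to-All plus the bias-dropout-add epilogue mentioned above; this is not the exact implementation):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


class Combine:
    """Communication-only stage: just the combine All-to-All."""

    def __init__(self, ep_group):
        self.ep_group = ep_group

    def forward(self, expert_output):
        combined = torch.empty_like(expert_output)
        # Pure EP communication; nothing is computed here, so the schedule can
        # enqueue it as early as its inputs are ready.
        dist.all_to_all_single(combined, expert_output, group=self.ep_group)
        return combined


class PostCombine:
    """Computation that previously sat at the tail of combine, e.g. the
    bias-dropout-add (bda) epilogue, now schedulable independently."""

    def forward(self, hidden_states, bias, residual, p_drop=0.0):
        return residual + F.dropout(hidden_states + bias, p=p_drop)
```

With the communication isolated in Combine, the scheduler can launch it early and slot other computation (for example, another micro-batch's backward) in front of PostCombine, which is what allows the A2A communications to stay hidden even when CUDA_DEVICE_MAX_CONNECTIONS=1 enforces a single launch order.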
The newly designed scheduling logic can be summarized as follows:
The theoretical execution timeline of the overlap optimization is illustrated below:
Evaluation
Finally, we evaluated the performance of the DeepSeek-V3.1 model on 64 Hopper-architecture GPUs using this optimization. With DeepEP optimization and A2A overlap enabled, and with CUDA_DEVICE_MAX_CONNECTIONS=1 set, our approach achieves more than a 10% performance improvement in our target scenario compared to the current A2A overlap implementation in Megatron-LM.
Nsight Systems Comparison
Before Optimization (Nsight Systems):
As shown above, the actual Nsight Systems trace matches the previously illustrated behavior of the current implementation. The overlap logic is misaligned, resulting in exposed (non-overlapped) communication phases for both combine_fwd and PP_fwd.
After Optimization (Nsight Systems):
With our optimization applied, all A2A communications and PP communications are successfully overlapped.
Summary
Overall, our optimization provides a more effective All-to-All (A2A) overlap solution for scenarios in which CUDA_DEVICE_MAX_CONNECTIONS=1 must be enforced. In other words, our approach makes A2A overlap optimization no longer heavily dependent on setting CUDA_DEVICE_MAX_CONNECTIONS=32.