[Dev](moe):Refine A2A overlap under CUDA_DEVICE_MAX_CONNECTIONS=1 #2730
base: dev
Conversation
Hi @Wohox. I'd really appreciate any feedback when you have time, especially if there are design or implementation concerns I should address. Thanks a lot!
f_input = f_layer.post_combine.forward(f_input)
f_input = f_layer.mtp_post_process.forward(f_input)

if is_last_layer:
post_forward and post_backward shouldn't be called in the layer schedule; you can wait for the current stream in the chunk schedule instead.
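One possible reading of this suggestion, as a minimal sketch (the names and structure below are illustrative, not the PR's or Megatron-LM's actual code): the chunk-level schedule waits for whatever work the layer schedule has already queued on the current stream and then performs the post-processing once per chunk.

```python
import torch

def chunk_post_process(post_forward, chunk_output):
    # Wait for all work the layer schedule has queued on the current stream,
    # then run the chunk-level post-processing exactly once, instead of
    # calling post_forward from inside the per-layer schedule.
    torch.cuda.current_stream().synchronize()
    return post_forward(chunk_output)
```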
b_grad = b_layer.attn.backward(b_grad)

if is_last_layer:
    if b_schedule_plan is not None and post_backward is not None:
Same as above
is_last_layer_in_bwd (bool):
    Whether the current layer is the last layer in the backward pass.
fine_grained_overlap (bool):
- Can you run experiments to show that split combine is always no less efficient than the original design (I think so even if max conn = 32)? If so, there is no need to distinguish between the two.
Our original intention was to make minimally invasive changes to the original code and provide an incremental optimization. As you pointed out, if our approach can achieve performance that is no worse than the existing Megatron-LM implementation when CUDA_DEVICE_MAX_CONNECTIONS = 32, it would be reasonable to appropriately merge the two to reduce code redundancy. After careful consideration, we have revised the implementation accordingly and provide the following theoretical analysis and experimental results. If there are any edge cases that we have not considered, we would greatly appreciate your feedback.
When CUDA_DEVICE_MAX_CONNECTIONS = 1, our implementation outperforms the existing Megatron-LM implementation; theoretical analysis and experimental evidence have already been provided in the PR description.
When CUDA_DEVICE_MAX_CONNECTIONS = 32:
- Theoretical analysis
From a theoretical perspective, the computation–communication overlap execution timelines of the Megatron-LM implementation and ours are shown below. Although some execution orders differ, the overall effects are equivalent.
- Megatron-LM:
- Ours:
- Experimental results
We conducted experiments on DeepSeek-V3 with different configurations, including EP16 and EP32. The results show that the performance of our approach is almost identical to that of the existing Megatron-LM implementation, and in some cases even slightly better (see the additional observations below).
- Additional observations
Finally, we observed that in certain cases, even with CUDA_DEVICE_MAX_CONNECTIONS = 32, our approach outperforms the Megatron-LM implementation. As shown in the two figures below, because the combine_fwd communication in our approach is launched relatively earlier, it can exhibit better robustness in scenarios where CPU overhead is present.
fine_grained_overlap (bool):
    Enable fine-grained communication / computation overlap
post_forward (callable or None):
Please remove post_forward, post_backward, f_schedule_plan and b_schedule_plan from the function signature.
# release tensor references after use
if shared_expert_output is not None:
    shared_expert_output.untyped_storage().resize_(0)
Wondering what's special about shared_expert_output here; why can't it be deallocated like the other detached tensors?
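For context, a standalone illustration of the storage-release pattern in the snippet above (not the PR's code): calling resize_(0) on the untyped storage frees the CUDA allocation immediately while the Python tensor object stays alive.

```python
import torch

# Temporary GPU tensor whose memory should be returned as soon as it is no
# longer needed.
buf = torch.randn(4096, 4096, device="cuda")

# ... last use of `buf` ...

# Shrinking the underlying storage to zero bytes releases the CUDA allocation
# right away; the tensor object survives, but its data must not be read again.
buf.untyped_storage().resize_(0)
```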
Setting CUDA_DEVICE_MAX_CONNECTIONS to 1 or 32 depends on which one is the most prominent factor when TP comm and EP comm both exist. The chance of TP comm overlapping is reduced but not impossible when setting it to 32 at TP=2, and in many cases EP comm is the most prominent factor, so we think setting the env to 32 should be good as of now. Could you perform some experiments on the DSV3 model to show the necessity of this PR?
@lhb8125 Thank you for your response. Indeed, in most cases, setting CUDA_DEVICE_MAX_CONNECTIONS to 32 can achieve very good overlap for EP communication. However, this PR targets memory-constrained post-training scenarios with long sequences on a relatively small number of GPUs. For example, when training DeepSeek-V3 with long sequences (32K, 64K, etc.) on 128 GPUs using a parallel strategy of TP=8 and EP=16, TP communication accounts for about 5–6% of the runtime, while EP communication accounts for roughly 10%. In such cases, we usually aim to overlap both TP and EP communications to achieve optimal performance. Our approach provides an optimization opportunity for this specific scenario. In addition, we have observed that setting CUDA_DEVICE_MAX_CONNECTIONS to 32 can conflict with certain optimizations, potentially leading to training instability, such as gradient norm becoming NaN. We are preparing the required machines and will supplement this PR with additional experimental results soon to further demonstrate the necessity of this approach.


Problem Description
The current overlap_moe_expert_parallel_comm in Megatron-LM requires CUDA_DEVICE_MAX_CONNECTIONS to be set to a relatively large value (#2180 (comment)).
However, in certain scenarios, ensuring the correctness of communication and computation scheduling requires setting CUDA_DEVICE_MAX_CONNECTIONS=1. This constraint can weaken the parallelism of All-to-All (A2A) overlap in practice, leading to a noticeable gap between the achieved optimization benefits and the theoretical expectations.
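For reference, the variable is read by the CUDA runtime when the device context is created, so whichever value is chosen has to be in the environment before the first CUDA call; a minimal illustration (in practice it is usually exported in the launch script rather than set from Python):

```python
import os

# Must be set before the CUDA context is created in this process.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

import torch

torch.cuda.init()  # the context created here sees the single-connection setting
```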
Root Cause Analysis
The A2A overlap optimization works by splitting modules and overlapping the computation of one micro-batch with the communication of another, thereby hiding EP communication latency. The current scheduling logic can be summarized as follows:
The theoretical execution timeline of the overlap optimization is illustrated below:
However, in memory-constrained scenarios where both TP and CP are greater than 1, we are forced to set CUDA_DEVICE_MAX_CONNECTIONS=1 to ensure the correctness of synchronization. Under this constraint, the effective execution timeline of the overlap becomes as follows:
As can be observed, the original overlap logic is completely disrupted. This issue is also discussed in #2630 (comment). We attribute this behavior to setting CUDA_DEVICE_MAX_CONNECTIONS=1, which enforces a serialized kernel submission model. Under this model, the launch order of computation and communication kernels is forced to be consistent across all devices, effectively eliminating the intended concurrency between computation and communication.
We initially observed that when CUDA_DEVICE_MAX_CONNECTIONS=1 is set, achieving overlap between computation and communication requires launching the communication first, followed by the computation.
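As a minimal sketch of this launch-order observation (illustrative only, assuming a dedicated communication stream; not the PR's code):

```python
import torch

comm_stream = torch.cuda.Stream()

def overlap_step(a2a_comm, compute, comm_input, compute_input):
    # Under CUDA_DEVICE_MAX_CONNECTIONS=1, kernels reach the GPU in launch
    # order even across streams, so the A2A has to be enqueued first ...
    with torch.cuda.stream(comm_stream):
        comm_out = a2a_comm(comm_input)
    # ... and the independent computation only afterwards, so it can run
    # underneath the in-flight communication.
    compute_out = compute(compute_input)
    return comm_out, compute_out
```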
In the current implementation, the launch of mlp_bwd occurs before dispatch_fwd. As a result, under CUDA_DEVICE_MAX_CONNECTIONS=1, it is impossible to overlap mlp_bwd with dispatch_fwd. Although this overlap can be enabled by reordering the launch sequence, doing so degrades the overlap behavior of several subsequent modules. This issue is also discussed in [QUESTION] MoE communication & computation can only overlap partially #2180 (comment).
As illustrated in the figure above, this launch-order constraint causes subsequent A2A communication modules to overlap with the next computation module instead, leading to a misaligned (shifted) overlap pattern.
For combine_fwd and post_attn_bwd → attn_bwd, regardless of which is launched first, overlap cannot be achieved when CUDA_DEVICE_MAX_CONNECTIONS=1. This behavior is also reported in [QUESTION] MoE communication & computation can only overlap partially #2180 (comment).
We believe this is due to inherent data dependencies: the computation of mlp_bda must occur after mlp.combine, which forces combine_fwd, post_attn_bwd → attn_bwd, and PP_fwd to execute serially.
In the current implementation, combine_fwd is launched before post_attn_bwd → attn_bwd. However, because combine_fwd ends with a bda operation, it cannot overlap with post_attn_bwd → attn_bwd.
Furthermore, since the launch of PP_fwd occurs after post_attn_bwd → attn_bwd, PP_fwd cannot be overlapped either.
Our Solution
Our solution can be summarized in two steps. First, we split the original combine module into two separate stages, combine and post_combine, thereby decoupling communication from computation within the combine phase. Second, by further adjusting the scheduling logic, we are able to achieve overlap between all A2A communications and PP communications, even when CUDA_DEVICE_MAX_CONNECTIONS=1 is enforced.
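A hedged sketch of what this split might look like (class names and the simplified bodies below are illustrative and assume a plain torch.distributed All-to-All plus the bias-dropout-add epilogue mentioned above; this is not the exact implementation):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


class Combine:
    """Communication-only stage: just the combine All-to-All."""

    def __init__(self, ep_group):
        self.ep_group = ep_group

    def forward(self, expert_output):
        combined = torch.empty_like(expert_output)
        # Pure EP communication; nothing is computed here, so the schedule can
        # enqueue it as early as its inputs are ready.
        dist.all_to_all_single(combined, expert_output, group=self.ep_group)
        return combined


class PostCombine:
    """Computation that previously sat at the tail of combine, e.g. the
    bias-dropout-add (bda) epilogue, now schedulable independently."""

    def forward(self, hidden_states, bias, residual, p_drop=0.0):
        return residual + F.dropout(hidden_states + bias, p=p_drop)
```

With the communication isolated in Combine, the scheduler can launch it early and slot other computation (for example, another micro-batch's backward) in front of PostCombine, which is what allows the A2A communications to stay hidden even when CUDA_DEVICE_MAX_CONNECTIONS=1 enforces a single launch order.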
The newly designed scheduling logic can be summarized as follows:
The theoretical execution timeline of the overlap optimization is illustrated below:
Evaluation
Finally, we evaluated the performance of the DeepSeek-V3.1 model on 64 Hopper-architecture GPUs using this optimization. With DeepEP optimization and A2A overlap enabled, and with CUDA_DEVICE_MAX_CONNECTIONS=1 set, our approach achieves more than a 10% performance improvement in our target scenario compared to the current A2A overlap implementation in Megatron-LM.
Nsight Systems Comparison
Before Optimization (Nsight Systems):
As shown above, the actual Nsight Systems trace matches the previously illustrated behavior of the current implementation. The overlap logic is misaligned, resulting in exposed (non-overlapped) communication phases for both combine_fwd and PP_fwd.
After Optimization (Nsight Systems):
With our optimization applied, all A2A communications and PP communications are successfully overlapped.
Summary
Overall, our optimization provides a more effective All-to-All (A2A) overlap solution for scenarios in which CUDA_DEVICE_MAX_CONNECTIONS=1 must be enforced. In other words, our approach makes A2A overlap optimization no longer heavily dependent on setting CUDA_DEVICE_MAX_CONNECTIONS=32.