
Conversation


@Baidu-AIAK Baidu-AIAK commented Dec 21, 2025

Problem Description

The current overlap_moe_expert_parallel_comm in Megatron-LM requires CUDA_DEVICE_MAX_CONNECTIONS to be set to a relatively large value (#2180 (comment)).

However, in certain scenarios, ensuring the correctness of communication and computation scheduling requires setting CUDA_DEVICE_MAX_CONNECTIONS=1. This constraint can weaken the parallelism of All-to-All (A2A) overlap in practice, leading to a noticeable gap between the achieved optimization benefits and the theoretical expectations.

Root Cause Analysis

The A2A overlap optimization works by splitting modules and overlapping the computation of one micro-batch with the communication of another, thereby hiding EP communication latency. The current scheduling logic can be summarized as follows:

| Stream | Phase 1 | Phase 2 | Phase 3 |
| --- | --- | --- | --- |
| comm_stream | combine_bwd | dispatch_fwd → dispatch_bwd | combine_fwd |
| comp_stream | attn_fwd → post_attn_fwd | mlp_bwd → mlp_bwd_dw → mlp_fwd | post_attn_bwd → attn_bwd |
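To make the schedule above concrete, here is a minimal, self-contained sketch (assumed helper and variable names, not Megatron-LM code) of how such a two-stream schedule is typically expressed: one micro-batch's computation is launched on comp_stream while another micro-batch's A2A runs on comm_stream, and CUDA events carry the cross-stream dependencies.

```python
import torch

comp_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

def launch(stream, fn, *args, waits=()):
    """Launch fn(*args) on `stream`, waiting on the given CUDA events first."""
    with torch.cuda.stream(stream):
        for ev in waits:
            stream.wait_event(ev)
        out = fn(*args)
    done = torch.cuda.Event()
    done.record(stream)
    return out, done

# Dummy kernels standing in for the real modules.
x = torch.randn(4096, 4096, device="cuda")
compute = lambda t: t @ t   # stands in for attn_fwd / mlp_bwd / ...
comm = lambda t: t * 1.0    # stands in for a dispatch / combine A2A

# Phase 2 of the table: dispatch_fwd of one micro-batch (comm_stream) overlaps
# mlp_bwd of another micro-batch (comp_stream); dispatch_bwd is only launched
# once mlp_bwd has produced the gradients it consumes.
mlp_grad, mlp_ev = launch(comp_stream, compute, x)          # mlp_bwd
_ = launch(comm_stream, comm, x)                            # dispatch_fwd
_ = launch(comm_stream, comm, mlp_grad, waits=(mlp_ev,))    # dispatch_bwd
torch.cuda.synchronize()
```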

The theoretical execution timeline of the overlap optimization is illustrated below:

[Figure: base-Theoretical, the theoretical execution timeline of the current overlap]

However, in memory-constrained scenarios where both TP and CP are greater than 1, we are forced to set CUDA_DEVICE_MAX_CONNECTIONS=1 to ensure the correctness of synchronization. Under this constraint, the effective execution timeline of the overlap becomes as follows:

[Figure: base-Current, the effective execution timeline under CUDA_DEVICE_MAX_CONNECTIONS=1]

As can be observed, the original overlap logic is completely disrupted. This issue is also discussed in #2630 (comment) . We attribute this behavior to setting CUDA_DEVICE_MAX_CONNECTIONS=1, which enforces a serialized kernel submission model. Under this model, the launch order of computation and communication kernels is forced to be consistent across all devices, effectively eliminating the intended concurrency between computation and communication.

  • We initially observed that when CUDA_DEVICE_MAX_CONNECTIONS=1 is set, achieving overlap between computation and communication requires launching the communication first, followed by the computation.

    • In the current implementation, the launch of mlp_bwd occurs before dispatch_fwd. As a result, under CUDA_DEVICE_MAX_CONNECTIONS=1, it is impossible to overlap mlp_bwd with dispatch_fwd. Although this overlap can be enabled by reordering the launch sequence, doing so degrades the overlap behavior of several subsequent modules. This issue is also discussed in [QUESTION] MoE communication & computation can only overlap partially #2180 (comment) .

    • As illustrated in the figure above, this launch-order constraint causes subsequent A2A communication modules to overlap with the next computation module instead, leading to a misaligned (shifted) overlap pattern.

  • For combine_fwd and post_attn_bwd → attn_bwd, regardless of which is launched first, overlap cannot be achieved when CUDA_DEVICE_MAX_CONNECTIONS=1. This behavior is also reported in [QUESTION] MoE communication & computation can only overlap partially #2180 (comment).

    • We believe this is due to inherent data dependencies: the computation of mlp_bda must occur after mlp.combine, which forces combine_fwd, post_attn_bwd → attn_bwd, and PP_fwd to execute serially.

    • In the current implementation, combine_fwd is launched before post_attn_bwd → attn_bwd. However, because combine_fwd ends with a bda operation, it cannot overlap with post_attn_bwd → attn_bwd.

Furthermore, since the launch of PP_fwd occurs after post_attn_bwd → attn_bwd, PP_fwd cannot be overlapped either.
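To make the launch-order constraint concrete, here is a toy sketch (stand-in kernels, not Megatron-LM code). Under CUDA_DEVICE_MAX_CONNECTIONS=1 the stream a kernel is enqueued on no longer determines when it reaches the GPU; the host-side submission order does, so the communication has to be submitted before the independent computation it is meant to hide under.

```python
import os
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"  # must be set before the CUDA context is created
import torch

comm_stream = torch.cuda.Stream()
comp_stream = torch.cuda.Stream()
a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

# Launch order A: communication first, computation second.
# The comm kernel reaches the single hardware queue first, so the compute
# kernel submitted behind it can still run alongside it.
with torch.cuda.stream(comm_stream):
    a.mul_(2.0)        # stands in for dispatch_fwd (A2A)
with torch.cuda.stream(comp_stream):
    _ = b @ b          # stands in for mlp_bwd
torch.cuda.synchronize()

# Launch order B (the current mlp_bwd -> dispatch_fwd order): the compute
# kernel occupies the queue and the SMs first, so the A2A submitted behind it
# is exposed instead of being hidden.
with torch.cuda.stream(comp_stream):
    _ = b @ b          # mlp_bwd
with torch.cuda.stream(comm_stream):
    a.mul_(2.0)        # dispatch_fwd
torch.cuda.synchronize()
```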

Our Solution

Our solution can be summarized in two steps. First, we split the original combine module into two separate stages, combine and post_combine, thereby decoupling communication from computation within the combine phase. Second, by further adjusting the scheduling logic, we are able to achieve overlap between all A2A communications and PP communications, even when CUDA_DEVICE_MAX_CONNECTIONS=1 is enforced.
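As a rough illustration of the first step (hypothetical class and function names, not the actual code in this PR), the combine stage keeps only the EP all-to-all, while the trailing computation such as the bias-dropout-add moves into a separate post_combine module that can be scheduled on the compute stream:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

class Combine:
    """EP all-to-all only; launched on comm_stream so it can overlap with compute."""
    def forward(self, expert_output: torch.Tensor) -> torch.Tensor:
        # in practice the expert-parallel process group would be passed via `group=`
        gathered = torch.empty_like(expert_output)
        dist.all_to_all_single(gathered, expert_output)
        return gathered

class PostCombine:
    """Computation that used to trail the combine (e.g. the bias-dropout-add),
    now a separate stage scheduled on comp_stream."""
    def __init__(self, dropout_p: float = 0.1):
        self.dropout_p = dropout_p

    def forward(self, combined: torch.Tensor, bias: torch.Tensor,
                residual: torch.Tensor) -> torch.Tensor:
        return F.dropout(combined + bias, p=self.dropout_p) + residual
```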

The newly designed scheduling logic can be summarized as follows:

| Stream | Phase 1 | Phase 2 | Phase 3 | Phase 4 | Phase 5 | Phase 6 |
| --- | --- | --- | --- | --- | --- | --- |
| comm_stream | combine_bwd | dispatch_fwd | dispatch_bwd | combine_fwd | PP_fwd | PP_bwd |
| comp_stream | post_combine_bwd → attn_fwd → post_attn_fwd | mlp_bwd | mlp_fwd | mlp_bwd_dw → post_attn_bwd → post_combine_fwd | attn_bwd | attn_bwd_dw |

The theoretical execution timeline of the overlap optimization is illustrated below:

[Figure: ours, the theoretical execution timeline of the new schedule]

Evaluation

Finally, we evaluated the performance of the DeepSeek-V3.1 model on 64 Hopper-architecture GPUs using this optimization. With DeepEP optimization and A2A overlap enabled, and with CUDA_DEVICE_MAX_CONNECTIONS=1 set, our approach achieves more than a 10% performance improvement in our target scenario compared to the current A2A overlap implementation in Megatron-LM.

Nsight Systems Comparison

Before Optimization (Nsight Systems):

[Figure: image-2, Nsight Systems trace before optimization]

As shown above, the actual Nsight Systems trace matches the previously illustrated behavior of the current implementation. The overlap logic is misaligned, resulting in exposed (non-overlapped) communication phases for both combine_fwd and PP_fwd.

After Optimization (Nsight Systems):

[Figure: image-1, Nsight Systems trace after optimization]

With our optimization applied, all A2A communications and PP communications are successfully overlapped.

Summary

Overall, our optimization provides a more effective All-to-All (A2A) overlap solution for scenarios in which CUDA_DEVICE_MAX_CONNECTIONS=1 must be enforced. In other words, our approach makes A2A overlap optimization no longer heavily dependent on setting CUDA_DEVICE_MAX_CONNECTIONS=32.

@Baidu-AIAK Baidu-AIAK requested review from a team as code owners December 21, 2025 12:55

copy-pr-bot bot commented Dec 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Baidu-AIAK Baidu-AIAK changed the title [Dev][MoE] Refine A2A overlap under CUDA_DEVICE_MAX_CONNECTIONS=1 [Dev](moe):Refine A2A overlap under CUDA_DEVICE_MAX_CONNECTIONS=1 Dec 21, 2025
@BestJuly BestJuly requested a review from Wohox December 23, 2025 00:25
@Baidu-AIAK (Author)

Hi @Wohox. I'd really appreciate any feedback when you have time, especially if there are design or implementation concerns I should address. Thanks a lot!

@yaox12 yaox12 added the Expert Review label Jan 4, 2026
f_input = f_layer.post_combine.forward(f_input)
f_input = f_layer.mtp_post_process.forward(f_input)

if is_last_layer:
Contributor:

post_forward and post_backward shouldn't be called in the layer schedule; you can wait for the current stream in the chunk schedule.
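For reference, a rough sketch (hypothetical names, only to illustrate the suggestion) of keeping post_forward out of the per-layer schedule and synchronizing once at the chunk level:

```python
import torch

def chunk_forward(layers, f_input, comm_stream, post_forward=None):
    # Per-layer schedule: no post_forward / post_backward calls here.
    for layer in layers:
        f_input = layer.forward(f_input)
    # Chunk-level schedule: wait once for the side stream, then run the
    # post-processing on the current (default) stream.
    torch.cuda.current_stream().wait_stream(comm_stream)
    if post_forward is not None:
        f_input = post_forward(f_input)
    return f_input
```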

b_grad = b_layer.attn.backward(b_grad)

if is_last_layer:
    if b_schedule_plan is not None and post_backward is not None:
Contributor:

Same as above

is_last_layer_in_bwd (bool):
Whether the current layer is the last layer in the backward pass.
fine_grained_overlap (bool):
Contributor:

  • Can you run experiments to show that split combine is always at least as efficient as the original design (I think so, even if max conn = 32)? If so, there is no need to distinguish between the two.

@Baidu-AIAK (Author) commented Jan 8, 2026

Our original intention was to make minimally invasive changes to the original code and provide an incremental optimization. As you pointed out, if our approach can achieve performance that is no worse than the existing Megatron-LM implementation when CUDA_DEVICE_MAX_CONNECTIONS = 32, it would be reasonable to appropriately merge the two to reduce code redundancy. After careful consideration, we have revised the implementation accordingly and provide the following theoretical analysis and experimental results. If there are any edge cases that we have not considered, we would greatly appreciate your feedback.

When CUDA_DEVICE_MAX_CONNECTIONS = 1, our implementation outperforms the existing Megatron-LM implementation; theoretical analysis and experimental evidence have already been provided in the PR description.

When CUDA_DEVICE_MAX_CONNECTIONS = 32:

  • Theoretical analysis

From a theoretical perspective, the computation–communication overlap execution timelines of the Megatron-LM implementation and ours are shown below. Although some execution orders differ, the overall effects are equivalent.

  1. Megatron-LM:
     [Figure: base-Theoretical, the Megatron-LM timeline with CUDA_DEVICE_MAX_CONNECTIONS=32]
  2. Ours:
     [Figure: our theoretical timeline with CUDA_DEVICE_MAX_CONNECTIONS=32]
  • Experimental results

We conducted experiments on DeepSeek-V3 with different configurations, including EP16 and EP32. The results show that the performance of our approach is almost identical to that of the existing Megatron-LM implementation, and in some cases even slightly better (as shown in item 3 below).

  • Additional observations

Finally, we observed that in certain cases, even with CUDA_DEVICE_MAX_CONNECTIONS = 32, our approach outperforms the Megatron-LM implementation. As shown in the two figures below, because the combine_fwd communication in our approach is launched relatively earlier, it can exhibit better robustness in scenarios where CPU overhead is present.

  1. Megatron-LM:
    [Figure: image-mg]

  2. Ours:
    [Figure: image-ours]

fine_grained_overlap (bool):
Enable fine-grained communication / computation overlap
post_forward (callable or None):
Contributor:

Please remove post_forward, post_backward, f_schedule_plan and b_schedule_plan from function signature.

# release tensor references after use
if shared_expert_output is not None:
    shared_expert_output.untyped_storage().resize_(0)
Contributor:

Wondering what's special about shared_expert_output here, why can't it be deallocated like other detached tensors?

@lhb8125 (Contributor) commented Jan 4, 2026

Whether to set CUDA_DEVICE_MAX_CONNECTIONS to 1 or 32 depends on which communication is the most prominent factor when TP comm and EP comm both exist. The chance of overlapping TP comm is reduced but not eliminated when setting it to 32 at TP=2, and in many cases EP comm is the most prominent factor, so we think setting the env to 32 should be good as of now.

Could you perform some experiments on the DSV3 model to show the necessity of this PR?

@Baidu-AIAK (Author)

@lhb8125 Thank you for your response. Indeed, in most cases, setting CUDA_DEVICE_MAX_CONNECTIONS to 32 can achieve very good overlap for EP communication.

However, this PR targets memory-constrained post-training scenarios with long sequences on a relatively small number of GPUs. For example, when training DeepSeek-V3 with long sequences (32K, 64K, etc.) on 128 GPUs using a parallel strategy of TP=8 and EP=16, TP communication accounts for about 5–6% of the runtime, while EP communication accounts for roughly 10%. In such cases, we usually aim to overlap both TP and EP communications to achieve optimal performance. Our approach provides an optimization opportunity for this specific scenario.

In addition, we have observed that setting CUDA_DEVICE_MAX_CONNECTIONS to 32 can conflict with certain optimizations, potentially leading to training instability, such as gradient norm becoming NaN.

We are preparing the required machines and will supplement this PR with additional experimental results soon to further demonstrate the necessity of this approach.

@chtruong814 chtruong814 added the needs-follow-up label Jan 11, 2026