Add AutoEP + AutoTP parallel folding by tohtana · Pull Request #8064 · deepspeedai/DeepSpeed

tohtana · 2026-06-13T17:43:43Z

This PR adds parallel folding for AutoEP: tensor parallelism (AutoTP) for the dense/attention path can now coexist with expert parallelism (AutoEP) for the routed-expert path on the same set of ranks, without forcing EP to be a subset of DP.
(This PR should be adjusted for ZeRO-3 support after #8060 is merged.)

Design

Attention/dense and MoE are treated as two independent partitionings of the same rank set, parameterized per parameter family:

Dense / attention / shared-expert params: stage_size = tp * dp
Routed-expert params: stage_size = ep * etp * edp

dp and edp are always derived, never user-configured, so the invariant tp * dp == ep * etp * edp == stage_size cannot be broken from config.
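
The arithmetic behind this invariant can be sketched as below. The helper derive_folded_sizes and its argument names are hypothetical and are not DeepSpeed API; the sketch only shows how dp and edp fall out of stage_size once tp, ep, and etp are fixed.

```python
def derive_folded_sizes(stage_size: int, tp: int, ep: int, etp: int = 1) -> dict:
    """Derive dp and edp so that tp * dp == ep * etp * edp == stage_size.

    Hypothetical helper for illustration only; not part of DeepSpeed.
    """
    if stage_size % tp != 0:
        raise ValueError(f"stage_size={stage_size} is not divisible by tp={tp}")
    if stage_size % (ep * etp) != 0:
        raise ValueError(f"stage_size={stage_size} is not divisible by ep*etp={ep * etp}")
    dp = stage_size // tp            # dense view: tp * dp
    edp = stage_size // (ep * etp)   # expert view: ep * etp * edp
    return {"tp": tp, "dp": dp, "ep": ep, "etp": etp, "edp": edp}


sizes = derive_folded_sizes(stage_size=8, tp=2, ep=4)
print(sizes)
```

For the TP2 × EP4 shape on 8 ranks, this gives dp = 4 and edp = 2, so both views cover the same 8 ranks.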

Configuration

No new config section. Folding is expressed by the coexistence of the existing tensor_parallel and expert_parallel sections:

{
  "tensor_parallel":  { "autotp_size": 2 },
  "expert_parallel":  { "enabled": true, "autoep_size": 4,
                        "expert_tensor_parallel_size": 1 }
}

expert_tensor_parallel_size is carried as a config field but must currently be 1: expert-internal TP is reserved for a follow-up, and any other value is rejected fail-fast. Validation enforces divisibility, TP/sequence-parallel exclusivity, and preset_model consistency between the two sections.
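
A minimal sketch of the fail-fast and divisibility checks on a config dict of this shape. The function validate_folding_config and its stage_size argument are assumptions made for illustration; the actual validation in this PR also covers TP/sequence-parallel exclusivity and preset_model consistency, which are not modeled here.

```python
def validate_folding_config(config: dict, stage_size: int) -> None:
    """Reject folding configs that violate the constraints described above.

    Hypothetical sketch; the real DeepSpeed validation may differ.
    """
    tp = config.get("tensor_parallel", {}).get("autotp_size", 1)
    ep_cfg = config.get("expert_parallel", {})
    if not ep_cfg.get("enabled", False):
        return  # nothing to fold without expert parallelism
    ep = ep_cfg.get("autoep_size", 1)
    etp = ep_cfg.get("expert_tensor_parallel_size", 1)
    if etp != 1:
        raise ValueError("expert_tensor_parallel_size must currently be 1")
    if stage_size % tp != 0:
        raise ValueError(f"stage_size={stage_size} is not divisible by autotp_size={tp}")
    if stage_size % (ep * etp) != 0:
        raise ValueError(f"stage_size={stage_size} is not divisible by autoep_size={ep}")


config = {
    "tensor_parallel": {"autotp_size": 2},
    "expert_parallel": {"enabled": True, "autoep_size": 4,
                        "expert_tensor_parallel_size": 1},
}
validate_folding_config(config, stage_size=8)
print("config accepted")
```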

What's included

Folded process-group derivation using the generalized expert/data-parallel group creation (mp_mode TP-strided vs SP-consecutive ordering).
Route-full / partition-dispatch path for folded MoE (deepspeed/moe/ep_tp_dispatch.py), with AutoTP skipping AutoEP subtrees.
Mode-aware TP-replicated gradient reduction for router/gate params: summed when the parallelism mode partitions tokens, averaged when tokens are replicated — matching standard sequence-parallel / tensor-parallel gradient semantics.
Per-parameter-family ZeRO checkpoint metadata (routed-expert vs dense/router/shared placement) and folded ZeRO-1/2 optimizer-state handling.
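
The mode-aware reduction rule for TP-replicated router/gate gradients can be illustrated without any distributed runtime. In the sketch below, per-lane gradients are plain Python lists, and reduce_tp_replicated_grad is a hypothetical name; the real code reduces tensors over a TP process group.

```python
from typing import List


def reduce_tp_replicated_grad(per_lane_grads: List[List[float]],
                              tokens_partitioned: bool) -> List[float]:
    """Combine gradients of a TP-replicated parameter across TP lanes.

    Sum when each lane saw a different token partition (partial gradients);
    average when every lane saw the same tokens (identical full gradients).
    Hypothetical illustration, not the DeepSpeed implementation.
    """
    lanes = len(per_lane_grads)
    summed = [sum(vals) for vals in zip(*per_lane_grads)]
    if tokens_partitioned:
        return summed
    return [v / lanes for v in summed]


# Tokens partitioned: each lane holds a partial gradient, so they add up.
partitioned = reduce_tp_replicated_grad([[1.0, 2.0], [3.0, 4.0]], tokens_partitioned=True)
# Tokens replicated: each lane holds the same gradient, so averaging preserves it.
replicated = reduce_tp_replicated_grad([[1.0, 2.0], [1.0, 2.0]], tokens_partitioned=False)
print(partitioned, replicated)
```

Summing in the replicated case would scale the gradient by the number of lanes, and averaging in the partitioned case would shrink it by the same factor, which is why the mode has to be known at reduction time.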

Correctness & validation

Router/gate gradient parity against a non-folded ZeRO baseline on a TP2 × EP4 (8-GPU) shape: folded gradient matches baseline to ~1e-7 (scale 1.0).
New folding unit tests for config, group layout, dispatch, runtime, gradient parity, and checkpoint save/load (multi-rank cases gated for GPU runners).
Passes the full unit test suite (aws-torch-latest-full) on H100 GPUs.

Scope / follow-ups

This PR covers AutoEP + AutoTP folding. The replicated-grad reduction is mode-aware so the sequence-parallel (Ulysses) folding case fits the same contract; AutoTP + AutoEP is the validated path here.
Expert-internal tensor parallelism (expert_tensor_parallel_size > 1) is reserved for a follow-up.
ZeRO-3 composition with folding is planned as separate follow-up work. (It should be done after #8060, "Support AutoEP with ZeRO-3 zero.Init source modules", is merged.)

Allow tensor parallelism (AutoTP) for the dense/attention path to coexist with expert parallelism (AutoEP) for routed experts on the same rank set, without requiring EP to be a subset of DP.

- Treat dense and MoE as independent partitionings: dense view tp*dp, expert view ep*etp*edp, with dp/edp derived so tp*dp == ep*etp*edp == stage_size. expert_tensor_parallel_size is reserved (must currently be 1).
- Express folding via the existing tensor_parallel/expert_parallel config sections, with divisibility, TP/sequence-parallel exclusivity, and preset_model consistency validation.
- Add the route-full / partition-dispatch MoE path and AutoTP skipping of AutoEP subtrees; derive folded process groups via the generalized expert/data-parallel group creation.
- Reduce TP-replicated router/gate gradients mode-aware (sum when tokens are partitioned, average when replicated); record per-parameter-family ZeRO checkpoint metadata and handle folded ZeRO-1/2 optimizer state.
- Add folding unit tests (config, groups, dispatch, runtime, gradient parity, checkpoint), including multi-rank GPU-gated cases.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 278c919489

chatgpt-codex-connector · 2026-06-13T17:51:03Z

+        chunks = torch.split(grad_output, ctx.counts, dim=0)
+        grad_padded = grad_output.new_zeros((ctx.max_rows, *grad_output.shape[1:]))
+        if local_count:
+            grad_padded[:local_count].copy_(chunks[ctx.group_rank])
+        return grad_padded[:local_count].contiguous(), None, None, None


Sum gathered-row gradients across TP lanes

When folded MoE output is consumed differently on each TP lane (for example by a row-parallel/lm-head layer that slices the hidden dimension), every gathered row participates in the loss on every lane. This backward path only returns chunks[ctx.group_rank] from the local rank's grad_output, so contributions from peer lanes to this rank's local expert outputs and routing weights are dropped; the padded local gradient needs to be accumulated across ctx.group before returning.
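
The concern can be reproduced with a small simulation. The sketch below uses one scalar per gathered row and plain lists in place of tensors; local_only_backward mimics the reviewed path, and reduce_scatter_backward is the usual adjoint of an all-gather. Both function names are hypothetical, and whether the summed form is required depends on the lanes actually consuming the gathered rows differently.

```python
from typing import List


def local_only_backward(grad_outputs: List[List[float]], counts: List[int], rank: int) -> List[float]:
    """Slice this rank's rows out of its own grad_output only (the reviewed behavior)."""
    start = sum(counts[:rank])
    return grad_outputs[rank][start:start + counts[rank]]


def reduce_scatter_backward(grad_outputs: List[List[float]], counts: List[int], rank: int) -> List[float]:
    """Sum this rank's rows over every lane's grad_output (adjoint of all-gather)."""
    start = sum(counts[:rank])
    rows = [g[start:start + counts[rank]] for g in grad_outputs]
    return [sum(vals) for vals in zip(*rows)]


counts = [2, 1]                      # rank 0 contributed 2 rows, rank 1 contributed 1
grad_outputs = [
    [1.0, 2.0, 3.0],                 # lane 0's gradient w.r.t. all gathered rows
    [10.0, 20.0, 30.0],              # lane 1's gradient w.r.t. all gathered rows
]
dropped = local_only_backward(grad_outputs, counts, rank=0)
complete = reduce_scatter_backward(grad_outputs, counts, rank=0)
print(dropped, complete)
```

With these numbers, rank 0 keeps only its own lane's contribution in the first form, while the second form also includes lane 1's gradient for the same rows.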

chatgpt-codex-connector · 2026-06-13T17:51:03Z

        grad_reduc = self.get_gradient_for_reduction(param)
+        self._maybe_reduce_autoep_folding_tp_gradient(param, grad_reduc)


Honor ds_grad_is_ready before TP reduction

In ZeRO-2 folded runs, parameters with ds_grad_is_ready=False are intentionally skipped until their transient/tiled gradient is complete, as the guard immediately below documents. Calling the new TP reduction before that guard mutates and all-reduces incomplete gradients for those parameters, which can corrupt the final accumulated gradient once the ready shard is eventually reduced.
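
The ordering being requested can be sketched as a readiness check that runs before any TP reduction. FakeParam and reduce_ready_gradient are hypothetical stand-ins for illustration; only the ds_grad_is_ready attribute name comes from the comment above, and the default of True for parameters without the attribute is an assumption.

```python
class FakeParam:
    """Hypothetical stand-in for a parameter carrying a readiness flag."""

    def __init__(self, name: str, ds_grad_is_ready: bool = True):
        self.name = name
        self.ds_grad_is_ready = ds_grad_is_ready


def reduce_ready_gradient(param, tp_reduce) -> bool:
    """Call tp_reduce only once the parameter's gradient is complete.

    Returns True if the reduction ran, False if it was skipped.
    """
    if not getattr(param, "ds_grad_is_ready", True):
        return False  # gradient is still being accumulated; do not touch it
    tp_reduce(param)
    return True


reduced = []
params = [FakeParam("gate.weight"), FakeParam("tiled.weight", ds_grad_is_ready=False)]
results = [reduce_ready_gradient(p, lambda q: reduced.append(q.name)) for p in params]
print(results, reduced)
```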

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

delock · 2026-06-18T09:47:14Z

+                         "is planned as follow-up work.")
+
+    expert_width = spec.ep_size * spec.etp_size
+    if spec.tp_size > 1 and expert_width > spec.dp_size:


What will happen if expert_width % spec.dp_size != 0?

delock · 2026-06-19T05:09:43Z

+            if not param.requires_grad or param.grad is None:
+                continue
+            if is_moe_param(param) or is_model_parallel_parameter(param):
+                continue


This filter cannot distinguish router grads from layernorm grads. Router grads need SUM because tokens are dispatched to experts, but layernorm grads do not, so the two need different reduction strategies.

Here is a test; I verified on a CPU system that it does not pass.

# Copyright (c) DeepSpeed Team.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
"""Engine-path (zero_stage=0) parity: does folded layernorm/gate match the
non-folded DP baseline in the FULL flow? (The author only tested ZeRO-2 parity;
the engine path that runs at zero_stage=0/1 is untested.) CPU/Gloo, world=8.
"""
import deepspeed

from unit.v1.moe.autoep_test_utils import make_autoep_config, run_cpu_gloo_test, seed_everything
from unit.v1.moe.test_autoep_autotp_grad_parity import (
    _router_grad_model,
    _run_router_grad_boundary,
    _full_grad_by_suffix,
)

GATE_BASELINE = "model.layers.0.mlp.gate.weight"
GATE_FOLDED = "model.layers.0.mlp.router.gate.weight"
LN = "model.layers.0.input_layernorm.weight"


def _baseline_cfg():
    c = {
        k: v
        for k, v in make_autoep_config(zero_stage=0, ep_size=1, mixed_precision=False).items()
        if k != "expert_parallel"
    }
    c["gradient_accumulation_steps"] = 2
    c["gradient_clipping"] = 0.0
    c["communication_data_type"] = "fp32"
    c["optimizer"]["params"]["torch_adam"] = True
    return c


def _folded_cfg():
    c = make_autoep_config(zero_stage=0, ep_size=4, mixed_precision=False)
    c["gradient_accumulation_steps"] = 2
    c["gradient_clipping"] = 0.0
    c["communication_data_type"] = "fp32"
    c["optimizer"]["params"]["torch_adam"] = True
    c["expert_parallel"]["autoep_size"] = 4
    c["tensor_parallel"] = {
        "autotp_size": 2,
        "partition_config": {
            "use_default_specs": False,
            "layer_specs": [{"patterns": [r".*\.weight$"], "partition_type": "skip"}],
        },
    }
    return c


def _worker(rank, world_size, tmpdir):
    seed = 1234
    tp_size = 2
    logical_dp_world_size = world_size // tp_size
    logical_dp_rank = rank // tp_size

    seed_everything(seed)
    reference_state = _router_grad_model().state_dict()

    baseline_model = _router_grad_model()
    baseline_model.load_state_dict(reference_state)
    baseline_engine, *_ = deepspeed.initialize(model=baseline_model, config=_baseline_cfg())
    _run_router_grad_boundary(baseline_engine,
                              logical_dp_world_size=logical_dp_world_size,
                              logical_dp_rank=logical_dp_rank,
                              seed=seed)
    base_gate = _full_grad_by_suffix(baseline_engine, GATE_BASELINE)
    base_ln = _full_grad_by_suffix(baseline_engine, LN)

    folded_model = _router_grad_model()
    folded_model.load_state_dict(reference_state)
    folded_engine, *_ = deepspeed.initialize(model=folded_model, config=_folded_cfg())
    _run_router_grad_boundary(folded_engine,
                              logical_dp_world_size=logical_dp_world_size,
                              logical_dp_rank=logical_dp_rank,
                              seed=seed)
    folded_gate = _full_grad_by_suffix(folded_engine, GATE_FOLDED)
    folded_ln = _full_grad_by_suffix(folded_engine, LN)

    gate_ratio = (folded_gate.norm() / base_gate.norm()).item()
    ln_ratio = (folded_ln.norm() / base_ln.norm()).item()
    print(f"[rank {rank}] ENGINE(zero0) gate_ratio={gate_ratio:.4f} ln_ratio={ln_ratio:.4f}")
    if rank == 0:
        assert abs(gate_ratio - 1.0) <= 5e-3, f"gate parity: {gate_ratio}"
        assert abs(ln_ratio - 1.0) <= 5e-3, f"ln parity: {ln_ratio}"


def test_b1_engine_path_parity(tmpdir):
    run_cpu_gloo_test(_worker, tmpdir, world_size=8)

tohtana requested review from GuanhuaWang, hwchen2017, loadams and tjruwase as code owners June 13, 2026 17:43

tohtana added 2 commits June 13, 2026 11:44

Fix folded TP gradient reductions

0d44a1d

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Normalize folded TP ZeRO gradients

8b1c042

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

PKUWZP self-requested a review June 14, 2026 02:16

tohtana wants to merge 3 commits into deepspeedai:master from tohtana:tohtana/autoep-autotp-parallel-folding-design
