[Feature]【Hackathon 10th Spring No.47】Add MiniMax-M1 model support #6994
Open
cloudforge1 wants to merge 14 commits into PaddlePaddle:develop from
Conversation
- Model scaffold: minimax_m1.py with hybrid attention (70 linear + 10 full GQA), MoE (32 experts, top-2), DeepNorm scaling, weight loading
- Lightning Attention: 5 Triton JIT kernels + 3 Python wrappers
- Tests: 27 pytest cases covering attention dispatch, slope construction, registration, layer construction, and forward-pass smoke tests
- Docs: EN/CN best practices + supported-models list updates
- Architecture: MiniMaxText01ForCausalLM (456B MoE, 80 layers)
…ment load_weights
- LinearAttention: add output_gate (sigmoid gating), norm (RMSNorm), rename o_proj → out_proj. Forward: SiLU on QKV → lightning_attn → norm → gate → out_proj
- DecoderLayer: rename self.mlp → self.block_sparse_moe to match the HF config
- DeepNorm: branch alpha/beta on attention_type (linear vs full)
- Postnorm: add two code paths following the vLLM reference
- KV state: persist _kv_history across forward calls
- Dual registration: MiniMaxM1ForCausalLM + MiniMaxText01ForCausalLM
- set_state_dict: preprocess HF keys (w1→gate_proj, w3→up_proj, w2→down_proj, q/k/v→qkv_proj concatenation; sketched below)
- load_weights: v1 loader with stacked_params_mapping + expert_params_mapping
- Tests: 29/29 passing
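A minimal sketch of the kind of HF-key preprocessing this commit describes. The rename pairs and the q/k/v merge come from the commit message; the function names, regex patterns, and concatenation axis are illustrative assumptions, not the loader's actual code:

```python
# Illustrative v0-loader key preprocessing; only the rename pairs and the
# q/k/v merge are taken from the commit message -- everything else is assumed.
import re
import numpy as np

_RENAMES = [
    (r"\.w1\.", ".gate_proj."),  # w1 -> gate_proj
    (r"\.w3\.", ".up_proj."),    # w3 -> up_proj
    (r"\.w2\.", ".down_proj."),  # w2 -> down_proj
]

def preprocess_hf_key(key: str) -> str:
    """Map a HuggingFace checkpoint key to the FastDeploy naming scheme."""
    for pattern, repl in _RENAMES:
        key = re.sub(pattern, repl, key)
    return key

def merge_qkv(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Concatenate separate q/k/v weights into one qkv_proj tensor.
    The axis depends on the weight layout; axis=-1 assumes
    [in_features, out_features] linear weights."""
    return np.concatenate([q, k, v], axis=-1)
```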
Thanks for your contribution!
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #6994   +/-  ##
=========================================
  Coverage         ?   73.50%
=========================================
  Files            ?      401
  Lines            ?    56603
  Branches         ?     8890
=========================================
  Hits             ?    41607
  Misses           ?    12064
  Partials         ?     2932
```
- Quantization-aware weight_key_map in MiniMaxM1MoE (w4a8, w4afp8 static/dynamic, tensor_wise_fp8, block_wise_fp8) mirroring Ernie4_5_MoE (branching sketched below)
- Gate layer uses skip_quant=True, weight_dtype='float32'
- set_state_dict v0 loader: quant-aware regex for expert weights (.quant_weight, .weight_scale, .activation_scale)
- set_state_dict v0 loader: quant-aware qkv merge (suffix-keyed buffers)
- 3 new tests: default/w4a8/w4afp8-dynamic weight_key_map branches
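Roughly how the quant-aware branching might look. This is a hypothetical sketch: every dict key name and template below is an assumption; only the tensor suffixes (.quant_weight, .weight_scale, .activation_scale) and the quant-method names come from this PR:

```python
# Hypothetical sketch of quant-aware weight_key_map branching; see
# minimax_m1.py for the real key templates.
from typing import Optional

def build_weight_key_map(prefix: str, moe_quant_type: Optional[str]) -> dict:
    if moe_quant_type in ("w4a8", "w4afp8"):
        # Quantized experts carry scale tensors alongside the packed weights.
        return {
            "expert_weight_key": f"{prefix}.experts.{{}}.quant_weight",
            "expert_weight_scale_key": f"{prefix}.experts.{{}}.weight_scale",
            "expert_in_scale_key": f"{prefix}.experts.{{}}.activation_scale",
        }
    # Default (unquantized) path: plain weights only.
    return {"expert_weight_key": f"{prefix}.experts.{{}}.weight"}
```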
Motivation
This PR adds support for deploying the MiniMaxAI/MiniMax-M1-40k model family (456B total parameters, 45.9B active, MoE) in FastDeploy, as required by Hackathon 10th Spring No.47.
MiniMax-M1 is a hybrid-attention Mixture-of-Experts LLM with:
- 80 decoder layers: 70 lightning (linear) attention layers + 10 full GQA layers
- MoE feed-forward with 32 experts and top-2 routing
- Dual registered architectures: MiniMaxM1ForCausalLM and MiniMaxText01ForCausalLM

Design document: community#1252
Reference approved RFC: community#1156 (@NKNaN)
Modifications
Model Code (fastdeploy/model_executor/models/minimax_m1.py, ~800 lines)
9 classes implementing the full model:
- MiniMaxM1MLP: Gate/up merged projection with SiLU activation
- MiniMaxM1MoE: FusedMoE with 32 experts, top-2 routing, renormalize=True, quantization-aware weight_key_map (w4a8, w4afp8 static/dynamic, tensor_wise_fp8, block_wise_fp8)
- MiniMaxM1FullAttention: Standard GQA with RoPE, used in 10 out of 80 layers
- MiniMaxM1LinearAttention: Lightning attention with SiLU-gated QKV, output_gate (sigmoid), RMSNorm, persistent KV state history. Forward: SiLU(QKV) → lightning_attn → RMSNorm → sigmoid(gate) × hidden → out_proj (see the sketch after this list)
- MiniMaxM1DecoderLayer: Dispatches to linear/full attention based on attn_type_list, DeepNorm scaling with separate alpha/beta per attention type, postnorm support
- MiniMaxM1Model: Full transformer with embedding and final RMSNorm
- MiniMaxM1ForCausalLM: Causal LM wrapper with dual weight loading:
  - set_state_dict (v0 loader): HF key preprocessing (w1→gate_proj, w3→up_proj, w2→down_proj, q/k/v→qkv_proj concatenation)
  - load_weights (v1 loader): stacked_params_mapping + FusedMoE.make_expert_params_mapping
- MiniMaxM1PretrainedModel: Tensor parallel column/row split mappings
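To make the MiniMaxM1LinearAttention data flow concrete, a minimal NumPy sketch of the forward chain above (helper names and shapes are illustrative, not the module's actual API):

```python
# NumPy sketch of the gated linear-attention forward described above:
# SiLU(QKV) -> lightning_attn -> RMSNorm -> sigmoid(gate) * hidden -> out_proj.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def linear_attention_forward(hidden, w_qkv, w_gate, w_out, lightning_attn):
    qkv = silu(hidden @ w_qkv)                       # SiLU-activated QKV projection
    q, k, v = np.split(qkv, 3, axis=-1)
    attn = rms_norm(lightning_attn(q, k, v))         # O(n) attention, then RMSNorm
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_gate)))  # sigmoid output gate
    return (gate * attn) @ w_out                     # gated output projection
```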
Lightning Attention Kernels (fastdeploy/model_executor/ops/triton_ops/lightning_attn.py, 711 lines)
Triton kernels for O(n) linear attention with exponential decay (a reference sketch of the recurrence follows the kernel list):
- _fwd_kernel: Intra-block attention with causal masking and decay factors
- _fwd_kv_kernel: Inter-block KV state accumulation with block-level decay
- lightning_attention(): Python wrapper dispatching to Triton with automatic block-size selection, dtype management, and KV history persistence
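The recurrence these kernels block-parallelize, as a pure-NumPy reference in the spirit of the test suite's reference implementation (per-head, single sequence; names are illustrative):

```python
# Pure-NumPy reference for decayed linear attention: O(n) in sequence length,
# causal by construction. The Triton kernels compute the same result blockwise
# (_fwd_kernel intra-block, _fwd_kv_kernel inter-block state with decay).
import numpy as np

def lightning_attention_ref(q, k, v, slope, kv_state=None):
    """q, k: [seq_len, d_k]; v: [seq_len, d_v]; slope: decay rate > 0."""
    d_k, d_v = q.shape[-1], v.shape[-1]
    kv = np.zeros((d_k, d_v)) if kv_state is None else kv_state
    decay = np.exp(-slope)                       # per-step exponential decay
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        kv = decay * kv + np.outer(k[t], v[t])   # decayed KV accumulation
        out[t] = q[t] @ kv                       # query the running state
    return out, kv                               # kv can persist as history
```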
Documentation
- docs/best_practices/MiniMax-M1.md + docs/zh/best_practices/MiniMax-M1.md: Bilingual usage guide with deployment examples
- docs/supported_models.md + docs/zh/supported_models.md: Added MiniMax-M1 to the LLM model table
Design Decisions
- The linear attention forward follows the vLLM MiniMaxText01LinearAttention reference exactly
- The block_sparse_moe attribute name matches the HF config convention (not mlp)
Usage or Command
See docs/best_practices/MiniMax-M1.md for the full deployment guide.
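For a quick smoke test, something along these lines should work, assuming FastDeploy exposes its vLLM-style offline API (fastdeploy.LLM / SamplingParams); the model path, arguments, and output structure below are assumptions, so defer to the best-practices doc:

```python
# Hypothetical offline-inference smoke test. The fastdeploy.LLM offline API
# and every argument here are assumptions -- consult
# docs/best_practices/MiniMax-M1.md for the supported invocation.
from fastdeploy import LLM, SamplingParams

llm = LLM(model="MiniMaxAI/MiniMax-M1-40k", tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain lightning attention in one sentence."], params)
print(outputs[0])  # inspect the returned completion object
```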
Accuracy Tests
Unit Tests (32/32 passed — CI verified on H20 GPU)
tests/model_executor/test_minimax_m1.py (390 lines, 8 classes, 32 tests):
- TestLightningAttentionPurePython (4 tests): Reference NumPy implementation, block-size sweep, multi-head, KV history persistence
- TestMoEConstruction (2 tests): Expert count, gate+experts construction
- TestBuildSlopeTensor (3 tests): Exponential decay slopes for power-of-2 and non-power-of-2 head counts (recipe sketched below)
- TestModelRegistration (4 tests): Dual architecture registration (MiniMaxM1ForCausalLM + MiniMaxText01ForCausalLM)
- TestDecoderLayerConstruction (9 tests): Linear/full attention dispatch, MoE vs dense MLP, postnorm config, fallback attention type, quantization weight_key_map (default/w4a8/w4afp8-dynamic)
- TestDecoderLayerForward (5 tests): Forward shape validation, DeepNorm scaling, postnorm code path
- TestFullModelConstruction (3 tests): Full model assembly, layer count, embedding dimensions
- TestPretrainedModelMappings (2 tests): Tensor parallel split mappings
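The decay slopes exercised by TestBuildSlopeTensor follow the standard ALiBi-style recipe; a NumPy version of that recipe (assumed logic, since the actual helper lives in minimax_m1.py and may differ in detail):

```python
# ALiBi-style decay-slope construction for n heads; non-power-of-2 head
# counts borrow interleaved slopes from the doubled power-of-2 count.
import math
import numpy as np

def build_slopes(n_heads: int) -> np.ndarray:
    def pow2_slopes(n):
        # Geometric sequence: ratio equals the first slope.
        start = 2.0 ** (-(2.0 ** -(math.log2(n) - 3)))
        return [start ** (i + 1) for i in range(n)]
    if math.log2(n_heads).is_integer():
        return np.array(pow2_slopes(n_heads))
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = pow2_slopes(2 * closest)[0::2][: n_heads - closest]
    return np.array(pow2_slopes(closest) + extra)
```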
CI Results (commit e068f01)
- run_tests_with_coverage: 32/32 tests passed in 5.37s (with coverage) + 1.76s (standalone) on H20 GPU
- Pre-existing failure in tests/distributed/test_hopper_ll_precision_entry.py (not caused by this PR)
All hooks passing: black, isort, flake8, ruff, clang-format, merge conflict check, trailing whitespace, large file check.
Checklist
- Model code (minimax_m1.py, ~800 lines) — 9 classes with full weight loading + quantization support
- Lightning attention Triton kernels (lightning_attn.py, 711 lines) — O(n) linear attention
- Both v0 (set_state_dict) and v1 (load_weights) loader paths implemented
- Dual architecture registration: MiniMaxM1ForCausalLM + MiniMaxText01ForCausalLM