feat: add Qwen3-MoE and OLMoE on the shared MoE FFN primitive (#563)

Merged: inureyes merged 1 commit into main from feature/issue-501-moe-family-coverage on Jul 1, 2026

@inureyes (Member) commented on Jul 1, 2026

Summary

Extend the mixture-of-experts family coverage (issue #501) on top of the shared MoE FFN primitive (#500) and the qk-norm attention core (#497). Qwen3-MoE and OLMoE both route exactly as the primitive already emits (softmax over all experts, top-k selection, then norm_topk_prob renormalization) and have no shared expert, so they land as a config plus weight-naming delta on the existing emit. Families whose routing or attention the shared core does not provide are deferred with a specific message rather than mis-emitted.
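
For readers unfamiliar with the routing described above, here is a minimal pure-Python sketch of it for a single token. It is not the emitter's code: the function name `route`, its signature, and the tie-breaking rule (lower expert index wins) are assumptions made for illustration only.

```python
import math


def route(logits, top_k, norm_topk_prob=True):
    """Pick top_k experts for one token and return (indices, weights).

    Hypothetical helper: the name, signature, and tie-breaking rule are
    illustrative assumptions, not taken from the PR.
    """
    # Softmax over ALL experts, not only the ones that end up selected.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k by probability; ties go to the lower expert index (assumption).
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    indices = order[:top_k]
    weights = [probs[i] for i in indices]

    # Optional renormalization so the selected weights sum to 1.
    if norm_topk_prob:
        selected_sum = sum(weights)
        weights = [w / selected_sum for w in weights]
    return indices, weights


if __name__ == "__main__":
    print(route([1.0, 3.0, 2.0, 0.0], top_k=2))
```

With `norm_topk_prob=False` the selected weights keep their softmax values and generally sum to less than 1, which is why the flag has to be carried in the config rather than assumed.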

What changed

  • emitter/config.rs: Config::from_json now parses qwen3_moe (per-head q/k RMSNorm attention, no bias, experts sized by moe_intermediate_size) and olmoe (flat q/k RMSNorm on the standard pre-norm block, experts sized by intermediate_size; configs that set clip_qkv are rejected for now, with support left to a follow-up), each building a MoeConfig with no shared expert. DeepSeek-V2/V3, PhiMoE, GLM4-MoE, dots1, ERNIE-4.5-MoE, and gpt-oss are rejected with a blocker-specific message.
  • weights.rs: a weight_specs test locks the new families' expert-bank naming on the mlp prefix (mlp.gate.weight, mlp.switch_mlp.{gate,up,down}_proj.weight), matching the mlx-lm switch_mlp checkpoint layout, with no shared-expert / dense-MLP / bias tensors. The existing generic MoE weight_specs needed no change.
  • emitter/mod.rs: replaced the outdated qwen3_moe deferral test with parse tests for the two new families, an emit test checking that the qk-norm attention composes with the MoE FFN across every graph kind, and a deferral test naming each blocked family's reason.
  • validation.rs: registered byte-exact structural fixtures qwen3-moe-tiny and olmoe-tiny (per-head vs flat q/k-norm shapes locked, router + stacked-expert args present, no shared/dense-MLP args); generalized the golden-freeze helper over the MoE fixtures.
  • spike/openxla/moe_oracle.py: block-level checks for Qwen3-MoE and OLMoE against an HF fp32 MoE block.
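
The config delta between the two families can be sketched as follows. This is a Python illustration of the dispatch described in the emitter/config.rs item, not the Rust implementation: `parse_moe_config` and the returned dictionary layout are hypothetical, and the JSON keys (`model_type`, `num_experts`, `num_experts_per_tok`, `norm_topk_prob`) are assumed to follow the Hugging Face `config.json` naming for these models.

```python
import json


def parse_moe_config(text):
    """Hypothetical stand-in for the family dispatch in Config::from_json."""
    cfg = json.loads(text)
    model_type = cfg.get("model_type")

    if model_type == "qwen3_moe":
        # Per-head q/k RMSNorm; expert width comes from moe_intermediate_size.
        qk_norm = "per_head"
        expert_hidden = cfg["moe_intermediate_size"]
    elif model_type == "olmoe":
        # Assumption: any non-null clip_qkv counts as "set" and is rejected.
        if cfg.get("clip_qkv") is not None:
            raise ValueError("olmoe: clip_qkv is not supported yet (follow-up)")
        # Flat q/k RMSNorm; expert width comes from intermediate_size.
        qk_norm = "flat"
        expert_hidden = cfg["intermediate_size"]
    else:
        raise ValueError(f"unsupported model_type: {model_type!r}")

    return {
        "qk_norm": qk_norm,
        "expert_hidden": expert_hidden,
        "num_experts": cfg["num_experts"],
        "top_k": cfg["num_experts_per_tok"],
        "norm_topk_prob": cfg.get("norm_topk_prob", False),
        "shared_expert": None,  # neither family has a shared expert
    }
```

The point of the sketch is that the two families differ only in which key sizes the experts and in the q/k-norm shape; everything downstream of the returned record is shared.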

Validation

  • Block-level token-exact vs HF fp32 (spike/openxla/moe_oracle.py, IREE llvm-cpu): Qwen3-MoE max block diff ~1.9e-9, OLMoE ~2.2e-8; Qwen2-MoE / Mixtral still pass.
  • cargo test -p mlxcel-xla --lib: all tests pass (byte-exact gate now covers both new fixtures; all existing dense + MoE fixtures byte-identical).
  • cargo clippy -p mlxcel-xla --lib --tests -- -D warnings: clean.
  • Full-model token-exact and xla_batch_bench serve reference-exact need the xla-iree build and are deferred to the post-merge gate (block-level + structural landed in-agent, allowed by the epic).
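
The "max block diff" figures above are read here as a maximum absolute elementwise difference between the compiled block's output and the HF fp32 reference. The oracle's actual comparison code is not shown in this PR, so the following is only a sketch of that metric under that assumption; `max_block_diff` is a hypothetical name.

```python
def max_block_diff(candidate, reference):
    """Largest absolute elementwise difference between two flat outputs.

    Hypothetical helper; assumes both outputs are already flattened to
    equal-length sequences of floats.
    """
    if len(candidate) != len(reference):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(c - r) for c, r in zip(candidate, reference)), default=0.0)


if __name__ == "__main__":
    print(max_block_diff([1.0, 2.0, 3.0], [1.0, 2.5, 3.0]))
```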

Test plan

  • spike/openxla/moe_oracle.py block-level oracle (all four MoE families PASS)
  • cargo test -p mlxcel-xla --lib (all tests pass, byte-exact gate green)
  • cargo clippy -p mlxcel-xla --lib --tests -- -D warnings (clean)
  • existing dense + MoE fixtures byte-identical (no golden drift)

Closes #501

Extend the mixture-of-experts family coverage (issue #501) on top of the shared MoE FFN primitive (#500) and the qk-norm attention core (#497). Both new families route with the exact softmax-over-all-experts, top-k, `norm_topk_prob` renormalization the primitive already emits and have no shared expert, so they are a config plus weight-naming delta on the existing emit.

Qwen3-MoE: Qwen3 attention (per-head q/k RMSNorm, no bias) plus the MoE FFN; experts use `moe_intermediate_size`. OLMoE: a flat q/k RMSNorm (like OLMo2) on the standard pre-norm block plus the MoE FFN; experts use `intermediate_size`, and configs that set `clip_qkv` are rejected for now, with support left to a follow-up. `Config::from_json` detects both; `weight_specs` names the expert bank on the `mlp` prefix (`mlp.gate.weight`, `mlp.switch_mlp.{gate,up,down}_proj.weight`), matching the mlx-lm `switch_mlp` checkpoint layout.
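
To make the brace shorthand concrete, the expert-bank names expand as in the sketch below. The helper `moe_weight_names` and the example layer prefix `model.layers.0` are assumptions for illustration; only the `mlp.*` suffixes come from the text above.

```python
def moe_weight_names(layer_prefix):
    """Expand the expert-bank tensor names for one layer (hypothetical helper)."""
    mlp = f"{layer_prefix}.mlp"
    # Router weight first, then the stacked gate/up/down expert projections.
    names = [f"{mlp}.gate.weight"]
    names += [f"{mlp}.switch_mlp.{proj}_proj.weight" for proj in ("gate", "up", "down")]
    return names


if __name__ == "__main__":
    # "model.layers.0" is an assumed prefix, not taken from the PR.
    for name in moe_weight_names("model.layers.0"):
        print(name)
```

Note that the list contains no shared-expert, dense-MLP, or bias tensors, which is the property the new `weight_specs` test is described as locking.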

Families whose routing or attention the shared core does not reproduce are deferred with a specific message and a test rather than mis-emitted: DeepSeek-V2/V3 (multi-head latent attention), PhiMoE (sparsemixer routing), GLM4-MoE / dots1 (grouped sigmoid routing with a score-correction bias), ERNIE-4.5-MoE (interleaved RoPE), and gpt-oss (attention sinks plus a clamped gated-SwiGLU activation).
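
The deferral behaviour amounts to a lookup from family to blocker. The sketch below mirrors the list above; the `model_type` strings are assumed to match the Hugging Face identifiers for these families, and `check_supported` and its message wording are hypothetical rather than the emitter's actual text.

```python
# Blocker per deferred family, taken from the paragraph above.
# The model_type keys are assumed identifiers, not confirmed by the PR.
BLOCKERS = {
    "deepseek_v2": "multi-head latent attention",
    "deepseek_v3": "multi-head latent attention",
    "phimoe": "sparsemixer routing",
    "glm4_moe": "grouped sigmoid routing with a score-correction bias",
    "dots1": "grouped sigmoid routing with a score-correction bias",
    "ernie4_5_moe": "interleaved RoPE",
    "gpt_oss": "attention sinks plus a clamped gated-SwiGLU activation",
}


def check_supported(model_type):
    """Raise with the family's specific blocker instead of mis-emitting."""
    reason = BLOCKERS.get(model_type)
    if reason is not None:
        raise NotImplementedError(f"{model_type} is deferred: needs {reason}")


if __name__ == "__main__":
    check_supported("qwen3_moe")  # passes silently
    try:
        check_supported("phimoe")
    except NotImplementedError as err:
        print(err)
```

Failing early with a named reason keeps an unsupported checkpoint from silently compiling into a graph with the wrong routing or attention.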

Validation: the block-level oracle (`spike/openxla/moe_oracle.py`) proves both new families token-exact against an HF fp32 MoE block (Qwen3-MoE max block diff ~1.9e-9, OLMoE ~2.2e-8, IREE llvm-cpu). Byte-exact structural fixtures `qwen3-moe-tiny` and `olmoe-tiny` join the registered gate, locking the per-head vs flat q/k-norm shapes and the router / stacked-expert args on the composed emit. The full-model token-exact run and the serve reference-exact check need the xla-iree build and are deferred to the post-merge gate. All existing dense and MoE fixtures stay byte-identical.
@inureyes added the labels status:review (Under review), type:enhancement (New features, capabilities, or significant additions), priority:medium (Medium priority), area:models (Model architectures, weights, loading, metadata), area:architecture (Architecture and code structure changes), and status:done (Completed), and removed status:review (Under review), on Jul 1, 2026
@inureyes merged commit 5a70c96 into main on Jul 1, 2026
5 checks passed

Labels

  • area:architecture: Architecture and code structure changes
  • area:models: Model architectures, weights, loading, metadata
  • priority:medium: Medium priority
  • status:done: Completed
  • type:enhancement: New features, capabilities, or significant additions

Development

Successfully merging this pull request may close these issues.

feat: MoE family coverage on the shared FFN primitive (Mixtral, Qwen2/Qwen3-MoE, OLMoE, DeepSeek-V2/V3, gpt-oss, and more)