feat: add Qwen3-MoE and OLMoE on the shared MoE FFN primitive #563
Merged
Extend the mixture-of-experts family coverage (issue #501) on top of the shared MoE FFN primitive (#500) and the qk-norm attention core (#497). Both new families route with the exact softmax-over-all-experts, top-k, `norm_topk_prob` renormalization the primitive already emits and have no shared expert, so they are a config plus weight-naming delta on the existing emit.

Qwen3-MoE: Qwen3 attention (per-head q/k RMSNorm, no bias) plus the MoE FFN; experts use `moe_intermediate_size`.

OLMoE: a flat q/k RMSNorm (like OLMo2) on the standard pre-norm block plus the MoE FFN; experts use `intermediate_size`, and `clip_qkv` is rejected as a follow-up.

`Config::from_json` detects both; `weight_specs` names the expert bank on the `mlp` prefix (`mlp.gate.weight`, `mlp.switch_mlp.{gate,up,down}_proj.weight`), matching the mlx-lm `switch_mlp` checkpoint layout.

Families whose routing or attention the shared core does not reproduce are deferred with a specific message and a test rather than mis-emitted: DeepSeek-V2/V3 (multi-head latent attention), PhiMoE (sparsemixer routing), GLM4-MoE / dots1 (grouped sigmoid routing with a score-correction bias), ERNIE-4.5-MoE (interleaved RoPE), and gpt-oss (attention sinks plus a clamped gated-SwiGLU activation).

Validation: the block-level oracle (`spike/openxla/moe_oracle.py`) proves both new families token-exact against an HF fp32 MoE block (Qwen3-MoE max block diff ~1.9e-9, OLMoE ~2.2e-8, IREE llvm-cpu). Byte-exact structural fixtures `qwen3-moe-tiny` and `olmoe-tiny` join the registered gate, locking the per-head vs flat q/k-norm shapes and the router / stacked-expert args on the composed emit. The full-model token-exact run and the serve reference-exact check need the xla-iree build and are deferred to the post-merge gate. All existing dense and MoE fixtures stay byte-identical.
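For readers unfamiliar with the routing named above, here is a minimal pure-Python sketch of softmax over all experts, top-k selection, and optional `norm_topk_prob` renormalization for a single token. It is an illustration only, not the emitter's or the oracle's code: the function name `route`, its signature, and the lower-index tie-break are assumptions made for this example.

```python
import math


def route(router_logits, top_k, norm_topk_prob=True):
    """Hypothetical helper: return [(expert_index, weight), ...] for one token."""
    # Softmax over ALL experts, computed before any selection happens.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k most probable experts.
    # Assumption: ties go to the lower expert index.
    chosen = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:top_k]
    weights = [probs[i] for i in chosen]

    # With norm_topk_prob set, rescale the kept weights so they sum to 1.
    if norm_topk_prob:
        kept = sum(weights)
        weights = [w / kept for w in weights]

    return list(zip(chosen, weights))


if __name__ == "__main__":
    # Four experts, two selected per token.
    print(route([1.0, 3.0, 2.0, 0.0], top_k=2))
    print(route([1.0, 3.0, 2.0, 0.0], top_k=2, norm_topk_prob=False))
```

Without renormalization the kept weights are the raw softmax probabilities, so they sum to less than 1 whenever `top_k` is smaller than the number of experts. The token's FFN output is then the weighted sum of the selected experts' outputs.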
Summary
Extend the mixture-of-experts family coverage (issue #501) on top of the shared MoE FFN primitive (#500) and the qk-norm attention core (#497). Qwen3-MoE and OLMoE both route with the exact softmax-over-all-experts, top-k, `norm_topk_prob` renormalization the primitive already emits and have no shared expert, so they land as a config plus weight-naming delta on the existing emit. Families whose routing or attention the shared core does not provide are deferred with a specific message rather than mis-emitted.

What changed
- `emitter/config.rs`: `Config::from_json` now parses `qwen3_moe` (per-head q/k RMSNorm attention, no bias, experts on `moe_intermediate_size`) and `olmoe` (flat q/k RMSNorm on the standard pre-norm block, experts on `intermediate_size`, `clip_qkv` rejected as a follow-up), each building a `MoeConfig` with no shared expert. DeepSeek-V2/V3, PhiMoE, GLM4-MoE, dots1, ERNIE-4.5-MoE, and gpt-oss are rejected with a blocker-specific message.
- `weights.rs`: a `weight_specs` test locks the new families' expert-bank naming on the `mlp` prefix (`mlp.gate.weight`, `mlp.switch_mlp.{gate,up,down}_proj.weight`), matching the mlx-lm `switch_mlp` checkpoint layout, with no shared-expert / dense-MLP / bias tensors. The existing generic MoE `weight_specs` needed no change.
- `emitter/mod.rs`: replaced the outdated `qwen3_moe` deferral test with parse tests for the two new families, an emit test that the qk-norm attention composes with the MoE FFN across every graph kind, and a deferral test naming each blocked family's reason.
- `validation.rs`: registered byte-exact structural fixtures `qwen3-moe-tiny` and `olmoe-tiny` (per-head vs flat q/k-norm shapes locked, router + stacked-expert args present, no shared/dense-MLP args); generalized the golden-freeze helper over the MoE fixtures.
- `spike/openxla/moe_oracle.py`: block-level checks for Qwen3-MoE and OLMoE against an HF fp32 MoE block.

Validation
- Block-level oracle (`spike/openxla/moe_oracle.py`, IREE llvm-cpu): Qwen3-MoE max block diff ~1.9e-9, OLMoE ~2.2e-8; Qwen2-MoE / Mixtral still pass.
- `cargo test -p mlxcel-xla --lib`: all tests pass (byte-exact gate now covers both new fixtures; all existing dense + MoE fixtures byte-identical).
- `cargo clippy -p mlxcel-xla --lib --tests -- -D warnings`: clean.
- `xla_batch_bench` and serve reference-exact need the xla-iree build and are deferred to the post-merge gate (block-level + structural landed in-agent, allowed by the epic).

Test plan
- `spike/openxla/moe_oracle.py` block-level oracle (all four MoE families PASS)
- `cargo test -p mlxcel-xla --lib` (all tests pass, byte-exact gate green)
- `cargo clippy -p mlxcel-xla --lib --tests -- -D warnings` (clean)

Closes #501