feat: add Qwen3-MoE and OLMoE on the shared MoE FFN primitive by inureyes · Pull Request #563 · lablup/mlxcel

inureyes · 2026-07-01T04:50:21Z

Summary

Extend the mixture-of-experts family coverage (issue #501) on top of the shared MoE FFN primitive (#500) and the qk-norm attention core (#497). Qwen3-MoE and OLMoE both route with the exact softmax-over-all-experts, top-k, norm_topk_prob renormalization the primitive already emits and have no shared expert, so they land as a config plus weight-naming delta on the existing emit. Families whose routing or attention the shared core does not provide are deferred with a specific message rather than mis-emitted.
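The routing rule named above (softmax over all experts, then top-k, then optional `norm_topk_prob` renormalization) can be sketched in a few lines of plain Python. This is an illustrative sketch for a single token, not the primitive's emitted graph; the function name `route` and its return shape are assumptions made for the example.

```python
import math


def route(logits, top_k, norm_topk_prob=True):
    """Return (expert_index, weight) pairs for one token's router logits.

    Hypothetical helper for illustration; not part of mlxcel.
    """
    # Softmax over ALL experts first (shifted by the max for stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts, highest probability first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in top]

    # norm_topk_prob: rescale the kept weights so they sum to 1.
    if norm_topk_prob:
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return list(zip(top, weights))
```

With `norm_topk_prob=False` the kept weights are the raw softmax probabilities, so they sum to less than 1 whenever `top_k` is smaller than the number of experts.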

What changed

emitter/config.rs: Config::from_json now parses qwen3_moe (per-head q/k RMSNorm attention, no bias, experts on moe_intermediate_size) and olmoe (flat q/k RMSNorm on the standard pre-norm block, experts on intermediate_size, clip_qkv rejected as a follow-up), each building a MoeConfig with no shared expert. DeepSeek-V2/V3, PhiMoE, GLM4-MoE, dots1, ERNIE-4.5-MoE, and gpt-oss are rejected with a blocker-specific message.
weights.rs: a weight_specs test locks the new families' expert-bank naming on the mlp prefix (mlp.gate.weight, mlp.switch_mlp.{gate,up,down}_proj.weight), matching the mlx-lm switch_mlp checkpoint layout, with no shared-expert / dense-MLP / bias tensors. The existing generic MoE weight_specs needed no change.
emitter/mod.rs: replaced the outdated qwen3_moe deferral test with parse tests for the two new families, an emit test that the qk-norm attention composes with the MoE FFN across every graph kind, and a deferral test naming each blocked family's reason.
validation.rs: registered byte-exact structural fixtures qwen3-moe-tiny and olmoe-tiny (per-head vs flat q/k-norm shapes locked, router + stacked-expert args present, no shared/dense-MLP args); generalized the golden-freeze helper over the MoE fixtures.
spike/openxla/moe_oracle.py: block-level checks for Qwen3-MoE and OLMoE against an HF fp32 MoE block.
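The detect-or-defer behaviour described for `Config::from_json` can be sketched as follows. This is a Python illustration of the dispatch logic only, not the Rust implementation: the function name, the returned dictionary keys, the error wording, and the `model_type` strings used for the deferred families are all assumptions. The blocker reasons are the ones listed in this PR's commit message.

```python
import json

# Blocker reasons as stated in the PR; the model_type keys are assumed.
DEFERRED = {
    "deepseek_v2": "multi-head latent attention",
    "deepseek_v3": "multi-head latent attention",
    "phimoe": "sparsemixer routing",
    "glm4_moe": "grouped sigmoid routing with a score-correction bias",
    "dots1": "grouped sigmoid routing with a score-correction bias",
    "ernie4_5_moe": "interleaved RoPE",
    "gpt_oss": "attention sinks plus a clamped gated-SwiGLU activation",
}


def parse_moe_config(text):
    """Hypothetical stand-in for the family detection in Config::from_json."""
    cfg = json.loads(text)
    model_type = cfg.get("model_type")

    # Deferred families fail loudly with their specific blocker.
    if model_type in DEFERRED:
        raise ValueError(f"{model_type} is deferred: {DEFERRED[model_type]}")

    if model_type == "qwen3_moe":
        # Per-head q/k RMSNorm; experts sized by moe_intermediate_size.
        return {"qk_norm": "per_head",
                "expert_hidden": cfg["moe_intermediate_size"],
                "shared_expert": False}

    if model_type == "olmoe":
        # clip_qkv is rejected as a follow-up rather than silently ignored.
        if cfg.get("clip_qkv") is not None:
            raise ValueError("olmoe with clip_qkv is deferred to a follow-up")
        # Flat q/k RMSNorm; experts sized by intermediate_size.
        return {"qk_norm": "flat",
                "expert_hidden": cfg["intermediate_size"],
                "shared_expert": False}

    raise ValueError(f"unsupported model_type: {model_type!r}")
```

Rejecting with a named reason keeps an unsupported checkpoint from being emitted with the wrong routing or attention.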

Validation

Block-level token-exact vs HF fp32 (spike/openxla/moe_oracle.py, IREE llvm-cpu): Qwen3-MoE max block diff ~1.9e-9, OLMoE ~2.2e-8; Qwen2-MoE / Mixtral still pass.
cargo test -p mlxcel-xla --lib: all tests pass (byte-exact gate now covers both new fixtures; all existing dense + MoE fixtures byte-identical).
cargo clippy -p mlxcel-xla --lib --tests -- -D warnings: clean.
The full-model token-exact run and the xla_batch_bench serve reference-exact check need the xla-iree build and are deferred to the post-merge gate (the block-level and structural checks landed in-agent, as the epic allows).
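For readers unfamiliar with the "max block diff" figures above, the metric is the largest elementwise absolute difference between the reference block output and the compiled block output. A minimal sketch, assuming both outputs have been flattened to equal-length lists of floats; the helper names and the tolerance argument are assumptions, not taken from `moe_oracle.py`.

```python
def max_abs_diff(reference, candidate):
    """Largest elementwise absolute difference between two flat outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def block_matches(reference, candidate, tol):
    """True when the worst-case difference stays within tol (assumed threshold)."""
    return max_abs_diff(reference, candidate) <= tol
```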

Test plan

spike/openxla/moe_oracle.py block-level oracle (all four MoE families PASS)
cargo test -p mlxcel-xla --lib (all tests pass, byte-exact gate green)
cargo clippy -p mlxcel-xla --lib --tests -- -D warnings (clean)
existing dense + MoE fixtures byte-identical (no golden drift)

Closes #501

Extend the mixture-of-experts family coverage (issue #501) on top of the shared MoE FFN primitive (#500) and the qk-norm attention core (#497). Both new families route with the exact softmax-over-all-experts, top-k, `norm_topk_prob` renormalization the primitive already emits and have no shared expert, so they are a config plus weight-naming delta on the existing emit.

Qwen3-MoE: Qwen3 attention (per-head q/k RMSNorm, no bias) plus the MoE FFN; experts use `moe_intermediate_size`.

OLMoE: a flat q/k RMSNorm (like OLMo2) on the standard pre-norm block plus the MoE FFN; experts use `intermediate_size`, and `clip_qkv` is rejected as a follow-up.

`Config::from_json` detects both; `weight_specs` names the expert bank on the `mlp` prefix (`mlp.gate.weight`, `mlp.switch_mlp.{gate,up,down}_proj.weight`), matching the mlx-lm `switch_mlp` checkpoint layout.

Families whose routing or attention the shared core does not reproduce are deferred with a specific message and a test rather than mis-emitted: DeepSeek-V2/V3 (multi-head latent attention), PhiMoE (sparsemixer routing), GLM4-MoE / dots1 (grouped sigmoid routing with a score-correction bias), ERNIE-4.5-MoE (interleaved RoPE), and gpt-oss (attention sinks plus a clamped gated-SwiGLU activation).

Validation: the block-level oracle (`spike/openxla/moe_oracle.py`) proves both new families token-exact against an HF fp32 MoE block (Qwen3-MoE max block diff ~1.9e-9, OLMoE ~2.2e-8, IREE llvm-cpu). Byte-exact structural fixtures `qwen3-moe-tiny` and `olmoe-tiny` join the registered gate, locking the per-head vs flat q/k-norm shapes and the router / stacked-expert args on the composed emit. The full-model token-exact run and the serve reference-exact check need the xla-iree build and are deferred to the post-merge gate. All existing dense and MoE fixtures stay byte-identical.
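The per-head versus flat q/k RMSNorm distinction that the fixtures lock can be illustrated on a single projected query vector. In this sketch the per-head variant (Qwen3 style) carries a weight of length `head_dim` and normalizes each head slice separately, while the flat variant (OLMo2 style) carries a weight of length `num_heads * head_dim` and normalizes the whole vector at once. The function names and the `eps` default are assumptions made for the example.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def per_head_qk_norm(q, num_heads, head_dim, weight):
    """Normalize each head slice on its own; weight has head_dim entries."""
    out = []
    for h in range(num_heads):
        out.extend(rms_norm(q[h * head_dim:(h + 1) * head_dim], weight))
    return out


def flat_qk_norm(q, weight):
    """Normalize the full projection; weight has num_heads * head_dim entries."""
    return rms_norm(q, weight)
```

The two variants differ in the shape of the norm weight and in which elements share a normalization statistic, which is why the structural fixtures pin the shapes separately for each family.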

inureyes merged commit 5a70c96 into main Jul 1, 2026
5 checks passed

inureyes mentioned this pull request Jul 1, 2026

epic: OpenXLA backend architecture-coverage parity with the MLX engine #493

Closed

14 tasks

inureyes merged 1 commit into main from feature/issue-501-moe-family-coverage
