feat: dense arch pack (Seed-OSS, MiMo, InternLM3, ExaOne) in mlxcel-xla (#499) by inureyes · Pull Request #560 · lablup/mlxcel

inureyes · 2026-07-01T02:04:15Z

Summary

Adds four dense Llama-family architectures to the OpenXLA (mlxcel-xla) emitter as config / naming deltas over the two already-proven forwards (Llama and Qwen2), part of epic #493. The emitter's forward pass (emitter/model.rs) is untouched: each family was verified from its modeling code to use the emitter's half-split RoPE, cat(freqs, freqs) table and head_dim^-0.5 scaling, RMSNorm and SwiGLU, so its emitted graph is a proven graph rather than a new one.
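The half-split convention named above can be illustrated with a minimal pure-Python sketch. It is not the emitter's code: the function names and the `base` default are assumptions made for illustration. It contrasts the half-split pairing (element `i` with element `i + head_dim/2`, angles laid out as `cat(freqs, freqs)`) with the interleaved GPT-J pairing (element `2i` with element `2i + 1`).

```python
import math

def rope_angles(position, head_dim, base=10000.0):
    """One angle per rotated pair: position * base^(-2i / head_dim)."""
    half = head_dim // 2
    return [position * base ** (-2.0 * i / head_dim) for i in range(half)]

def rope_half_split(x, position, base=10000.0):
    """Half-split RoPE: element i is rotated together with element i + d/2."""
    d = len(x)
    half = d // 2
    freqs = rope_angles(position, d, base)
    angles = freqs + freqs                       # cat(freqs, freqs)
    rotated = [-v for v in x[half:]] + x[:half]  # rotate_half(x)
    return [x[i] * math.cos(angles[i]) + rotated[i] * math.sin(angles[i])
            for i in range(d)]

def rope_interleaved(x, position, base=10000.0):
    """Interleaved (GPT-J) RoPE: element 2i is rotated together with element 2i + 1."""
    d = len(x)
    out = [0.0] * d
    for i, angle in enumerate(rope_angles(position, d, base)):
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * math.cos(angle) - x1 * math.sin(angle)
        out[2 * i + 1] = x0 * math.sin(angle) + x1 * math.cos(angle)
    return out

def attention_scale(head_dim):
    """Score scaling applied before softmax: head_dim^-0.5."""
    return head_dim ** -0.5
```

Both variants apply the same rotations to differently paired elements, so they agree only up to a fixed permutation of the head dimension. Weights trained under one convention therefore produce different logits under the other, which is why a family can share the half-split graph only if its modeling code uses that pairing.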

Families added

Seed-OSS (seed_oss): q/k/v projection bias (from attention_bias), untied embeddings, rope_type = "default" served as plain RoPE. This is the proven Qwen2 bias forward with standard names.
MiMo (mimo): MiMoForCausalLM subclasses Qwen2ForCausalLM and reuses Qwen2Attention / Qwen2MLP / Qwen2RMSNorm verbatim (q/k/v bias, untied, plain RoPE); its multi-token-prediction heads are not loaded and its config sliding_window is served globally (None), as for Qwen2.
InternLM3 (internlm3): standard names, untied, rope_type = "dynamic" served as plain RoPE (dynamic NTK is identity within the original context; long-context rescale is a follow-up).
ExaOne 3.x (exaone): llama3 RoPE, tied, the GPT-2-style transformer.h.{i}... tensor names via a new WeightScheme::Exaone, and the num_layers / layer_norm_epsilon alternate config fields. Verified against the checkpoint's modeling_exaone.py (gated MLP c_proj(act(c_fc_0(x)) * c_fc_1(x)), so c_fc_0 is the gate and c_fc_1 the up projection).
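The InternLM3 item relies on dynamic NTK being the identity within the original context. A small sketch of why, assuming the commonly used dynamic-NTK base rescale (the formulation found in Hugging Face-style RoPE utilities); the function and parameter names are illustrative and not taken from the emitter:

```python
def dynamic_ntk_base(base, head_dim, seq_len, original_max_positions, factor):
    """Rescaled RoPE base under dynamic NTK scaling (assumed formulation).

    Sequence lengths at or below the original context are clamped up to it,
    so the scale term collapses to exactly 1 and the base is unchanged.
    """
    if head_dim <= 2:
        raise ValueError("head_dim must be greater than 2")
    effective_len = max(seq_len, original_max_positions)
    scale = factor * effective_len / original_max_positions - (factor - 1.0)
    return base * scale ** (head_dim / (head_dim - 2))
```

For any `seq_len` up to `original_max_positions` the returned base equals the input base, so plain RoPE produces the same table. Only longer sequences change the base, which is the long-context rescale left as a follow-up.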

What changed

emitter/config.rs: generalizes the q/k/v bias source (Qwen2 hard-coded, else attention_bias / qkv_bias), widens the accepted rope types (default / in-context dynamic -> plain), reads the alternate field names, and adds a WeightScheme. Out-of-scope deltas are rejected with clear errors: an attention output bias, an MLP bias, a non-SwiGLU activation, an unsupported rope type (e.g. yarn), and interleaved (GPT-J) RoPE (e.g. ERNIE-4.5).
weight_names.rs (new): the emitter's weight arg order moves to a pure-Rust module so ExaOne's GPT-2-style remap is unit-tested without the iree feature; iree.rs now calls it. The Llama scheme reproduces the previous names byte-for-byte, so Llama / Qwen2 / Gemma2 loading is unchanged.
validation.rs: registers each family as a byte-exact structural fixture (small synthetic configs keep the goldens small) and adds dense_pack_families_reuse_proven_graphs, which asserts each family emits StableHLO byte-for-byte identical to a proven Llama / Qwen2 reference across the single, prefill, and ragged graph kinds.
assets/{seed_oss,mimo,internlm3,exaone}/: fixture config.json + frozen decode.mlir / prefill.mlir + a README documenting each delta and its validation.
spike/openxla/synthetic_arch_check.py (new): builds a tiny random SeedOssForCausalLM and matches the emitter's prefill logits through IREE to HF; spike/openxla/arch_execution_check.py (new): drives a real-checkpoint continuation against an HF fp32 oracle. Neither builds xla-iree.
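The ExaOne naming delta can be pictured as a per-scheme table from projection role to tensor leaf name, with the forward itself unchanged. The sketch below is illustrative only: it covers just the three MLP projections, uses leaf names rather than full tensor paths, and its table layout and helper names are assumptions, not the contents of weight_names.rs.

```python
import math

# Leaf names of the three MLP projections, keyed by role, per naming scheme.
# The ExaOne row follows c_proj(act(c_fc_0(x)) * c_fc_1(x)).
MLP_LEAF_NAMES = {
    "llama": {"gate": "gate_proj", "up": "up_proj", "down": "down_proj"},
    "exaone": {"gate": "c_fc_0", "up": "c_fc_1", "down": "c_proj"},
}

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    """Row-major matrix times vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def swiglu_mlp(weights, scheme, x):
    """down(silu(gate(x)) * up(x)), looking tensors up by the scheme's leaf names."""
    names = MLP_LEAF_NAMES[scheme]
    gate = matvec(weights[names["gate"]], x)
    up = matvec(weights[names["up"]], x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(weights[names["down"]], hidden)
```

With the same matrices stored under either scheme's names, `swiglu_mlp` returns identical outputs, while swapping the gate and up roles changes the result. That asymmetry is why the role of c_fc_0 versus c_fc_1 had to be confirmed from the modeling code rather than guessed from the names.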

Validation

cargo test -p mlxcel-xla --lib: 72 passed (byte-exact fixtures, the reuse-proven-graphs equivalence, config-parse and weight-name unit tests, and the existing Llama/Qwen2/Gemma2 gates all green; the llama-3.2-1b goldens are unchanged).
Forward parity: synthetic_arch_check.py matches Seed-OSS to HF eager to max|logit diff| = 3.7e-9. The arch_execution_check.py harness ran end to end on the smallest checkpoint and correctly rejected ERNIE-4.5 (interleaved RoPE diverged from HF by ~8, vs ~1e-9 for a supported family), which is why it is deferred.
cargo clippy -p mlxcel-xla --lib --tests -- -D warnings: clean; cargo fmt applied.
The heavy token-exact / serve gate needs the native IREE build and runs post-merge via scripts/xla/validate_arch.sh on the real checkpoints.
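The forward-parity figure quoted above is a maximum absolute difference over all logits. A minimal sketch of that metric follows; the function names and the default tolerance are assumptions chosen for illustration (the harnesses' actual threshold is not stated here), and the inputs are plain nested lists rather than framework tensors.

```python
def max_abs_logit_diff(candidate, reference):
    """Largest |candidate - reference| over every position and vocabulary entry."""
    if len(candidate) != len(reference):
        raise ValueError("position counts differ")
    worst = 0.0
    for cand_row, ref_row in zip(candidate, reference):
        if len(cand_row) != len(ref_row):
            raise ValueError("vocabulary sizes differ")
        for c, r in zip(cand_row, ref_row):
            worst = max(worst, abs(c - r))
    return worst

def logits_match(candidate, reference, tolerance=1e-6):
    """True when the worst-case difference is within the (assumed) tolerance."""
    return max_abs_logit_diff(candidate, reference) <= tolerance
```

Under a metric like this, a supported family at roughly 1e-9 and an interleaved-RoPE family at roughly 8 fall on clearly opposite sides of any reasonable tolerance, so the exact threshold matters little for the accept/reject decision.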

Deferred (documented follow-ups)

Out of scope for a dense half-split pack, each needing a distinct emitter subsystem: ERNIE-4.5 (interleaved GPT-J RoPE), Mistral / Ministral (YaRN rope + VLM language_model. prefix), ExaOne4 (QK-norm + post-norm placement), InternLM2 (fused wqkv loader surgery), GLM4-flash and Solar-Open (MoE + MLA, not dense), Baichuan-M1 (conv + differentiated sliding-window heads), Apertus (xIELU activation), and Hunyuan-dense (cross-layer attention). The config.json parser rejects each with a clear, specific error.
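The accept-or-reject behaviour described in this PR can be sketched as a small config resolver. This is an illustration, not a port of emitter/config.rs: the function name, the returned dictionary, the error type, and the `hidden_act`, `rope_scaling`, `num_hidden_layers`, and `rms_norm_eps` key names are assumptions based on common Hugging Face config conventions, and interleaved-RoPE detection is omitted because how the parser identifies it is not stated.

```python
import json

DENSE_MODEL_TYPES = {"llama", "qwen2", "seed_oss", "mimo", "internlm3", "exaone"}

class UnsupportedConfig(ValueError):
    """Raised for a delta outside the dense half-split pack."""

def resolve_dense_config(text):
    cfg = json.loads(text)
    model_type = cfg.get("model_type")
    if model_type not in DENSE_MODEL_TYPES:
        raise UnsupportedConfig(f"unsupported model_type: {model_type!r}")

    # Out-of-scope deltas are rejected with a specific message.
    if cfg.get("attention_out_bias", False):
        raise UnsupportedConfig("attention output bias is not supported")
    if cfg.get("mlp_bias", False):
        raise UnsupportedConfig("MLP bias is not supported")
    if cfg.get("hidden_act", "silu") != "silu":
        raise UnsupportedConfig("only the SwiGLU (silu) activation is supported")

    # default and in-context dynamic are both served as plain RoPE.
    rope_type = (cfg.get("rope_scaling") or {}).get("rope_type", "default")
    if rope_type in ("default", "dynamic"):
        rope = "plain"
    elif rope_type == "llama3":
        rope = "llama3"
    else:
        raise UnsupportedConfig(f"unsupported rope type: {rope_type!r}")

    # Qwen2 always has q/k/v bias; other families read it from the config.
    if model_type == "qwen2":
        qkv_bias = True
    else:
        qkv_bias = bool(cfg.get("attention_bias", cfg.get("qkv_bias", False)))

    # Alternate field names (ExaOne-style) are accepted as fallbacks.
    num_layers = cfg.get("num_hidden_layers", cfg.get("num_layers"))
    norm_eps = cfg.get("rms_norm_eps", cfg.get("layer_norm_epsilon"))
    if num_layers is None or norm_eps is None:
        raise UnsupportedConfig("missing layer count or norm epsilon")

    return {
        "model_type": model_type,
        "qkv_bias": qkv_bias,
        "rope": rope,
        "num_layers": num_layers,
        "norm_eps": norm_eps,
        "weight_scheme": "exaone" if model_type == "exaone" else "llama",
    }
```

Rejecting early at parse time keeps an unsupported checkpoint from silently emitting a graph that loads but computes the wrong function.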

Test plan

cargo test -p mlxcel-xla --lib (72 passed)
cargo clippy -p mlxcel-xla --lib --tests -- -D warnings (clean)
spike/openxla/synthetic_arch_check.py (Seed-OSS forward parity ~1e-9)
Post-merge: scripts/xla/validate_arch.sh --model <ckpt> token-exact / serve gate per family (needs the native IREE build)

Closes #499

Add four dense Llama-family architectures to the OpenXLA (mlxcel-xla) emitter as config / naming deltas over the two already-proven forwards (Llama and Qwen2), part of epic #493. Each was verified from its modeling code to use the emitter's half-split RoPE, `cat(freqs, freqs)` table and `head_dim^-0.5` scaling, RMSNorm and SwiGLU, so its emitted graph is a proven graph rather than a new one; the emitter's forward pass (model.rs) is untouched. Families: Seed-OSS (`seed_oss`, q/k/v bias from `attention_bias`, untied, `default` rope served as plain); MiMo (`mimo`, which subclasses Qwen2 verbatim, its multi-token-prediction heads ignored, config `sliding_window` served globally); InternLM3 (`internlm3`, untied, `dynamic` NTK rope served as plain in-context); ExaOne 3.x (`exaone`, llama3 rope, tied, GPT-2-style tensor names via a new `WeightScheme::Exaone`, and the `num_layers` / `layer_norm_epsilon` alternate config fields). Config parsing (`emitter/config.rs`) generalizes the q/k/v bias source (Qwen2 hard-coded, else `attention_bias` / `qkv_bias`), widens the accepted rope types (`default` / in-context `dynamic` -> plain), and reads the alternate field names, while rejecting out-of-scope deltas with clear errors: an attention output bias, an MLP bias, a non-SwiGLU activation, an unsupported rope type, and interleaved (GPT-J) RoPE. Tensor naming moves to a new pure-Rust `weight_names` module so ExaOne's GPT-2-style remap is unit-tested without the `iree` feature (the loader in `iree.rs` now calls it); the `Llama` scheme reproduces the previous names byte-for-byte, so Llama / Qwen2 / Gemma2 loading is unchanged. 
Validation: each family is registered as a byte-exact structural fixture (small synthetic configs keep the goldens small; the real `config.json` parsing is asserted in `config::tests`), and `dense_pack_families_reuse_proven_graphs` asserts each emits StableHLO byte-for-byte identical to a proven Llama / Qwen2 reference across the single, prefill, and ragged graph kinds. Two Python harnesses under `spike/openxla` prove forward parity through IREE without building `xla-iree`: `synthetic_arch_check.py` matches a tiny random `SeedOssForCausalLM` to the emitter's prefill logits to ~1e-9, and `arch_execution_check.py` drives a real-checkpoint continuation against an HF fp32 oracle (it surfaced ERNIE-4.5's interleaved RoPE, a delta config inspection misses). The heavy token-exact / serve gate runs post-merge via `scripts/xla/validate_arch.sh`. Deferred to follow-ups (documented, out of scope for a dense half-split pack): ERNIE-4.5 (interleaved GPT-J RoPE), Mistral / Ministral (YaRN rope plus a VLM `language_model.` prefix), ExaOne4 (QK-norm plus post-norm placement), InternLM2 (fused `wqkv` needing loader surgery), GLM4-flash and Solar-Open (MoE + MLA, not dense), Baichuan-M1 (conv + differentiated sliding-window heads), Apertus (xIELU activation), and Hunyuan-dense (cross-layer attention).

inureyes · 2026-07-01T02:08:35Z

Merge note: conflicts with #558 (complementary arch sets)

This PR was branched from e7ffc53. Sibling epic-#493 unit #558 ("dense arch pack for qk-norm and Gemma family") landed on main afterwards and restructured the same shared files, so GitHub reports this PR as conflicting. The two arch sets are disjoint and complementary, not duplicated:

This PR (#499, "feat: dense arch pack for remaining Llama-family architectures (Mistral/Ministral, ExaOne, InternLM, Baichuan, GLM4, and more)"): seed_oss, mimo, internlm3, exaone (config / naming deltas over the proven Llama and Qwen2 forwards; no emitter/model.rs change).
#558 ("feat: dense arch pack for qk-norm and Gemma family (Qwen3, Gemma1, Gemma3, SmolLM3, OLMo2/3)"): qwen3, gemma1, gemma3, smollm3, olmo2, olmo3 (QK-norm / norm-placement / NoPE, with model.rs + rope.rs changes).

The merge needs to rebase this PR's additions, which are otherwise orthogonal to #558's changes, onto #558's redesigned Config:

emitter/config.rs::from_json_str: add the four model_types, the rope-type widening (default / in-context dynamic -> plain), the generalized q/k/v bias source (attention_bias / qkv_bias), the ExaOne alternate fields (num_layers / layer_norm_epsilon), the WeightScheme field, and the interleaved-RoPE (ERNIE-4.5) rejection. #558 replaced the gemma2: bool with NormStyle / QkNorm / embed_scale / norm_one_plus / mlp_geglu flags, so the Config literals here (weight_scheme) need to coexist with #558's new fields.
Tensor naming: this PR extracted weight_names into a pure-Rust weight_names module (so ExaOne's GPT-2-style remap is unit-tested without the iree feature); #558 kept weight_names inline in iree.rs and extended it (q/k norms, has_input_norm). Reconcile to one location, folding in the WeightScheme::Exaone remap.
validation.rs: union the fixture registrations and the dense_pack_families_reuse_proven_graphs cases.

Per the workflow, no merge was performed here.

Integrate origin/main (PR #558, issue #497: the qk-norm / Gemma dense pack for Qwen3, Gemma1/3, SmolLM3, OLMo2/3) into feature/issue-499-dense-arch-pack (issue #499: the remaining-Llama-family pack for Seed-OSS, MiMo, InternLM3, ExaOne). The two sides restructured the same shared files with disjoint, complementary architecture sets; this merge unifies them so one Config detects and one emitter serves both.

config.rs: keep #497's orthogonal decomposition of the old `gemma2: bool` switch (NormStyle / QkNorm plus embed_scale / norm_one_plus / mlp_geglu / rope_local_base / sliding_pattern / use_rope_layers) and add #499's WeightScheme, the seed_oss / mimo / internlm3 / exaone match arms (config-driven q/k/v bias), the num_layers / layer_norm_epsilon alternate field reads, the default / dynamic rope types served as plain RoPE, and the ERNIE-4.5 / attention_out_bias / mlp_bias / non-SwiGLU rejections. Both flag sets now coexist on one Config, and both arch sets are detected.

Weight naming was the main structural conflict: #497 edited an inline weight_names in iree.rs, while #499 relocated naming into a new weight_names.rs module with a per-scheme name table. The two are consolidated into the module: iree.rs imports weight_names, and weight_names.rs folds #497's arch-conditional order (a skippable input_layernorm for the OLMo reordered post-norm, the q/k norm weights, and the independent pre / post feed-forward norms) into #499's Llama / ExaOne schemes. One naming layer covers both arch sets, and Llama / Qwen2 / Gemma2 name lists stay byte-identical.

validation.rs, assets, and the spike scripts are additive and disjoint, so both sides are kept. REGISTERED gains the four #499 golden fixtures (Seed-OSS / MiMo / InternLM3 / ExaOne); #497's golden-less STRUCTURAL_FAMILIES (Qwen3 / Gemma1/3 / SmolLM3 / OLMo2/3) stay alongside them.

The emitter core (model.rs, rope.rs, builder.rs) is byte-identical to origin/main, and Llama / Qwen2 / Gemma2 parse to the same flags, so their emitted graphs are byte-for-byte unchanged (registered_fixtures_are_byte_exact passes).

Validation: cargo test -p mlxcel-xla --lib is 84 passed / 0 failed; clippy -D warnings and fmt --check are clean; the spike execution checks are token-exact for both packs (Seed-OSS via synthetic_arch_check.py, and Qwen3 / Gemma1/3 / SmolLM3 / OLMo2 via dense_arch_pack_check.py).

inureyes added the type:enhancement, priority:low, area:architecture, status:review, and status:done labels and removed the status:review label on Jul 1, 2026

inureyes merged commit 8cdfd1f into main Jul 1, 2026
5 checks passed

inureyes mentioned this pull request Jul 1, 2026

epic: OpenXLA backend architecture-coverage parity with the MLX engine #493

Closed

14 tasks

inureyes merged 2 commits into main from feature/issue-499-dense-arch-pack
