fix(xla): load bf16 affine quant scales in the OpenXLA weight loader by inureyes · Pull Request #569 · lablup/mlxcel

inureyes · 2026-07-01T06:53:27Z

The XLA 4bit/8bit dequant loader hardcoded an f16 expectation for the affine scales/biases and rejected anything else. The pre-existing baselines (Llama-3.2-1B, Qwen2.5-0.5B) happen to store f16 scales, so the limitation went unnoticed. Every Qwen3 dense checkpoint, Gemma3-27B, and Qwen3-MoE store bf16 scales, however, so `XlaReferenceEngine::load` failed with "scales/biases dtype BF16/BF16, expected F16" and those architectures could not load or generate on the XLA path at all. This surfaced during the epic #493 GPU (xla-iree, CUDA) validation sweep.

Thread a `scales_bf16` flag through `dequantize_affine` and `dequantize_affine_stacked` and widen the matching 16-bit format to f32 via the existing `bf16_to_f32` (every f16 and bf16 value is exactly representable in f32, so the widening is lossless). `iree.rs` now accepts a matching F16 or BF16 scale/bias pair and rejects only a mismatched or non-16-bit pair. This completes the "loads and generates via CLI" acceptance for the Qwen3 / Gemma3 / MoE families added in epic #493.
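The shape of the change can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the `Dtype` enum, `scales_are_bf16`, `widen_16`, and `dequantize_group_sketch` are hypothetical names, the bodies of `bf16_to_f32` and `f16_to_f32` are written from the bit layouts rather than copied from mlxcel, and the quantized values are assumed to be already unpacked to one `u8` per element. It assumes the usual affine form `w = scale * q + bias`.

```rust
/// Storage dtype of a checkpoint tensor (hypothetical enum for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Dtype {
    F16,
    BF16,
    F32,
    U32,
}

/// Accept a matching F16 or BF16 scale/bias pair and report which one it is.
/// Mismatched or non-16-bit pairs are rejected.
fn scales_are_bf16(scales: Dtype, biases: Dtype) -> Result<bool, String> {
    match (scales, biases) {
        (Dtype::F16, Dtype::F16) => Ok(false),
        (Dtype::BF16, Dtype::BF16) => Ok(true),
        _ => Err(format!(
            "scales/biases dtype {scales:?}/{biases:?}, expected a matching F16 or BF16 pair"
        )),
    }
}

/// bf16 is the upper 16 bits of an f32, so widening is a left shift.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// IEEE 754 binary16 to f32: 1 sign, 5 exponent, 10 fraction bits.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x3ff) as u32;
    let out = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, _) => {
            // Subnormal: value is frac * 2^-24, exact in f32.
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1f, _) => (sign << 31) | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13), // rebias 15 -> 127
    };
    f32::from_bits(out)
}

/// Widen one 16-bit scale or bias according to the flag.
fn widen_16(bits: u16, scales_bf16: bool) -> f32 {
    if scales_bf16 {
        bf16_to_f32(bits)
    } else {
        f16_to_f32(bits)
    }
}

/// Dequantize one group: w = scale * q + bias, computed in f32.
fn dequantize_group_sketch(q: &[u8], scale_bits: u16, bias_bits: u16, scales_bf16: bool) -> Vec<f32> {
    let scale = widen_16(scale_bits, scales_bf16);
    let bias = widen_16(bias_bits, scales_bf16);
    q.iter().map(|&v| scale * v as f32 + bias).collect()
}
```

With a scale of 0.5 and a bias of -1.0, the quantized values `[0, 1, 15]` dequantize to `[-1.0, -0.5, 6.5]` whether the pair is encoded as f16 (`0x3800`, `0xBC00`) or bf16 (`0x3F00`, `0xBF80`). Only the widening step depends on the flag.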

Validation: new bf16-scales dequant unit test (2.0/0.5/10.0/-1.0 are exact in bf16, recovered row matches the f16 hand example); full mlxcel-xla lib suite green (112 passed); clippy/fmt clean. On a GB10 with the xla-iree CUDA build, qwen3-0.6b-4bit (bf16 scales) is now token-exact single-sequence vs the HF fp32 oracle (was previously unloadable), and gemma3-1b-4bit + stablelm-1.6b-4bit pass both the single-seq and serve gates.
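The reason the test values can be reused across both formats is that each needs very few significant bits. A small sketch of that check follows, assuming bf16 is obtained by keeping the upper 16 bits of the f32 encoding; `is_exact_in_bf16` is a hypothetical helper, not a function from mlxcel.

```rust
/// A finite f32 is exactly representable in bf16 when the low 16 bits of its
/// encoding are zero, i.e. nothing is lost by keeping only the upper half.
fn is_exact_in_bf16(x: f32) -> bool {
    x.to_bits() & 0xFFFF == 0
}

/// Round-trip through bf16 by truncation (hypothetical helper for the sketch).
fn through_bf16(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0xFFFF_0000)
}
```

Under this check 2.0, 0.5, 10.0, and -1.0 all survive the round trip unchanged, while a value such as 0.1 does not, so a test built on the first four can compare against the f16 hand example with exact equality.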

Relates to #493.

inureyes added 2 commits July 1, 2026 15:42

style(xla): rustfmt-wrap the bf16 stacked-dequant test call

6ef7dbc

inureyes merged commit ad6e786 into main Jul 1, 2026
5 checks passed

inureyes deleted the fix/xla-bf16-quant-scales branch July 1, 2026 06:58

This was referenced Jul 1, 2026

epic: OpenXLA backend architecture-coverage parity with the MLX engine #493

Closed

epic: realize low-precision / quantized performance on the OpenXLA (IREE) backend #570

Open

feat: fused quantized-matmul realizing the int8 bandwidth win #574

Open
