Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions docs/adr/0004-compute-backend-session-seam-and-stablehlo-family.md
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,17 @@ Consequences and refinements:

This ADR stays Proposed; it moves to Accepted once the OpenXLA backend integrates behind the #448 session contract (Phase 3) and validates the full contract.

### Low-precision performance decision (2026-07)

Phase 3 landed the OpenXLA backend on macOS (Metal) and profiled it against MLX. On an M1 Ultra (Llama-3.2-1B, greedy) XLA-on-Metal runs about 117x slower than MLX. Profiling with `iree-benchmark-module` (pure runtime, no host glue) shows the Metal decode step is GPU-kernel-bound, not host-bound: a 13-thread CPU beats the Metal GPU on the same StableHLO graph, so the bottleneck is IREE's `metal-spirv` kernel codegen, not invoke overhead or bandwidth.

That split of the performance levers decides where the project invests:

- Graph-level precision and quantization, authored in the StableHLO graph, transfer to every IREE target, including future NPUs, for which low precision is the entry ticket rather than a 2x optimization. f16 / bf16 is landed (#514, #515): about 1.9x on Metal, token-exact, and it speeds up the CPU path too. This is in scope.
- Per-backend kernel codegen (the remaining ~50x to MLX) is upstream IREE's responsibility, is Metal-specific (it does not transfer to non-SPIR-V NPUs), and MLX already owns Apple-Silicon performance. Out of scope.

int8 / int4 weight quantization (#516) is the NPU lever, but its payoff is memory bandwidth: a compute-bound Metal decode cannot demonstrate it, and measuring it needs an actual NPU. It is deferred to a hardware-gated follow-up; on Metal only its token-exactness would be verifiable. Metal's absolute throughput is therefore a pessimistic proxy for an NPU, which brings its own optimized kernels.

## Consequences

- The `ComputeBackend` trait from PR #446 is reworked from a load-boundary contract returning `LoadedModel` into a session-engine contract. The selection skeleton (`select_backend`, the `Backend` enum, the `experimental-backend` feature gate) survives the rework; only the contract shape changes.
Expand Down
56 changes: 54 additions & 2 deletions src/lib/mlxcel-xla/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -124,12 +124,12 @@ The contraction (matmul) input precision is authored in the emitted StableHLO
graph, so it applies on every IREE target (CPU, CUDA, Metal, and future NPUs),
not just one backend:

- `f16` / `bf16` demote the f32 inputs of every `dot_general` to the narrow
- `f16` / `bf16`: demote the f32 inputs of every `dot_general` to the narrow
type while keeping the f32 accumulate and output, so only the matmuls change
and the sensitive elementwise ops (norm, softmax, RoPE) stay f32. A blanket
program-wide f32 to f16 is deliberately not done (it regressed accuracy and was
slower).
- `f32` no demotion.
- `f32`: no demotion.

**Default is per device:** `f16` on the GPU devices (`metal`, `cuda`), `f32` on
the CPU (`local-task` / `local-sync`). `MLXCEL_XLA_PRECISION` (`f16` | `bf16` |
Expand Down Expand Up @@ -272,6 +272,58 @@ Not every family bundles goldens: Qwen2.5 (`assets/qwen2.5-0.5b/`) is emitted at
load and covered by the emitter's structural tests plus the execution tier, with
no committed `.mlir`. The harness still drives its emit through `emit_graphs`.

## Performance and the low-precision decision

This backend is a portability / parity path, not a performance path. On Apple
Silicon, MLX is and remains the production backend; XLA runs the same StableHLO
graphs that target CUDA/Linux on a Mac, for development and cross-checking.

### Measured (M1 Ultra, Llama-3.2-1B-Instruct, greedy)

| | Metal | CPU (`local-task`) |
|---|---|---|
| one decode step, f32 | ~600 ms | ~233 ms |
| one decode step, f16 | ~291 ms (~2.1x) | ~187 ms (~1.25x) |
| end-to-end, f32 | ~1.5 tok/s | ~0.75 tok/s |
| end-to-end, f16 | ~3.0 tok/s | - |
| MLX (reference) | ~186 tok/s | - |

The decode-step figures are `iree-benchmark-module` (pure runtime, no host glue).

### Where the time goes

The Metal decode step is ~600 ms with the GPU busy the whole time (its host-side
`process_time` is ~50 ms), so the time is in the GPU kernels, not invoke overhead
or host round-trips. On the **same** StableHLO graph a 13-thread CPU (~233 ms)
beats the Metal GPU (~600 ms), which is the tell: the bottleneck is IREE's
`metal-spirv` kernel codegen (generic, unfused MSL via SPIRV-Cross), not
bandwidth. ~600 ms is ~110x the ~5 ms/token bandwidth floor MLX runs at.

### What is and is not worth optimizing

- **Graph-level (precision, quantization, op selection): in scope, transferable.**
Authored once in the portable graph, it helps every IREE target (CPU, CUDA,
Metal, future NPUs). f16 (landed) is ~1.9x and token-exact, and speeds up the
CPU path too. For NPUs, low precision / quantization is not a 2x optimization
but the entry ticket (they are int8 / fp16 native). This is where investment
pays off.
- **Per-backend kernel codegen (the remaining ~50x to MLX): out of scope.** That
is upstream IREE's job, is Metal-specific (does not transfer to non-SPIR-V
NPUs), and MLX already owns Apple-Silicon performance.

So Metal's absolute tok/s is a *pessimistic* proxy for an NPU, which brings its
own optimized kernels; what transfers is the graph, not the Metal tuning.

### Decision (2026-07)

- **In scope:** graph-level low precision. f16 / bf16 is landed. int8 / int4
weight quantization is the NPU lever, but its payoff is memory bandwidth, which
a compute-bound Metal decode cannot demonstrate and which needs an actual NPU to
measure; it is **deferred to a hardware-gated follow-up** (on Metal only its
token-exactness would be verifiable, not the speedup).
- **Out of scope:** hand-writing Metal kernels or tuning IREE's `metal-spirv`
codegen to chase MLX.

## File map

| Path | Purpose |
Expand Down