diff --git a/docs/adr/0004-compute-backend-session-seam-and-stablehlo-family.md b/docs/adr/0004-compute-backend-session-seam-and-stablehlo-family.md
index 3de191cf..44d04023 100644
--- a/docs/adr/0004-compute-backend-session-seam-and-stablehlo-family.md
+++ b/docs/adr/0004-compute-backend-session-seam-and-stablehlo-family.md
@@ -79,6 +79,17 @@ Consequences and refinements:
 
 This ADR stays Proposed; it moves to Accepted once the OpenXLA backend integrates behind the #448 session contract (Phase 3) and validates the full contract.
 
+### Low-precision performance decision (2026-07)
+
+Phase 3 landed the OpenXLA backend on macOS (Metal) and profiled it against MLX. On an M1 Ultra (Llama-3.2-1B, greedy) XLA-on-Metal runs about 117x slower than MLX. Profiling with `iree-benchmark-module` (pure runtime, no host glue) shows the Metal decode step is GPU-kernel-bound, not host-bound: a 13-thread CPU beats the Metal GPU on the same StableHLO graph, so the bottleneck is IREE's `metal-spirv` kernel codegen, not invoke overhead or bandwidth.
+
+That split of the performance levers decides where the project invests:
+
+- Graph-level precision and quantization, authored in the StableHLO graph, transfer to every IREE target, including future NPUs, for which low precision is the entry ticket rather than a 2x optimization. f16 / bf16 is landed (#514, #515): about 1.9x on Metal, token-exact, and it speeds up the CPU path too. This is in scope.
+- Per-backend kernel codegen (the remaining ~50x to MLX) is upstream IREE's responsibility, is Metal-specific (it does not transfer to non-SPIR-V NPUs), and MLX already owns Apple-Silicon performance. Out of scope.
+
+int8 / int4 weight quantization (#516) is the NPU lever, but its payoff is memory bandwidth: a compute-bound Metal decode cannot demonstrate it, and measuring it needs an actual NPU. It is deferred to a hardware-gated follow-up; on Metal only its token-exactness would be verifiable. Metal's absolute throughput is therefore a pessimistic proxy for an NPU, which brings its own optimized kernels.
+
 ## Consequences
 
 - The `ComputeBackend` trait from PR #446 is reworked from a load-boundary contract returning `LoadedModel` into a session-engine contract. The selection skeleton (`select_backend`, the `Backend` enum, the `experimental-backend` feature gate) survives the rework; only the contract shape changes.
diff --git a/src/lib/mlxcel-xla/README.md b/src/lib/mlxcel-xla/README.md
index 0d4f6490..5aa9df59 100644
--- a/src/lib/mlxcel-xla/README.md
+++ b/src/lib/mlxcel-xla/README.md
@@ -124,12 +124,12 @@ The contraction (matmul) input precision is authored in the emitted StableHLO
 graph, so it applies on every IREE target (CPU, CUDA, Metal, and future NPUs),
 not just one backend:
 
-- `f16` / `bf16` — demote the f32 inputs of every `dot_general` to the narrow
+- `f16` / `bf16`: demote the f32 inputs of every `dot_general` to the narrow
   type while keeping the f32 accumulate and output, so only the matmuls change and
   the sensitive elementwise ops (norm, softmax, RoPE) stay f32. A blanket
   program-wide f32 to f16 is deliberately not done (it regressed accuracy and was
   slower).
-- `f32` — no demotion.
+- `f32`: no demotion.
 
 **Default is per device:** `f16` on the GPU devices (`metal`, `cuda`), `f32` on
 the CPU (`local-task` / `local-sync`). `MLXCEL_XLA_PRECISION` (`f16` | `bf16` |
@@ -272,6 +272,58 @@ Not every family bundles goldens: Qwen2.5 (`assets/qwen2.5-0.5b/`) is emitted at
 load and covered by the emitter's structural tests plus the execution tier, with
 no committed `.mlir`. The harness still drives its emit through `emit_graphs`.
 
+## Performance and the low-precision decision
+
+This backend is a portability / parity path, not a performance path. On Apple
+Silicon, MLX is and remains the production backend; XLA runs the same StableHLO
+graphs that target CUDA/Linux on a Mac, for development and cross-checking.
+
+### Measured (M1 Ultra, Llama-3.2-1B-Instruct, greedy)
+
+| | Metal | CPU (`local-task`) |
+|---|---|---|
+| one decode step, f32 | ~600 ms | ~233 ms |
+| one decode step, f16 | ~291 ms (~2.1x) | ~187 ms (~1.25x) |
+| end-to-end, f32 | ~1.5 tok/s | ~0.75 tok/s |
+| end-to-end, f16 | ~3.0 tok/s | - |
+| MLX (reference) | ~186 tok/s | - |
+
+The decode-step figures are `iree-benchmark-module` (pure runtime, no host glue).
+
+### Where the time goes
+
+The Metal decode step is ~600 ms with the GPU busy the whole time (its host-side
+`process_time` is ~50 ms), so the time is in the GPU kernels, not invoke overhead
+or host round-trips. On the **same** StableHLO graph a 13-thread CPU (~233 ms)
+beats the Metal GPU (~600 ms), which is the tell: the bottleneck is IREE's
+`metal-spirv` kernel codegen (generic, unfused MSL via SPIRV-Cross), not
+bandwidth. ~600 ms is ~110x the ~5 ms/token bandwidth floor MLX runs at.
+
+### What is and is not worth optimizing
+
+- **Graph-level (precision, quantization, op selection): in scope, transferable.**
+  Authored once in the portable graph, it helps every IREE target (CPU, CUDA,
+  Metal, future NPUs). f16 (landed) is ~1.9x and token-exact, and speeds up the
+  CPU path too. For NPUs, low precision / quantization is not a 2x optimization
+  but the entry ticket (they are int8 / fp16 native). This is where investment
+  pays off.
+- **Per-backend kernel codegen (the remaining ~50x to MLX): out of scope.** That
+  is upstream IREE's job, is Metal-specific (does not transfer to non-SPIR-V
+  NPUs), and MLX already owns Apple-Silicon performance.
+
+So Metal's absolute tok/s is a *pessimistic* proxy for an NPU, which brings its
+own optimized kernels; what transfers is the graph, not the Metal tuning.
+
+### Decision (2026-07)
+
+- **In scope:** graph-level low precision. f16 / bf16 is landed. int8 / int4
+  weight quantization is the NPU lever, but its payoff is memory bandwidth, which
+  a compute-bound Metal decode cannot demonstrate and which needs an actual NPU to
+  measure; it is **deferred to a hardware-gated follow-up** (on Metal only its
+  token-exactness would be verifiable, not the speedup).
+- **Out of scope:** hand-writing Metal kernels or tuning IREE's `metal-spirv`
+  codegen to chase MLX.
+
 ## File map
 
 | Path | Purpose |