docs: OpenXLA performance table and the transferable-precision decision record (#449) #562

Merged: inureyes merged 1 commit into main from feat/517-xla-perf-docs on Jul 1, 2026


@inureyes (Member) commented Jul 1, 2026

Closes #517. Part of #513.

What

Docs for the OpenXLA performance profiling and the transferable-precision decision.

  • mlxcel-xla/README.md: a "Performance and the low-precision decision" section covering (1) the measured f32/f16 decode-step and end-to-end table (Metal and CPU, with MLX as reference); (2) the profiling finding that the Metal decode is GPU-kernel-bound: a 13-thread CPU beats the Metal GPU on the same StableHLO graph, which points to IREE metal-spirv codegen rather than host/invoke overhead or bandwidth; (3) the split between transferable graph-level work and non-transferable per-backend codegen; and (4) the decision.
  • docs/adr/0004-...: a dated decision-record addendum.
  • Drive-by: removed two em dashes that the Precision section added in #514 (feat: f16/bf16 precision mode for the OpenXLA emitter and resident weights (#449)) left in the README.

Decision recorded

Docs-only; no code change.

@inureyes added the type:docs, area:inference, and platform:macos labels Jul 1, 2026
…on record (#449)

Document where the OpenXLA decode time goes and why the epic invests in graph-level precision, not Metal kernels. Adds a "Performance and the low-precision decision" section to the mlxcel-xla README (the measured f32/f16 decode-step and end-to-end table across Metal and CPU with MLX as reference; the profiling finding that the Metal decode is GPU-kernel-bound, with a 13-thread CPU beating the Metal GPU on the same graph; the transferable graph-level vs non-transferable per-backend-codegen split) and a dated decision-record addendum to ADR 0004.

Records the decision: graph-level low precision is in scope (f16/bf16 landed in #514/#515); int8/int4 weight quantization (#516) is the NPU lever but its bandwidth payoff cannot be shown on a compute-bound Metal decode and needs an actual NPU to measure, so it is deferred to a hardware-gated follow-up; hand-tuning IREE's metal-spirv codegen to chase MLX is out of scope. Docs-only.
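The bandwidth argument for int8/int4 can be made concrete with back-of-envelope arithmetic. The sketch below assumes a hypothetical 1-billion-parameter model whose weights are each read once per decode step; the parameter count and the helper name `weight_bytes_per_step` are illustrative and do not come from this PR. It also ignores the extra bytes that quantization scales and zero points would add.

```python
# Bytes occupied by one weight element at each storage precision.
BYTES_PER_ELEMENT = {
    "f32": 4.0,
    "f16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_bytes_per_step(num_params: int, precision: str) -> float:
    """Bytes of weight data read per decode step, assuming every
    weight is read exactly once (hypothetical simplification)."""
    return num_params * BYTES_PER_ELEMENT[precision]


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 1_000_000_000

for precision in ("f32", "f16", "int8", "int4"):
    gib = weight_bytes_per_step(NUM_PARAMS, precision) / 2**30
    print(f"{precision:>4}: {gib:.2f} GiB of weights read per decode step")
```

Lower precision shrinks the bytes moved per step, but that only shortens the step when the decode is limited by memory bandwidth. On a compute-bound Metal decode the saving does not show up in wall-clock time, which is why the measurement is deferred to real NPU hardware.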
@inureyes force-pushed the feat/517-xla-perf-docs branch from 583cf99 to ef5f58d July 1, 2026 03:32
@inureyes merged commit 4e770ec into main Jul 1, 2026
5 checks passed
@inureyes deleted the feat/517-xla-perf-docs branch July 1, 2026 03:33

Development

Successfully merging this pull request may close these issues.

docs: OpenXLA performance profiling, perf table, and the transferable-precision decision record (#449)