docs: OpenXLA performance table and the transferable-precision decision record (#449) #562
Merged
…on record (#449)
Document where the OpenXLA decode time goes and why the epic invests in graph-level precision, not Metal kernels.
Adds a "Performance and the low-precision decision" section to the mlxcel-xla README (the measured f32/f16 decode-step and end-to-end table across Metal and CPU with MLX as reference; the profiling finding that the Metal decode is GPU-kernel-bound, with a 13-thread CPU beating the Metal GPU on the same graph; the transferable graph-level vs non-transferable per-backend-codegen split) and a dated decision-record addendum to ADR 0004.
Records the decision: graph-level low precision is in scope (f16/bf16 landed in #514/#515); int8/int4 weight quantization (#516) is the NPU lever but its bandwidth payoff cannot be shown on a compute-bound Metal decode and needs an actual NPU to measure, so it is deferred to a hardware-gated follow-up; hand-tuning IREE's metal-spirv codegen to chase MLX is out of scope.
Docs-only.
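The reasoning behind deferring weight quantization can be illustrated with a roofline-style lower bound on step time. This is a minimal sketch, not code from the PR: the function names are hypothetical, and every number below is a made-up placeholder chosen only to show the shape of the argument, not a measurement from the README table.

```python
def step_time_s(flops, bytes_moved, eff_flops_per_s, bandwidth_bytes_per_s):
    """Roofline-style lower bound: a step takes at least as long as the
    slower of its compute phase and its memory-traffic phase."""
    compute_s = flops / eff_flops_per_s
    memory_s = bytes_moved / bandwidth_bytes_per_s
    return max(compute_s, memory_s)


def bound_kind(flops, bytes_moved, eff_flops_per_s, bandwidth_bytes_per_s):
    """Classify which phase dominates under the same simple model."""
    compute_s = flops / eff_flops_per_s
    memory_s = bytes_moved / bandwidth_bytes_per_s
    return "compute" if compute_s >= memory_s else "memory"


# Placeholder workload (NOT measured): FLOPs and weight bytes per decode step.
FLOPS = 2e9
BYTES_F16 = 2e9
BYTES_INT4 = BYTES_F16 / 4  # 4-bit weights move a quarter of the f16 bytes

# Placeholder device A: low effective compute throughput, so compute dominates.
slow_f16 = step_time_s(FLOPS, BYTES_F16, 1e11, 2e11)
slow_int4 = step_time_s(FLOPS, BYTES_INT4, 1e11, 2e11)

# Placeholder device B: high effective compute throughput, so memory dominates.
fast_f16 = step_time_s(FLOPS, BYTES_F16, 1e12, 2e11)
fast_int4 = step_time_s(FLOPS, BYTES_INT4, 1e12, 2e11)

if __name__ == "__main__":
    print("device A:", bound_kind(FLOPS, BYTES_F16, 1e11, 2e11), slow_f16, slow_int4)
    print("device B:", bound_kind(FLOPS, BYTES_F16, 1e12, 2e11), fast_f16, fast_int4)
```

Under this model, shrinking the bytes moved leaves the step time unchanged while the compute term dominates, which is why a bandwidth-oriented change such as int8/int4 weights would need a memory-bound target to show its payoff.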
Closes #517. Part of #513.
What
Docs for the OpenXLA performance profiling and the transferable-precision decision.
mlxcel-xla/README.md: a "Performance and the low-precision decision" section covering the measured f32/f16 decode-step and end-to-end table (Metal + CPU, MLX as reference); the profiling finding that the Metal decode is GPU-kernel-bound (a 13-thread CPU beats the Metal GPU on the same StableHLO graph, so the cause is IREE metal-spirv codegen, not host/invoke overhead or bandwidth); the transferable graph-level vs non-transferable per-backend-codegen split; and the decision.
docs/adr/0004-...: a dated decision-record addendum.
Decision recorded
Docs-only; no code change.