Add Metal 4 M5 prefill optimizations#15

Draft
ivanfioravanti wants to merge 16 commits into antirez:main from ivanfioravanti:codex/metal4-m5-scaffold

Conversation

@ivanfioravanti
Contributor

Summary

  • enable M5-class Metal 4 MPP prefill paths for Q8_0 dense matmuls, attention-output low projection, and staged routed MoE projections
  • promote correctness-gated routed MoE boundaries: down from layer 2, gate/up from layer 13
  • add a fused six-expert routed MoE sum kernel for the common top-k=6 prefill shape
  • keep experimental probes and escape hatches for ablation (DS4_METAL_MPP_DISABLE=1, DS4_METAL_MOE_SUM6_DISABLE=1, staged routed MoE envs)

Benchmarks

Prompt source: README.md; command shape: ./ds4 --prompt-file <prompt> -n 1 --nothink --ctx 32768; 3 repeats per target.

| target tokens | standard non-M5 avg tok/s | M5 no-sum6 avg tok/s | M5 default avg tok/s | M5 speedup |
| --- | --- | --- | --- | --- |
| 512 | 236.97 | 390.29 | 389.47 | 1.64x |
| 2048 | 324.90 | 480.07 | 478.36 | 1.47x |
| 4096 | 289.97 | 454.41 | 455.52 | 1.57x |
| 8192 | 287.60 | 440.43 | 442.69 | 1.54x |
| 16384 | 284.26 | 427.62 | 427.78 | 1.50x |

8192-token routed MoE stage profile, current M5 default:

| stage | total ms | mean per layer ms | share |
| --- | --- | --- | --- |
| up | 2443.497 | 11.365 | 35.1% |
| gate | 2422.942 | 11.269 | 34.8% |
| down | 1727.114 | 8.033 | 24.8% |
| activation_weight | 142.693 | 0.664 | 2.1% |
| sum | 137.482 | 0.639 | 2.0% |
| map | 68.601 | 0.319 | 1.0% |
| gate_up | 12.389 | 0.058 | 0.2% |

Disabling routed MoE MPP on the same profile drops prefill from 442.33 tok/s to 359.17 tok/s and raises gate/up/down to about 18.4-18.7 ms per layer.

Validation

  • make ds4 ds4_test
  • ./ds4_test --metal-kernels
  • ./ds4_test --long-context
  • ./ds4_test --logprob-vectors

Notes

  • Earlier up boundaries with gate=13 were tested but failed long-context; gate/up stay paired at layer 13.
  • A paired MPP gate+up matmul prototype compiled and ran but was slower overall, so it was not included.
  • DS4_METAL_MOE_MID_F32=1 looked slightly faster in a noisy local check, but the result is too small to promote without a broader clean sweep.

@ivanfioravanti ivanfioravanti marked this pull request as ready for review May 8, 2026 19:28
@antirez
Owner

antirez commented May 9, 2026

Very significant speedup. I'm in the middle of a large refactoring so I can't merge right now, and there is the problem that I lack the hardware to maintain it later, but I'll see what I can do :) Thanks.

@ivanfioravanti
Contributor Author

100% with you, we need to sort this out and let you get an M5 😎

@ivanfioravanti
Contributor Author

This should solve #14

@antirez
Owner

antirez commented May 9, 2026

@ivanfioravanti potential idea to get this merged: we keep an m5-metal4 branch active, and you try to keep it rebased, if you like the idea. And we document it.

@ottaviofogliata

ottaviofogliata commented May 9, 2026

@ivanfioravanti, just jumping in :) In a couple of days I'll switch to an M5 Max 128GB. If it could be useful, I'd be more than happy to help you maintain the branch.

@ivanfioravanti
Contributor Author

ivanfioravanti commented May 9, 2026

Oh yes @ottaviofogliata join the club! I'm trying to squeeze even more juice with the various optimizations suggested in the Metal Performance Primitives (MPP) Programming Guide without luck so far.

@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 980ba1a to 68547f8 Compare May 9, 2026 22:31
@ivanfioravanti
Contributor Author

I squeezed a little more juice. Tomorrow I'll test with pi mono for some coding sessions, and I will also get server-side stats instead of client-side.

[image: comparison_speed_chart]

@ivanfioravanti ivanfioravanti marked this pull request as draft May 10, 2026 07:21
@ivanfioravanti
Contributor Author

Logits are slightly different from the ones produced with --quality. Converting to draft while I investigate: greedy decoding is identical, but the underlying distribution differs.

audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
This is a personal fork that combines two open upstream PRs (the
support-q8_0-token-embd loader PR I sent to antirez/ds4, and
ivanfioravanti's PR antirez#15 for M5 Metal 4 prefill optimizations) so I can
run unmodified stock-recipe DeepSeek-V4-Flash GGUFs on M5 hardware
before either PR lands upstream.

The README explains:

  * What the two combined PRs do and why they're combined here.
  * Verified test matrix on M5 Max (antirez recipe, cyberneurova stock
    recipe, cyberneurova pre-harmonized).
  * The known MPP F16 + cyberneurova interaction (workaround:
    DS4_METAL_MPP_F16_DISABLE=1) and why it's separate from the loader
    PR's scope.
  * Build / run instructions for both recipes.
  * Acknowledgements to antirez, ivanfioravanti, ggml/llama.cpp, and
    the cyberneurova research project.
  * Pointer back to the upstream README for the original design and
    server/CLI docs (no duplication).
@ucjonathan

ucjonathan commented May 10, 2026

If someone at Apple knew what was good for their hardware sales, they would have an M5 Studio 256GB and an M5 MacBook Pro 128GB on @antirez's desk Monday afternoon. Unfortunately I don't know anyone at Apple, but hopefully there are some developers at Apple watching this project who will wake up to this opportunity.

@ivanfioravanti
Contributor Author

@ucjonathan I was going to propose the same thing!


audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32
router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the
compressor APE path silently produces wrong output and prefill emits
garbage tokens (typically <BOS> spam after a few coherent tokens).

The prefill is correct; the bug is in two compressor APE consumers that
were updated to accept Q8_0 ape_type but couldn't read Q8_0 byte layout
correctly:

1. `kernel_cpy_q8_0_f32` Metal kernel (added in support-q8_0-token-embd
   for the prefill APE byte-strided dequant): produces silently wrong
   output on M5 Max for the compressor APE shapes (4 rows x 1024 cols).
   Replaced with a CPU-side dequant into a per-call private MTLBuffer.
   The CPU dequant matches gguf-py reference byte-for-byte (verified
   with a standalone numeric check); the Metal kernel did not.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store in
   metal/dsv4_kv.metal): only handled F16/F32 ape_type; Q8_0 fell into
   the F32 else branch and read garbage.  Add a Q8_0 branch that walks
   block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte
   block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for
each compressor invocation, not the shared g_compressor_store_ape_buffer:
multiple CPU writes to one shared buffer in the same command buffer
collapse to the last write at execute time (Metal kernels run in encode
order, but CPU writes don't participate in that ordering when the same
scratch is reused).  The per-call buffer is retained until cb completion
via addCompletedHandler because Metal does not strongly retain buffers
bound to encoders.

Changes:

  * ds4_metal.m: new `ds4_metal_half_bits_to_float` and
    `ds4_metal_cpu_dequant_q8_0_rows` helpers (verified against gguf-py
    `dequantize` reference); replace Q8_0 branches in
    `ds4_metal_encode_compressor_score_with_ape` and
    `ds4_metal_compressor_store_batch_tensor` with CPU-side dequant
    into per-call private buffers retained via addCompletedHandler.
  * metal/dsv4_kv.metal: add a Q8_0 branch to
    `kernel_dsv4_compressor_store_one`.
  * metal/cpy.metal: `kernel_cpy_q8_0_f32` is left in place but
    no longer reached from the compressor paths (its registration in
    ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:

  * make ds4-server clean.
  * Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e.
    no DS4_METAL_MPP_F16_DISABLE workaround):
    21-token prompt -> coherent generation
    ("An LLM, or Large Language Model, is a type of artificial intelligence").
    Previously this prompt generated "An LLM, or large language" then
    <BOS> token spam.
  * Pre-harmonized variant: still works byte-for-byte the same as
    before this change, no F16/F32 path regressions.
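The block_q8_0 layout the fix walks (a 2-byte FP16 scale followed by 32 int8 quants, 34 bytes per block) can be sketched host-side in Python. This is an illustrative reconstruction of the layout and the gguf-py dequantize convention (value = scale * quant), not the fork's actual code:

```python
import struct

BLOCK_BYTES = 34  # block_q8_0: 2-byte FP16 scale + 32 int8 quants

def dequant_q8_0(raw: bytes) -> list[float]:
    """Dequantize a row of block_q8_0 data (value = scale * int8 quant)."""
    assert len(raw) % BLOCK_BYTES == 0
    out: list[float] = []
    for off in range(0, len(raw), BLOCK_BYTES):
        (scale,) = struct.unpack_from("<e", raw, off)      # FP16 scale bits
        quants = struct.unpack_from("<32b", raw, off + 2)  # 32 signed bytes
        out.extend(scale * q for q in quants)
    return out
```

For example, a single block packed with scale 0.5 and quants 0..31 dequantizes to 0.0, 0.5, 1.0, ..., 15.5.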
@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 68547f8 to b703636 Compare May 10, 2026 14:54
audreyt added a commit to audreyt/ds4 that referenced this pull request May 10, 2026
…8_0 fixes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@audreyt

audreyt commented May 10, 2026

Forward-compat heads-up while you're rebasing: there's a latent compressor-APE interaction between this PR and Q8_0 ape ingestion (e.g. for stock-recipe GGUFs like cyberneurova's). On M5 Max:

  1. kernel_dsv4_compressor_store_one (in metal/dsv4_kv.metal, upstream) only handles F16/F32 ape — Q8_0 falls through to the F32 branch and reads garbage.
  2. A naive Q8_0 GPU dequant kernel for the prefill APE byte-strided path produces silently wrong output on M5 Max for compressor APE shapes (4 rows × 1024 cols). CPU-side dequant into a per-call private MTLBuffer is what works — a shared scratch loses writes when the same buffer is reused inside one command buffer (Metal kernels run in encode order; CPU writes don't participate in that ordering when the scratch is reused).

The bug only manifests when both your MPP work and Q8_0 ape branches are present, so nothing to fix in this PR in isolation. Just flagging in case #60 (the stock-recipe loader PR) lands — the fix at audreyt/ds4 m5-support-q8_0-token-embd (commit 79b08bb) would need to land alongside it. Happy to coordinate.

===
UPDATE: I made an abliterated+aligned imatrix v2 quant, so #60 is no longer in play.

@ivanfioravanti
Contributor Author

Thanks for flagging this. Currently I'm facing logprob drift in this PR compared to the standard Metal kernel and CPU; it's not big, but I'm trying to lower it as much as possible. I will then rebase onto main and start testing with a coding harness.
This model rocks and this engine too!

@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 0fe868d to 645bc1a Compare May 11, 2026 05:06
@ivanfioravanti ivanfioravanti marked this pull request as ready for review May 11, 2026 05:09
@ivanfioravanti
Contributor Author

Logprob drift fixed 🎉 Now I'm more confident from a mathematical perspective!
Today I'll test performance, run some evals and try to use this branch with coding agents and hermes. 💪

audreyt added a commit to audreyt/ds4 that referenced this pull request May 11, 2026
Pulls in ivanfioravanti's PR antirez#15 head 645bc1a (logprob drift fix) plus
the audreyt-side resolution work that keeps the cyberneurova
stock-recipe Q8_0 path working end-to-end:

  b3e601a Merge updated PR antirez#15 (logprob drift fix) into m5 branch
  267b200 fix(metal/moe): take ivan's moe.metal verbatim after PR antirez#15 merge
  cb50855 fix(metal): polymorphic dispatch for compressor KV/gate matmul

Validated on M5 Max:
  * cyberneurova Q2_K (stock-recipe Q8_0 compressors): coherent generation
  * antirez Q2_K (F16 compressors): ds4_test --metal-kernels,
    --long-context, --logprob-vectors all OK.

# Conflicts:
#	README.md
@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 645bc1a to c264f46 Compare May 11, 2026 14:36
@ivanfioravanti
Contributor Author

@audreyt I will port the MPP additions here to the new naming on main branch. I will test this PR like crazy after that and then rebase.

@ivanfioravanti ivanfioravanti marked this pull request as draft May 11, 2026 15:16
@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from c264f46 to ff2d499 Compare May 11, 2026 22:37
@ivanfioravanti
Contributor Author

We are getting there. I've used ds4-bench to make things easier to compare.

I don't like the MPP name (Metal Performance Primitives); let's use something more related to the new Tensor API, like Metal Tensor (mt).

There is a difference in logprob between the standard Metal kernel and this one, so I will now run evals on both to measure the impact. If this is OK, I will ask super @antirez how to proceed.

[image: ds4-bench-standard-vs-mpp]

audreyt added a commit to audreyt/ds4 that referenced this pull request May 12, 2026
Brings in Ivan's PR antirez#15 follow-ups (Tune Metal MPP defaults / Improve
MPP prefill throughput / Low-power Q8 profile) plus the ds4_test rename
fix, on top of the swival/m5 work that main already absorbed.

Conflict resolution:
- ds4_metal.m: drop main's older ds4_gpu_mpp_q8_0_partial_tiles_enabled
  in favor of m5's newer version from ff2d499 (handles low-power mode).
- README.md: keep HEAD (the Abliterated fork narrative); m5's README is
  the original DwarfStar 4 readme with Swival's M5 narrative appended.
audreyt added a commit to audreyt/audreyt that referenced this pull request May 13, 2026
The cyberneurova abliterated weights have been re-quantised into the
IQ2XXS-w2Q2K imatrix recipe with F16 token embedding so the stock ds4 main
loader takes them directly — no support-q8_0-token-embd workaround needed.
PR #60 has been closed accordingly.

The §3.1 "audreyt/ds4 main carries (a)+(b)+(c)" list collapses: (a) the
support-q8_0-token-embd loader work and (c) the cyber compressor compat fix
that paired with it are no longer fork differentiators. Only (b)
ivanfioravanti's M5 prefill optimisation (antirez/ds4#15) remains, and that
itself converges once #15 lands upstream.

Also trim the DS4_SUPPORT_BRANCH note that pitched support-q8_0-token-embd
as a fallback branch — that branch now offers strictly less than main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ivanfioravanti
Contributor Author

ivanfioravanti commented May 13, 2026

I did a full AIME2025 evaluation round of standard metal kernel vs tensor metal with max context 16K and I've got:

  • 81.5% Standard
  • 73.3% Tensor

I'll keep this PR aligned with main and keep reducing the logprob diff while ensuring a performance boost at the same time. 💪

@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 8664f51 to 1a93973 Compare May 13, 2026 19:58
ivanfioravanti and others added 16 commits May 13, 2026 22:04
Raise the default Metal prefill chunk to 4096 and reuse the range-capable layer-major prefill graph for chunked ranges.

Enable the guarded Q8_0 attn_q_b MPP route for <=2048-token prompt batches, dynamic Q8_0 tile width, the routed-MoE fast layout from layer 0, and the RB16 indexed decode path.

M5 Max post-patch ds4-bench profile with 64 generated tokens: prompt 443/459/522/486/465 t/s and generation 38.6/38.2/37.6/34.0/33.6 t/s at 0.5k/1k/2k/4k/8k.

Tests: make all ds4_test; make test; git diff --check.
Detect macOS Low Power Mode and widen the Q8_0 prefill MPP route only under that condition, while preserving the guarded default for normal-power runs and explicit Q8_0 filters.

Low-power M5 Max baseline vs patched auto with 128 generated tokens:

0.5k: prefill 133.46 -> 196.89 t/s, gen 13.53 -> 15.08 t/s
1k: prefill 118.65 -> 188.91 t/s, gen 12.23 -> 14.93 t/s
2k: prefill 130.90 -> 220.33 t/s, gen 11.02 -> 14.65 t/s
4k: prefill 118.09 -> 212.81 t/s, gen 13.25 -> 14.00 t/s
8k: prefill 185.52 -> 206.49 t/s, gen 12.94 -> 13.84 t/s

Tests: make all ds4_test; make test; DS4_METAL_MPP_LOW_POWER_DISABLE=1 ./ds4_test --metal-mpp-equivalence; git diff --check.
Carries forward the pending "MPP -> Metal Tensor" naming refactor and adds:
- --dump-logits FILE CLI flag and run_logits_dump() so prefill-time logits
  can be captured for A/B drift comparison.
- bench/compare_logit_drift.py + bench/compare_bench.py + run helper.
- Macro plumbing in ds4_metal.m's library compile step for five env-gated
  drift flags (DS4_METAL_HC_STABLE default-on, DS4_METAL_NORM_RSQRT_DISABLE
  default-on, DS4_METAL_KV_RAW_F32 default-off, DS4_METAL_ROPE_EXP2_LOG2
  default-off, DS4_METAL_TENSOR_MATMUL_DISABLE default-off).
- Logs the active flag set on first device init so test runs are
  self-documenting.

Per-kernel changes that consume each macro land in follow-up commits so
they can be reverted independently if a drift measurement regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
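The *_DISABLE / *_ENABLE convention these flags follow (default-on flags are switched off with `<NAME>_DISABLE=1`, default-off flags are switched on with `<NAME>=1` or `<NAME>_ENABLE=1`) could be resolved with a helper along these lines. `env_gate` is a hypothetical name for illustration, not a function in ds4_metal.m:

```python
import os

def env_gate(name: str, default_on: bool) -> bool:
    """Hypothetical helper: resolve a DS4_METAL_* drift flag against the
    environment, mirroring the *_DISABLE / *_ENABLE naming convention."""
    if default_on:
        # default-on flags are turned off with <NAME>_DISABLE=1
        return os.environ.get(name + "_DISABLE") != "1"
    # default-off flags are turned on with <NAME>=1 or <NAME>_ENABLE=1
    return os.environ.get(name) == "1" or os.environ.get(name + "_ENABLE") == "1"
```

So with no environment set, a default-on flag like DS4_METAL_HC_STABLE resolves to enabled and a default-off flag like DS4_METAL_KV_RAW_F32 resolves to disabled.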
The HC=4 and scalar Sinkhorn split paths use 1/(1+exp(-z)) directly,
which overflows when z is sufficiently negative (exp(-z) explodes).
M5 Max's faster ALU is more likely than M3/M4 to push HC mixer inputs
into that regime upstream, so the latent fragility may surface as
logprob drift on M5 only.

Replaces 1/(1+exp(-z)) with the identity 0.5*tanh(0.5*z) + 0.5 and
2/(1+exp(-z)) with 1 + tanh(0.5*z). Bounded across the full float
range. The iter-0 vs iter-1+ epsilon application difference is left
intact -- it is mirrored identically in the scalar reference path and
appears to be an intentional Sinkhorn warm-up.

Gated by DS4_METAL_HC_STABLE so the historical form can be A/B'd.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
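The identity behind this fix is easy to check host-side: 1/(1+exp(-z)) requires exp(-z), which explodes for very negative z, while 0.5*tanh(0.5*z)+0.5 is bounded everywhere. A Python sketch of the math only, not the Metal kernel:

```python
import math

def sigmoid_naive(z: float) -> float:
    # exp(-z) explodes when z is sufficiently negative
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_stable(z: float) -> float:
    # identity: 1/(1+exp(-z)) == 0.5*tanh(0.5*z) + 0.5, bounded for all finite z
    return 0.5 * math.tanh(0.5 * z) + 0.5

def two_sigmoid_stable(z: float) -> float:
    # the commit's other identity: 2/(1+exp(-z)) == 1 + tanh(0.5*z)
    return 1.0 + math.tanh(0.5 * z)
```

For moderate z the two forms agree to machine precision; at z = -800 the naive form raises OverflowError in Python (in float32 Metal, exp(-z) overflows instead) while the stable form returns 0.0.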
…ult on)

kernel_rms_norm_fuse_impl uses 1.0f/sqrt(mean+eps); the fused
kernel_dsv4_qkv_rms_norm_f32_4 was using rsqrt(...) for the same value.
Apple Silicon's hardware rsqrt has implementation-defined precision and
can differ from 1.0f/sqrt by ~1 ULP. Across the 43 layers of DeepSeek
V4 Flash that per-layer ULP drift compounds visibly, and the rounding
gap between rsqrt and div+sqrt isn't guaranteed to match between M3/M4
and M5 hardware families.

Switch the fused QKV norm to 1.0f/sqrt(...) so both norm kernels share
a single formula. Gated by DS4_METAL_NORM_RSQRT_DISABLE so the rsqrt
path can be A/B'd.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kernel_dsv4_kv_fp8_store_f32 deliberately writes the raw cache row as
(float)((half)q) so its precision matches the half-typed FlashAttention
KV buffer the indexer references. With DS4_METAL_KV_RAW_F32 set, the
half cast is skipped and the FP8-dequantized FP32 value is written
verbatim.

This is diagnostic only: enabling it makes the indexer see higher-
precision values than FlashAttention, which is a deliberate mismatch
that reveals how much drift the FP16 quantization contributes but is
not safe to ship. Default off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Metal's pow(freq_base, k) is not IEEE-754 strict and the rounding can
differ between GPU families. With DS4_METAL_ROPE_EXP2_LOG2 set, the
RoPE angle is computed as exp2(k * log2(freq_base)) instead, using two
primitives with tighter precision specifications.

The change touches both the NeoX and default RoPE branches of
kernel_dsv4_rope_tail_f32. Default off -- this is a diagnostic to
quantify how much RoPE pow precision contributes to logprob drift on
M5 Max relative to M3/M4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
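The underlying identity is pow(b, k) == exp2(k * log2(b)). A quick host-side check of the identity itself (Python floats are double precision, so this illustrates the equivalence, not Metal's per-family rounding behavior):

```python
import math

def angle_pow(freq_base: float, k: float) -> float:
    # direct pow form, as in the original RoPE kernel
    return freq_base ** k

def angle_exp2(freq_base: float, k: float) -> float:
    # exp2(log2) form used under DS4_METAL_ROPE_EXP2_LOG2
    return 2.0 ** (k * math.log2(freq_base))
```

With freq_base = 10000 and a typical negative exponent, the two forms agree to within a few ULP in double precision; the point of the Metal flag is that exp2/log2 carry tighter precision specifications on the GPU than pow.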
When the macro un-defines DS4_METAL_HAS_TENSOR at library compile time
the cooperative-tensor _mpp kernel templates are no longer in the
library, but g_metal4_tensor_api_enabled was still truthy so the host
dispatch layer kept attempting to fetch them. The result was a flood
of "Metal kernel kernel_mul_mm_*_mpp_* function not found" warnings on
the legacy fallback path.

Flip g_metal4_tensor_api_enabled = 0 inside the same branch so the
host code's ds4_gpu_use_mpp_*() and ds4_gpu_*_mpp_tensor() guards see
the disabled state and skip _mpp lookups entirely.

Measured on M5 Max with the short reasoning prompt: drift between
-mt off and DS4_METAL_TENSOR_MATMUL_DISABLE=1 -mt auto is now exactly
zero (rms=0, max_abs=0, max_rank_delta=0), confirming that the M5 Max
logprob drift is sourced entirely in the Metal 4 cooperative-tensor
matmul codepath and not in HC, norm, RoPE, or KV.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bisecting the M5 Max logprob drift on -mt auto:
- -mt off baseline: reference
- -mt auto (all routes): rms=0.150, max_abs=0.750, top20=0.263
- -mt auto + DS4_METAL_MPP_Q8_0_DISABLE=1: rms=0, max_abs=0 (exact)
- -mt auto + DS4_METAL_MPP_F16_DISABLE=1: still rms=0.150 (no help)
- -mt auto + DS4_METAL_MPP_ATTN_OUT_DISABLE=1: still rms=0.150
- -mt auto + DS4_METAL_MPP_MOE_{GATE,UP,DOWN}_DISABLE=1: still rms=0.150

The Metal 4 cooperative-tensor Q8_0 matmul (kernel_mul_mm_q8_0_f32_mpp
and direct_rhs variants in dense.metal) is the *sole* drift source on
M5 Max vs the legacy simdgroup_multiply_accumulate path. The other
Tensor routes (F16 compressor, attention-output low projection, routed
MoE gate/up/down) are bit-clean against -mt off.

Flip ds4_gpu_mpp_q8_0_default_target() to return 0 when the device
name contains "M5". Other Tensor routes continue to default on, so the
Q8_0 carve-out preserves the bulk of the Metal Tensor speedup (F16
compressor at layers 0-19, MoE at layers 20+, attn-out at layers 32-42).
Users who care more about prefill throughput than bit-equivalence can
opt back in with DS4_METAL_MPP_Q8_0_ENABLE=1.

Verified on M5 Max with default flags only: -mt auto now produces
exactly the -mt off logits (rms=0, max_abs=0, max_rank_delta=0,
same_top1=yes, top5_overlap=5/5, top20_overlap=20/20).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MTLCompileOptions.fastMathEnabled defaults to YES and Apple's headers
explicitly note this "may violate the IEEE 754 standard". With safe
math forced via MTLMathModeSafe (macOS 15+) or fastMathEnabled=NO
(deprecated fallback), drift between -mt off and -mt auto on M5 Max
shrinks ~4x (rms 0.150 -> 0.037, max_abs 0.75 -> 0.19) -- showing
that fast-math optimizations applied differently across the two
hardware paths were amplifying the underlying matmul2d divergence.

Default OFF: enabling safe math also moves -mt off away from the
fast-math production reference (rms=0.63 vs original fast-math
baseline) so it isn't a drop-in fix. Useful as a diagnostic to
localize remaining drift sources and as an option for users who
prefer strict IEEE-754 semantics over fast-math speed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit (75f0930) added the M5 carve-out by editing
ds4_gpu_mpp_q8_0_default_target(), but that helper was also being
reused as the default-target for ds4_gpu_use_mpp_f16_compressor_matmul
(line 1363) and for the verbose memory-report banner that prints
mpp_f16 (line 2102). That coupled F16 compressor default-on/off to
the Q8 carve-out, which is wrong: the per-route bisection showed F16
is bit-clean on M5; only Q8 needed to flip default-off.

Introduce a dedicated ds4_gpu_mpp_f16_default_target() that always
returns 1 and use it at the two F16 call sites. The Q8 helper keeps
its M5 carve-out unchanged.

Verified on M5 Max with default flags: -mt auto still produces zero
drift vs -mt off (rms=0, max_abs=0, max_rank_delta=0), and the F16
compressor Tensor route is now back to default-on on M5 as intended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_metal_q8_0_mpp_matmul_case() built the reference output by
calling ds4_gpu_matmul_q8_0_tensor() after ds4_gpu_set_quality(false).
The set_quality(false) call enables MPP routing, and the dispatcher at
ds4_metal.m:6277 then routes to ds4_gpu_matmul_q8_0_mpp_tensor() when
the MPP can_use gate passes. So on M5 with Metal 4 tensor API enabled,
the "reference" was actually the MPP output, and the test compared
the MPP kernel to itself -- the max_abs/rms numbers were always near
zero and any divergence in the MPP kernel itself would not have been
caught.

Force ds4_gpu_set_quality(true) around the reference call so the
dispatcher takes the legacy simdgroup_multiply_accumulate path, then
restore set_quality(false) before invoking ds4_gpu_matmul_q8_0_mpp_tensor()
directly for the candidate. The reference and candidate now exercise
the two different code paths the test was originally meant to compare.

Verified on M5 Max: ./ds4_test --metal-kernels still passes, meaning
the M5 cooperative-tensor Q8 matmul agrees with the legacy path within
the 0.10 max-abs kernel target on the test shapes. The systemic drift
in -mt auto comes from many small matmul deltas compounding through
43 layers, not from any single kernel exceeding the per-call threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bers

Two corrections triggered by another reviewer's audit:

1. The auto-suite description claimed "auto enables Q8_0 prefill ...";
   on M5 that is no longer true now that 75f0930 defaults Q8_0 Tensor
   off on M5. Reword the section so it lists F16 compressor, attn-out,
   and MoE as the auto-enabled routes, then call out the M5 carve-out
   for Q8_0 explicitly with the env-var opt-in.

2. Refresh worst-case suite numbers measured on the current branch
   (codex/metal4-m5-drift-patches after the F16-coupling fix 78fa48f
   and the test-self-reference fix 580e896) on M5 Max:

       worst_rms = 0.169   (was documented ~= 0.170)
       worst_top20_max_abs = 0.306   (was documented ~= 0.342)
       worst_max_abs = 0.922
       min_top5_overlap = 5/5
       min_top20_overlap = 20/20 (was 19/20)
       worst_rank_delta = 1

   Three short fixtures (short_italian_fact, short_code_completion,
   short_reasoning_plain) are now bit-exact (rms=0); the residual
   drift is concentrated on the two long-context fixtures and comes
   from the F16 compressor, attention-output, and routed-MoE Tensor
   routes still being default-on, compounding small per-matmul
   deltas through 43 layers.

The Q8_0 isolation paragraph also picks up the M5 default-off note
so the env-var docs stay consistent with the runtime behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ivanfioravanti ivanfioravanti force-pushed the codex/metal4-m5-scaffold branch from 1a93973 to 49c1137 Compare May 13, 2026 20:06
@ivanfioravanti
Contributor Author

Status update

Quality perspective

Now the Tensor Metal kernel and the standard Metal kernel are very similar from a mathematical perspective and equivalent in generation; here is the comparison of each vs --quality.

Tensor Metal:
short_italian_fact same_top1=yes top20=20/20 rms=0.370521 top20_abs=0.659939
short_code_completion same_top1=yes top20=19/20 rms=0.463208 top20_abs=2.24006
short_reasoning_plain same_top1=yes top20=18/20 rms=0.618172 top20_abs=1.98223
long_memory_archive same_top1=yes top20=19/20 rms=0.578821 top20_abs=1.10567
long_code_audit same_top1=yes top20=18/20 rms=0.428341 top20_abs=1.04939

Default Metal (Main branch)
short_italian_fact same_top1=yes top20=20/20 rms=0.370521 top20_abs=0.659939
short_code_completion same_top1=yes top20=19/20 rms=0.463208 top20_abs=2.24006
short_reasoning_plain same_top1=yes top20=18/20 rms=0.618172 top20_abs=1.98223
long_memory_archive same_top1=no top20=19/20 rms=0.380069 top20_abs=0.692255
long_code_audit same_top1=yes top20=16/20 rms=0.530931 top20_abs=1.38902

Here is the test against the other M5 implementation, Swival's ds4-m5:
short_italian_fact same_top1=yes top20=19/20 rms=0.355529 top20_abs=1.20708
short_code_completion same_top1=yes top20=16/20 rms=0.689058 top20_abs=1.58729
short_reasoning_plain same_top1=yes top20=20/20 rms=0.556018 top20_abs=1.50008
long_memory_archive same_top1=yes top20=16/20 rms=0.510137 top20_abs=1.65245
long_code_audit same_top1=yes top20=17/20 rms=0.612309 top20_abs=1.23852
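Metrics like the ones above can be computed from two logit vectors with a small host-side sketch (assuming plain Python lists; the actual harness is bench/compare_logit_drift.py, whose exact definitions may differ, e.g. top20_abs here is restricted to the top-20 token set):

```python
def drift_metrics(ref: list[float], cand: list[float], k: int = 20) -> dict:
    """Compare candidate logits against reference logits."""
    diffs = [c - r for r, c in zip(ref, cand)]
    n = len(diffs)

    def top(v: list[float]) -> list[int]:
        # token indices sorted by descending logit
        return sorted(range(len(v)), key=v.__getitem__, reverse=True)

    ref_top, cand_top = top(ref), top(cand)
    return {
        "rms": (sum(d * d for d in diffs) / n) ** 0.5,
        "max_abs": max(abs(d) for d in diffs),
        "same_top1": ref_top[0] == cand_top[0],
        f"top{k}_overlap": len(set(ref_top[:k]) & set(cand_top[:k])),
    }
```

Identical vectors score rms=0, max_abs=0, same_top1=yes and a full top-k overlap, which is the bit-exact case reported earlier in the thread.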

From an eval perspective, the new (slower) Tensor Metal kernel has improved AIME2025, reaching 90% on Q2 (unbelievable!)
[image: AIME2025 results]

Performance

Here the situation is clearly different from the previous version: there is still a benefit, but more in the range of ~15%.
There is room for improvement and I'll keep pushing on this, but with a strict bar on logprob drift.

[image: ds4_bench_standard_quality_tensor_128]

@antirez I'll leave this as a draft and keep rebasing on main; I'll remove the drift-prone Q8_0 Tensor route and all related tests, docs, etc.

When you get your M5 Max, gifted by super @audreyt, I'm sure you'll be able to do even more.
