Add Metal 4 M5 prefill optimizations #15
Conversation
|
Very significant speedup. I'm in the middle of a large refactoring so I can't merge right now, and there is the problem that I lack the hardware to maintain it later, but I'll see what I can do :) Thanks. |
|
100% with you, we need to sort this out and let you get an M5 😎 |
|
This should solve #14 |
|
@ivanfioravanti potential idea to get this merged: we keep an m5-metal4 branch active, and you try to keep it rebased, if you like the idea. And we document it. |
|
@ivanfioravanti, just jumping in :) In a couple of days I'll switch to an M5 Max 128GB. If it could be useful, I'd be more than happy to help you maintain the branch. |
|
Oh yes @ottaviofogliata join the club! I'm trying to squeeze even more juice with the various optimizations suggested in the Metal Performance Primitives (MPP) Programming Guide without luck so far. |
Force-pushed from 980ba1a to 68547f8
|
Logits are slightly different from the ones created with --quality. Converting to draft while I investigate: greedy decoding is perfect, but the distribution behind it is different. |
This is a personal fork that combines two open upstream PRs (the support-q8_0-token-embd loader PR I sent to antirez/ds4, and ivanfioravanti's PR antirez#15 for M5 Metal 4 prefill optimizations) so I can run unmodified stock-recipe DeepSeek-V4-Flash GGUFs on M5 hardware before either PR lands upstream. The README explains:

* What the two combined PRs do and why they're combined here.
* Verified test matrix on M5 Max (antirez recipe, cyberneurova stock recipe, cyberneurova pre-harmonized).
* The known MPP F16 + cyberneurova interaction (workaround: DS4_METAL_MPP_F16_DISABLE=1) and why it's separate from the loader PR's scope.
* Build / run instructions for both recipes.
* Acknowledgements to antirez, ivanfioravanti, ggml/llama.cpp, and the cyberneurova research project.
* Pointer back to the upstream README for the original design and server/CLI docs (no duplication).
|
If someone at Apple knew what was good for their hardware sales, they would have an M5 Studio 256GB and an M5 MacBook Pro 128GB on @antirez's desk Monday afternoon. Unfortunately I don't know anyone at Apple, but hopefully there are some developers at Apple watching this project who will wake up to this opportunity. |
|
@ucjonathan I was going to propose the same thing! |
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32 router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the compressor APE path silently produces wrong output and prefill emits garbage tokens (typically <BOS> spam after a few coherent tokens). The prefill code itself is correct; the bug is in two compressor APE consumers that were updated to accept Q8_0 ape_type but couldn't read the Q8_0 byte layout correctly:

1. kernel_cpy_q8_0_f32 Metal kernel (added in support-q8_0-token-embd for the prefill APE byte-strided dequant): produces silently wrong output on M5 Max for the compressor APE shapes (4 rows x 1024 cols). Replaced with a CPU-side dequant into a per-call private MTLBuffer. The CPU dequant matches the gguf-py reference byte-for-byte (verified with a standalone numeric check); the Metal kernel did not.

2. kernel_dsv4_compressor_store_one (decode-time single-row store in metal/dsv4_kv.metal): only handled F16/F32 ape_type; Q8_0 fell into the F32 else branch and read garbage. Add a Q8_0 branch that walks the block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for each compressor invocation, not the shared g_compressor_store_ape_buffer: multiple CPU writes to one shared buffer in the same command buffer collapse to the last write at execute time (Metal kernels run in encode order, but CPU writes don't participate in that ordering when the same scratch is reused). The per-call buffer is retained until command-buffer completion via addCompletedHandler because Metal does not strongly retain buffers bound to encoders.

Changes:
* ds4_metal.m: new ds4_metal_half_bits_to_float and ds4_metal_cpu_dequant_q8_0_rows helpers (verified against the gguf-py dequantize reference); replace the Q8_0 branches in ds4_metal_encode_compressor_score_with_ape and ds4_metal_compressor_store_batch_tensor with CPU-side dequant into per-call private buffers retained via addCompletedHandler.
* metal/dsv4_kv.metal: add a Q8_0 branch to kernel_dsv4_compressor_store_one.
* metal/cpy.metal: kernel_cpy_q8_0_f32 is left in place but no longer reached from the compressor paths (its registration in ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:
* make ds4-server clean.
* Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e. no DS4_METAL_MPP_F16_DISABLE workaround): 21-token prompt -> coherent generation ("An LLM, or Large Language Model, is a type of artificial intelligence"). Previously this prompt generated "An LLM, or large language" and then <BOS> token spam.
* Pre-harmonized variant: still works byte-for-byte the same as before this change, no F16/F32 path regressions.
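For reference, here is a minimal CPU-side sketch of the Q8_0 walk this commit describes (34-byte blocks: a little-endian uint16 half-precision scale followed by 32 int8 quants). The helper names mirror the commit, but the bodies are illustrative, not the shipped ds4 code:

```c
/* Illustrative CPU-side Q8_0 row dequant; helper names follow the commit,
 * the implementation is a sketch of the described block layout. */
#include <stdint.h>
#include <stddef.h>
#include <math.h>

#define Q8_0_BLOCK_QUANTS 32
#define Q8_0_BLOCK_BYTES  34   /* 2-byte half scale + 32 int8 quants */

/* Decode an IEEE-754 binary16 stored as raw bits into a float. */
static float ds4_metal_half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) & 1u;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t mant = (uint32_t)h & 0x3FFu;
    float s = sign ? -1.0f : 1.0f;
    if (exp == 0)    return s * ldexpf((float)mant, -24);       /* subnormal / zero */
    if (exp == 0x1F) return mant ? NAN : s * INFINITY;          /* inf / nan */
    return s * ldexpf((float)(mant | 0x400u), (int)exp - 25);   /* normal */
}

/* Dequantize n_rows x n_cols of Q8_0 data (n_cols must be a multiple of 32). */
static void ds4_metal_cpu_dequant_q8_0_rows(const uint8_t *src, float *dst,
                                            size_t n_rows, size_t n_cols) {
    size_t blocks_per_row = n_cols / Q8_0_BLOCK_QUANTS;
    for (size_t r = 0; r < n_rows; r++) {
        const uint8_t *row = src + r * blocks_per_row * Q8_0_BLOCK_BYTES;
        for (size_t b = 0; b < blocks_per_row; b++) {
            const uint8_t *blk = row + b * Q8_0_BLOCK_BYTES;
            uint16_t scale_bits = (uint16_t)(blk[0] | (blk[1] << 8)); /* little-endian */
            float d = ds4_metal_half_bits_to_float(scale_bits);
            const int8_t *q = (const int8_t *)(blk + 2);
            for (size_t i = 0; i < Q8_0_BLOCK_QUANTS; i++)
                dst[r * n_cols + b * Q8_0_BLOCK_QUANTS + i] = d * (float)q[i];
        }
    }
}
```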
Force-pushed from 68547f8 to b703636
…8_0 fixes) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Forward-compat heads-up while you're rebasing: there's a latent compressor-APE interaction between this PR and Q8_0 ape ingestion (e.g. for stock-recipe GGUFs like cyberneurova's) on M5 Max.
The bug only manifests when both your MPP work and the Q8_0 ape branches are present, so there is nothing to fix in this PR in isolation. Just flagging it in case #60 (the stock-recipe loader PR) lands — the fix is in the compressor-APE commit described above. |
|
Thanks for flagging this. Currently I'm facing logprob drift in this PR compared to the standard Metal kernel and CPU; it's not big, but I'm trying to lower it as much as possible. I will then rebase onto main and start testing with some coding harness. |
Force-pushed from 0fe868d to 645bc1a
|
Logprob drift fixed 🎉 Now I'm more confident from a mathematical perspective! |
Pulls in ivanfioravanti's PR antirez#15 head 645bc1a (logprob drift fix) plus the audreyt-side resolution work that keeps the cyberneurova stock-recipe Q8_0 path working end-to-end:

b3e601a Merge updated PR antirez#15 (logprob drift fix) into m5 branch
267b200 fix(metal/moe): take ivan's moe.metal verbatim after PR antirez#15 merge
cb50855 fix(metal): polymorphic dispatch for compressor KV/gate matmul

Validated on M5 Max:
* cyberneurova Q2_K (stock-recipe Q8_0 compressors): coherent generation
* antirez Q2_K (F16 compressors): ds4_test --metal-kernels, --long-context, --logprob-vectors all OK.

# Conflicts:
# README.md
Force-pushed from 645bc1a to c264f46
|
@audreyt I will port the MPP additions here to the new naming on the main branch. I will test this PR like crazy after that and then rebase. |
Force-pushed from c264f46 to ff2d499
|
We are getting there. I've used ds4-bench to make things easier to compare. I don't like the MPP name (Metal Performance Primitives); let's use something more related to the new Tensor API, like Metal Tensor (mt). There is a difference in logprob between the standard Metal kernel and this one, so I will now run evals on both to measure the impact. If this is OK, I will ask super @antirez how to proceed.
|
Brings in Ivan's PR antirez#15 follow-ups (Tune Metal MPP defaults / Improve MPP prefill throughput / Low-power Q8 profile) plus the ds4_test rename fix, on top of the swival/m5 work that main already absorbed.

Conflict resolution:
- ds4_metal.m: drop main's older ds4_gpu_mpp_q8_0_partial_tiles_enabled in favor of m5's newer version from ff2d499 (handles low-power mode).
- README.md: keep HEAD (the Abliterated fork narrative); m5's README is the original DwarfStar 4 readme with Swival's M5 narrative appended.
The cyberneurova abliterated weights have been re-quantised into the IQ2XXS-w2Q2K imatrix recipe with F16 token embedding, so the stock ds4 main loader takes them directly — no support-q8_0-token-embd workaround needed. PR #60 has been closed accordingly.

The §3.1 "audreyt/ds4 main carries (a)+(b)+(c)" list collapses: (a) the support-q8_0-token-embd loader work and (c) the cyber compressor compat fix that paired with it are no longer fork differentiators. Only (b) ivanfioravanti's M5 prefill optimisation (antirez/ds4#15) remains, and that itself converges once #15 lands upstream.

Also trim the DS4_SUPPORT_BRANCH note that pitched support-q8_0-token-embd as a fallback branch — that branch now offers strictly less than main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
I did a full AIME2025 evaluation round of the standard Metal kernel vs Tensor Metal with max context 16K, and here is what I got:
I'll keep this PR aligned with main and keep reducing the logprob diff while ensuring a performance boost at the same time. 💪 |
Force-pushed from 8664f51 to 1a93973
Raise the default Metal prefill chunk to 4096 and reuse the range-capable layer-major prefill graph for chunked ranges. Enable the guarded Q8_0 attn_q_b MPP route for <=2048-token prompt batches, dynamic Q8_0 tile width, the routed-MoE fast layout from layer 0, and the RB16 indexed decode path.

M5 Max post-patch ds4-bench profile with 64 generated tokens:
- 0.5k: prompt 443 t/s, generation 38.6 t/s
- 1k: prompt 459 t/s, generation 38.2 t/s
- 2k: prompt 522 t/s, generation 37.6 t/s
- 4k: prompt 486 t/s, generation 34.0 t/s
- 8k: prompt 465 t/s, generation 33.6 t/s

Tests: make all ds4_test; make test; git diff --check.
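For illustration, a minimal sketch of the chunking arithmetic described above; run_range and DS4_PREFILL_CHUNK_DEFAULT are hypothetical stand-ins for the real range-capable prefill entry point, which lives in the Metal encoder:

```c
/* Sketch of chunked prefill dispatch under assumed names; the real graph
 * reuse happens on the Metal side, this only shows the [start, end) split. */
#include <stddef.h>

#define DS4_PREFILL_CHUNK_DEFAULT 4096   /* raised default chunk size */

typedef int (*prefill_range_fn)(void *ctx, size_t pos_start, size_t pos_end);

static int prefill_prompt_chunked(void *ctx, size_t n_prompt_tokens,
                                  size_t chunk, prefill_range_fn run_range) {
    if (chunk == 0) chunk = DS4_PREFILL_CHUNK_DEFAULT;
    for (size_t start = 0; start < n_prompt_tokens; start += chunk) {
        size_t end = start + chunk;
        if (end > n_prompt_tokens) end = n_prompt_tokens;
        /* The same layer-major prefill graph is re-encoded per range. */
        int rc = run_range(ctx, start, end);
        if (rc != 0) return rc;
    }
    return 0;
}
```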
Detect macOS Low Power Mode and widen the Q8_0 prefill MPP route only under that condition, while preserving the guarded default for normal-power runs and explicit Q8_0 filters.

Low-power M5 Max baseline vs patched auto with 128 generated tokens:
- 0.5k: prefill 133.46 -> 196.89 t/s, gen 13.53 -> 15.08 t/s
- 1k: prefill 118.65 -> 188.91 t/s, gen 12.23 -> 14.93 t/s
- 2k: prefill 130.90 -> 220.33 t/s, gen 11.02 -> 14.65 t/s
- 4k: prefill 118.09 -> 212.81 t/s, gen 13.25 -> 14.00 t/s
- 8k: prefill 185.52 -> 206.49 t/s, gen 12.94 -> 13.84 t/s

Tests: make all ds4_test; make test; DS4_METAL_MPP_LOW_POWER_DISABLE=1 ./ds4_test --metal-mpp-equivalence; git diff --check.
Carries forward the pending "MPP -> Metal Tensor" naming refactor and adds:
- --dump-logits FILE CLI flag and run_logits_dump() so prefill-time logits can be captured for A/B drift comparison.
- bench/compare_logit_drift.py + bench/compare_bench.py + run helper.
- Macro plumbing in ds4_metal.m's library compile step for five env-gated drift flags (DS4_METAL_HC_STABLE default-on, DS4_METAL_NORM_RSQRT_DISABLE default-on, DS4_METAL_KV_RAW_F32 default-off, DS4_METAL_ROPE_EXP2_LOG2 default-off, DS4_METAL_TENSOR_MATMUL_DISABLE default-off).
- Logs the active flag set on first device init so test runs are self-documenting.

Per-kernel changes that consume each macro land in follow-up commits so they can be reverted independently if a drift measurement regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
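A rough sketch of how such env-gated flags can be resolved on the host side before being baked into the Metal library compile as preprocessor macros; the flag names are the ones listed above, while ds4_env_flag and the logging helper are illustrative, not the actual ds4_metal.m code:

```c
/* Illustrative env-flag resolution; only the flag names come from the commit. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Returns def when the variable is unset; "" and "0" count as off, anything else as on. */
static int ds4_env_flag(const char *name, int def) {
    const char *v = getenv(name);
    if (v == NULL) return def;
    if (v[0] == '\0' || strcmp(v, "0") == 0) return 0;
    return 1;
}

static void ds4_log_drift_flags(void) {
    /* Logged once on device init so test runs are self-documenting. */
    printf("drift flags: HC_STABLE=%d NORM_RSQRT_DISABLE=%d KV_RAW_F32=%d "
           "ROPE_EXP2_LOG2=%d TENSOR_MATMUL_DISABLE=%d\n",
           ds4_env_flag("DS4_METAL_HC_STABLE", 1),
           ds4_env_flag("DS4_METAL_NORM_RSQRT_DISABLE", 1),
           ds4_env_flag("DS4_METAL_KV_RAW_F32", 0),
           ds4_env_flag("DS4_METAL_ROPE_EXP2_LOG2", 0),
           ds4_env_flag("DS4_METAL_TENSOR_MATMUL_DISABLE", 0));
}

int main(void) { ds4_log_drift_flags(); return 0; }
```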
The HC=4 and scalar Sinkhorn split paths use 1/(1+exp(-z)) directly, which overflows when z is sufficiently negative (exp(-z) explodes). M5 Max's faster ALU is more likely than M3/M4 to push HC mixer inputs into that regime upstream, so the latent fragility may surface as logprob drift on M5 only. Replaces 1/(1+exp(-z)) with the identity 0.5*tanh(0.5*z) + 0.5 and 2/(1+exp(-z)) with 1 + tanh(0.5*z). Bounded across the full float range. The iter-0 vs iter-1+ epsilon application difference is left intact -- it is mirrored identically in the scalar reference path and appears to be an intentional Sinkhorn warm-up. Gated by DS4_METAL_HC_STABLE so the historical form can be A/B'd. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
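The identity is easy to sanity-check in plain C (this is only a numeric illustration of the rewrite, not the Metal kernel code):

```c
/* sigma(z) = 1/(1+exp(-z)) equals 0.5*tanh(0.5*z) + 0.5, and
 * 2*sigma(z) equals 1 + tanh(0.5*z). tanh is bounded in [-1, 1], so the
 * rewritten form stays finite even where exp(-z) overflows to infinity. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const float zs[] = { -100.0f, -20.0f, -1.0f, 0.0f, 1.0f, 20.0f, 100.0f };
    for (int i = 0; i < 7; i++) {
        float z = zs[i];
        float naive  = 1.0f / (1.0f + expf(-z));      /* expf(-z) overflows for very negative z */
        float stable = 0.5f * tanhf(0.5f * z) + 0.5f; /* bounded for all finite z */
        printf("z=%8.1f  naive=%.9g  stable=%.9g\n", z, naive, stable);
    }
    return 0;
}
```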
…ult on) kernel_rms_norm_fuse_impl uses 1.0f/sqrt(mean+eps); the fused kernel_dsv4_qkv_rms_norm_f32_4 was using rsqrt(...) for the same value. Apple Silicon's hardware rsqrt has implementation-defined precision and can differ from 1.0f/sqrt by ~1 ULP. Across the 43 layers of DeepSeek V4 Flash that per-layer ULP drift compounds visibly, and the rounding gap between rsqrt and div+sqrt isn't guaranteed to match between M3/M4 and M5 hardware families. Switch the fused QKV norm to 1.0f/sqrt(...) so both norm kernels share a single formula. Gated by DS4_METAL_NORM_RSQRT_DISABLE so the rsqrt path can be A/B'd. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kernel_dsv4_kv_fp8_store_f32 deliberately writes the raw cache row as (float)((half)q) so its precision matches the half-typed FlashAttention KV buffer the indexer references. With DS4_METAL_KV_RAW_F32 set, the half cast is skipped and the FP8-dequantized FP32 value is written verbatim. This is diagnostic only: enabling it makes the indexer see higher- precision values than FlashAttention, which is a deliberate mismatch that reveals how much drift the FP16 quantization contributes but is not safe to ship. Default off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
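A plain-C illustration of the precision gap between the two modes, assuming a compiler with _Float16 support (clang on Apple Silicon qualifies); the value of q here is arbitrary:

```c
/* Default path stores (float)((half)q) so the raw cache matches the
 * half-typed FlashAttention buffer; DS4_METAL_KV_RAW_F32 stores q verbatim. */
#include <stdio.h>

int main(void) {
    float q = 0.1234567f;                 /* stand-in for an FP8-dequantized FP32 value */
    float matched = (float)(_Float16)q;   /* default: round-trip through half */
    float raw     = q;                    /* diagnostic: verbatim FP32 */
    printf("raw=%.9g  half-matched=%.9g  delta=%.3g\n", raw, matched, raw - matched);
    return 0;
}
```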
Metal's pow(freq_base, k) is not IEEE-754 strict and the rounding can differ between GPU families. With DS4_METAL_ROPE_EXP2_LOG2 set, the RoPE angle is computed as exp2(k * log2(freq_base)) instead, using two primitives with tighter precision specifications. The change touches both the NeoX and default RoPE branches of kernel_dsv4_rope_tail_f32. Default off -- this is a diagnostic to quantify how much RoPE pow precision contributes to logprob drift on M5 Max relative to M3/M4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
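The rewrite relies on the identity pow(b, k) = exp2(k * log2(b)) for b > 0; here is a plain-C sanity check (the exponent shape below is just a typical RoPE-style example, not the kernel's exact formula):

```c
/* pow vs exp2/log2: mathematically identical for positive bases; the Metal
 * change matters because MSL gives exp2/log2 tighter precision specs. */
#include <math.h>
#include <stdio.h>

int main(void) {
    float freq_base = 10000.0f;
    for (int k = 0; k < 4; k++) {
        float t = -2.0f * (float)k / 128.0f;        /* illustrative RoPE-style exponent */
        float via_pow  = powf(freq_base, t);
        float via_exp2 = exp2f(t * log2f(freq_base));
        printf("k=%d  pow=%.9g  exp2log2=%.9g\n", k, via_pow, via_exp2);
    }
    return 0;
}
```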
When the macro un-defines DS4_METAL_HAS_TENSOR at library compile time the cooperative-tensor _mpp kernel templates are no longer in the library, but g_metal4_tensor_api_enabled was still truthy so the host dispatch layer kept attempting to fetch them. The result was a flood of "Metal kernel kernel_mul_mm_*_mpp_* function not found" warnings on the legacy fallback path. Flip g_metal4_tensor_api_enabled = 0 inside the same branch so the host code's ds4_gpu_use_mpp_*() and ds4_gpu_*_mpp_tensor() guards see the disabled state and skip _mpp lookups entirely. Measured on M5 Max with the short reasoning prompt: drift between -mt off and DS4_METAL_TENSOR_MATMUL_DISABLE=1 -mt auto is now exactly zero (rms=0, max_abs=0, max_rank_delta=0), confirming that the M5 Max logprob drift is sourced entirely in the Metal 4 cooperative-tensor matmul codepath and not in HC, norm, RoPE, or KV. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bisecting the M5 Max logprob drift on -mt auto:
- -mt off baseline: reference
- -mt auto (all routes): rms=0.150, max_abs=0.750, top20=0.263
- -mt auto + DS4_METAL_MPP_Q8_0_DISABLE=1: rms=0, max_abs=0 (exact)
- -mt auto + DS4_METAL_MPP_F16_DISABLE=1: still rms=0.150 (no help)
- -mt auto + DS4_METAL_MPP_ATTN_OUT_DISABLE=1: still rms=0.150
- -mt auto + DS4_METAL_MPP_MOE_{GATE,UP,DOWN}_DISABLE=1: still rms=0.150
The Metal 4 cooperative-tensor Q8_0 matmul (kernel_mul_mm_q8_0_f32_mpp
and direct_rhs variants in dense.metal) is the *sole* drift source on
M5 Max vs the legacy simdgroup_multiply_accumulate path. The other
Tensor routes (F16 compressor, attention-output low projection, routed
MoE gate/up/down) are bit-clean against -mt off.
Flip ds4_gpu_mpp_q8_0_default_target() to return 0 when the device
name contains "M5". Other Tensor routes continue to default on, so the
Q8_0 carve-out preserves the bulk of the Metal Tensor speedup (F16
compressor at layers 0-19, MoE at layers 20+, attn-out at layers 32-42).
Users who care more about prefill throughput than bit-equivalence can
opt back in with DS4_METAL_MPP_Q8_0_ENABLE=1.
Verified on M5 Max with default flags only: -mt auto now produces
exactly the -mt off logits (rms=0, max_abs=0, max_rank_delta=0,
same_top1=yes, top5_overlap=5/5, top20_overlap=20/20).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
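A sketch of what the carve-out amounts to; ds4_gpu_mpp_q8_0_default_target and DS4_METAL_MPP_Q8_0_ENABLE are the names used in this commit, while the device-name accessor, the env-override placement, and the body are illustrative assumptions rather than the real ds4 source:

```c
/* Illustrative device-name carve-out for the Q8_0 cooperative-tensor route. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Stand-in for the Metal device name (e.g. "Apple M5 Max"). */
static const char *ds4_gpu_device_name(void) { return "Apple M5 Max"; }

static int ds4_gpu_mpp_q8_0_default_target(void) {
    /* Explicit opt-in wins over the carve-out. */
    if (getenv("DS4_METAL_MPP_Q8_0_ENABLE") != NULL) return 1;
    const char *name = ds4_gpu_device_name();
    /* Default the Q8_0 Tensor route off on M5-family devices, where it is
     * the sole source of logprob drift vs the legacy simdgroup path. */
    if (name != NULL && strstr(name, "M5") != NULL) return 0;
    return 1;   /* other devices keep the default-on behavior */
}

int main(void) {
    printf("q8_0 mpp default: %d\n", ds4_gpu_mpp_q8_0_default_target());
    return 0;
}
```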
MTLCompileOptions.fastMathEnabled defaults to YES and Apple's headers explicitly note this "may violate the IEEE 754 standard". With safe math forced via MTLMathModeSafe (macOS 15+) or fastMathEnabled=NO (deprecated fallback), drift between -mt off and -mt auto on M5 Max shrinks ~4x (rms 0.150 -> 0.037, max_abs 0.75 -> 0.19) -- showing that fast-math optimizations applied differently across the two hardware paths were amplifying the underlying matmul2d divergence. Default OFF: enabling safe math also moves -mt off away from the fast-math production reference (rms=0.63 vs original fast-math baseline) so it isn't a drop-in fix. Useful as a diagnostic to localize remaining drift sources and as an option for users who prefer strict IEEE-754 semantics over fast-math speed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit (75f0930) added the M5 carve-out by editing ds4_gpu_mpp_q8_0_default_target(), but that helper was also being reused as the default-target for ds4_gpu_use_mpp_f16_compressor_matmul (line 1363) and for the verbose memory-report banner that prints mpp_f16 (line 2102). That coupled F16 compressor default-on/off to the Q8 carve-out, which is wrong: the per-route bisection showed F16 is bit-clean on M5; only Q8 needed to flip default-off. Introduce a dedicated ds4_gpu_mpp_f16_default_target() that always returns 1 and use it at the two F16 call sites. The Q8 helper keeps its M5 carve-out unchanged. Verified on M5 Max with default flags: -mt auto still produces zero drift vs -mt off (rms=0, max_abs=0, max_rank_delta=0), and the F16 compressor Tensor route is now back to default-on on M5 as intended. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_metal_q8_0_mpp_matmul_case() built the reference output by calling ds4_gpu_matmul_q8_0_tensor() after ds4_gpu_set_quality(false). The set_quality(false) call enables MPP routing, and the dispatcher at ds4_metal.m:6277 then routes to ds4_gpu_matmul_q8_0_mpp_tensor() when the MPP can_use gate passes. So on M5 with Metal 4 tensor API enabled, the "reference" was actually the MPP output, and the test compared the MPP kernel to itself -- the max_abs/rms numbers were always near zero and any divergence in the MPP kernel itself would not have been caught. Force ds4_gpu_set_quality(true) around the reference call so the dispatcher takes the legacy simdgroup_multiply_accumulate path, then restore set_quality(false) before invoking ds4_gpu_matmul_q8_0_mpp_tensor() directly for the candidate. The reference and candidate now exercise the two different code paths the test was originally meant to compare. Verified on M5 Max: ./ds4_test --metal-kernels still passes, meaning the M5 cooperative-tensor Q8 matmul agrees with the legacy path within the 0.10 max-abs kernel target on the test shapes. The systemic drift in -mt auto comes from many small matmul deltas compounding through 43 layers, not from any single kernel exceeding the per-call threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
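In outline, the fix looks like the sketch below; the three function names come from this commit, but the signatures and stub bodies are placeholders rather than the real ds4_test code:

```c
/* Stand-ins for the entry points named in the commit, stubbed so the
 * sketch stays self-contained; only the call ordering is the point. */
static int g_quality = 0;
static void ds4_gpu_set_quality(int quality) { g_quality = quality; }
static void ds4_gpu_matmul_q8_0_tensor(float *out) { (void)out; /* dispatcher-routed */ }
static void ds4_gpu_matmul_q8_0_mpp_tensor(float *out) { (void)out; /* MPP kernel, direct */ }

static void build_reference_and_candidate(float *ref, float *cand) {
    /* Reference: force the legacy simdgroup path by enabling quality mode,
     * so the dispatcher cannot silently route to the MPP kernel. */
    ds4_gpu_set_quality(1);
    ds4_gpu_matmul_q8_0_tensor(ref);

    /* Candidate: restore MPP routing and invoke the MPP kernel directly. */
    ds4_gpu_set_quality(0);
    ds4_gpu_matmul_q8_0_mpp_tensor(cand);
}
```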
…bers
Two corrections triggered by another reviewer's audit:
1. The auto-suite description claimed "auto enables Q8_0 prefill ...";
on M5 that is no longer true now that 75f0930 defaults Q8_0 Tensor
off on M5. Reword the section so it lists F16 compressor, attn-out,
and MoE as the auto-enabled routes, then call out the M5 carve-out
for Q8_0 explicitly with the env-var opt-in.
2. Refresh worst-case suite numbers measured on the current branch
(codex/metal4-m5-drift-patches after the F16-coupling fix 78fa48f
and the test-self-reference fix 580e896) on M5 Max:
worst_rms = 0.169 (was documented ~= 0.170)
worst_top20_max_abs = 0.306 (was documented ~= 0.342)
worst_max_abs = 0.922
min_top5_overlap = 5/5
min_top20_overlap = 20/20 (was 19/20)
worst_rank_delta = 1
Three short fixtures (short_italian_fact, short_code_completion,
short_reasoning_plain) are now bit-exact (rms=0); the residual
drift is concentrated on the two long-context fixtures and comes
from the F16 compressor, attention-output, and routed-MoE Tensor
routes still being default-on, compounding small per-matmul
deltas through 43 layers.
The Q8_0 isolation paragraph also picks up the M5 default-off note
so the env-var docs stay consistent with the runtime behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 1a93973 to 49c1137
Status update

Quality perspective: Now the Tensor Metal Kernel and the Standard Metal Kernel are very similar from a mathematical perspective and generation-equivalent. Here is the comparison of each vs --quality:

Tensor Metal:

Default Metal (main branch):

Here is the test against the other M5 implementation, Swival ds4-m5.

From an eval perspective, the new slower Tensor Metal Kernel has improved AIME2025, reaching 90% in Q2 (unbelievable!)

Performance: Here the situation is clearly different from the previous version; there is still a benefit, but more in the range of ~15%.
@antirez I'll leave this as a draft and keep rebasing on main; I'll remove the drift-prone Q8_0 Tensor route and all related tests, docs, etc. When you get your M5 Max gifted by super @audreyt, I'm sure you'll be able to do even more. |
Summary
DS4_METAL_MPP_DISABLE=1, DS4_METAL_MOE_SUM6_DISABLE=1, staged routed MoE envs)

Benchmarks

Prompt source: README.md; command shape: ./ds4 --prompt-file <prompt> -n 1 --nothink --ctx 32768; 3 repeats per target.

8192-token routed MoE stage profile, current M5 default:
Disabling routed MoE MPP on the same profile drops prefill from 442.33 tok/s to 359.17 tok/s and raises gate/up/down to about 18.4-18.7 ms per layer.
Validation
make ds4 ds4_test
./ds4_test --metal-kernels
./ds4_test --long-context
./ds4_test --logprob-vectors

Notes

- up boundaries with gate=13 were tested but failed long-context; gate/up stay paired at layer 13.
- DS4_METAL_MOE_MID_F32=1 looked slightly faster in a noisy local check, but the result is too small to promote without a broader clean sweep.