Add Metal 4 M5 prefill optimizations #15
Conversation
|
Very significant speedup. I'm in the middle of a large refactoring so I can't merge right now, and there is the problem that I lack the hardware to maintain it later, but I'll see what I can do :) Thanks. |
|
100% with you, we need to sort this out and let you get an M5 😎 |
|
This should solve #14 |
|
@ivanfioravanti potential idea to get this merged: we keep an m5-metal4 branch active, and you try to keep it rebased, if you like the idea. And we document it. |
|
@ivanfioravanti, just jumping in :) In a couple of days I'll switch to an M5 Max 128GB. If it could be useful, I'd be more than happy to help you maintain the branch. |
|
Oh yes @ottaviofogliata join the club! I'm trying to squeeze even more juice with the various optimizations suggested in the Metal Performance Primitives (MPP) Programming Guide without luck so far. |
Force-pushed from 980ba1a to 68547f8
|
Logits are slightly different from the ones created with --quality. Converting to draft while I investigate: greedy decoding is perfect, but the distribution behind it is different. |
This is a personal fork that combines two open upstream PRs (the support-q8_0-token-embd loader PR I sent to antirez/ds4, and ivanfioravanti's PR antirez#15 for M5 Metal 4 prefill optimizations) so I can run unmodified stock-recipe DeepSeek-V4-Flash GGUFs on M5 hardware before either PR lands upstream. The README explains:

* What the two combined PRs do and why they're combined here.
* Verified test matrix on M5 Max (antirez recipe, cyberneurova stock recipe, cyberneurova pre-harmonized).
* The known MPP F16 + cyberneurova interaction (workaround: DS4_METAL_MPP_F16_DISABLE=1) and why it's separate from the loader PR's scope.
* Build / run instructions for both recipes.
* Acknowledgements to antirez, ivanfioravanti, ggml/llama.cpp, and the cyberneurova research project.
* Pointer back to the upstream README for the original design and server/CLI docs (no duplication).
|
If someone at Apple knew what was good for their hardware sales, they would have an M5 Studio 256GB and an M5 MacBook Pro 128GB on @antirez's desk Monday afternoon. Unfortunately I don't know anyone at Apple, but hopefully there are some developers at Apple watching this project who will wake up to this opportunity. |
|
@ucjonathan I was going to propose the same thing! |
When a stock-recipe GGUF (cyberneurova-style: Q8_0 small tensors, F32 router) is loaded on M5 with PR antirez#15's MPP optimizations enabled, the compressor APE path silently produces wrong output and prefill emits garbage tokens (typically <BOS> spam after a few coherent tokens). The prefill code itself is correct; the bug is in two compressor APE consumers that were updated to accept Q8_0 ape_type but couldn't read the Q8_0 byte layout correctly:

1. kernel_cpy_q8_0_f32 Metal kernel (added in support-q8_0-token-embd for the prefill APE byte-strided dequant): produces silently wrong output on M5 Max for the compressor APE shapes (4 rows x 1024 cols). Replaced with a CPU-side dequant into a per-call private MTLBuffer. The CPU dequant matches the gguf-py reference byte-for-byte (verified with a standalone numeric check); the Metal kernel did not.

2. kernel_dsv4_compressor_store_one (decode-time single-row store in metal/dsv4_kv.metal): only handled F16/F32 ape_type; Q8_0 fell into the F32 else branch and read garbage. Add a Q8_0 branch that walks the block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block) directly.

The CPU dequant path also has to use a *fresh per-call* MTLBuffer for each compressor invocation, not the shared g_compressor_store_ape_buffer: multiple CPU writes to one shared buffer in the same command buffer collapse to the last write at execute time (Metal kernels run in encode order, but CPU writes don't participate in that ordering when the same scratch is reused). The per-call buffer is retained until command-buffer completion via addCompletedHandler because Metal does not strongly retain buffers bound to encoders.

Changes:
* ds4_metal.m: new ds4_metal_half_bits_to_float and ds4_metal_cpu_dequant_q8_0_rows helpers (verified against the gguf-py dequantize reference); replace the Q8_0 branches in ds4_metal_encode_compressor_score_with_ape and ds4_metal_compressor_store_batch_tensor with CPU-side dequant into per-call private buffers retained via addCompletedHandler.
* metal/dsv4_kv.metal: add a Q8_0 branch to kernel_dsv4_compressor_store_one.
* metal/cpy.metal: kernel_cpy_q8_0_f32 is left in place but no longer reached from the compressor paths (its registration in ds4_metal.m is harmless).

Tested on macOS / M-series / Metal:
* make ds4-server clean.
* Cyberneurova Q2_K GGUF entirely unmodified, MPP F16 enabled (i.e. no DS4_METAL_MPP_F16_DISABLE workaround): 21-token prompt -> coherent generation ("An LLM, or Large Language Model, is a type of artificial intelligence"). Previously this prompt generated "An LLM, or large language" and then <BOS> token spam.
* Pre-harmonized variant: still works byte-for-byte the same as before this change, no F16/F32 path regressions.
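For reference, here is a minimal CPU-side sketch of the Q8_0 walk this commit describes (34-byte blocks: a little-endian uint16 half-precision scale followed by 32 int8 quants). The helper names mirror the commit, but the bodies are illustrative, not the shipped ds4 code:

```c
/* Illustrative CPU-side Q8_0 row dequant; helper names follow the commit,
 * the implementation is a sketch of the described block layout. */
#include <stdint.h>
#include <stddef.h>
#include <math.h>

#define Q8_0_BLOCK_QUANTS 32
#define Q8_0_BLOCK_BYTES  34   /* 2-byte half scale + 32 int8 quants */

/* Decode an IEEE-754 binary16 stored as raw bits into a float. */
static float ds4_metal_half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) & 1u;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t mant = (uint32_t)h & 0x3FFu;
    float s = sign ? -1.0f : 1.0f;
    if (exp == 0)    return s * ldexpf((float)mant, -24);       /* subnormal / zero */
    if (exp == 0x1F) return mant ? NAN : s * INFINITY;          /* inf / nan */
    return s * ldexpf((float)(mant | 0x400u), (int)exp - 25);   /* normal */
}

/* Dequantize n_rows x n_cols of Q8_0 data (n_cols must be a multiple of 32). */
static void ds4_metal_cpu_dequant_q8_0_rows(const uint8_t *src, float *dst,
                                            size_t n_rows, size_t n_cols) {
    size_t blocks_per_row = n_cols / Q8_0_BLOCK_QUANTS;
    for (size_t r = 0; r < n_rows; r++) {
        const uint8_t *row = src + r * blocks_per_row * Q8_0_BLOCK_BYTES;
        for (size_t b = 0; b < blocks_per_row; b++) {
            const uint8_t *blk = row + b * Q8_0_BLOCK_BYTES;
            uint16_t scale_bits = (uint16_t)(blk[0] | (blk[1] << 8)); /* little-endian */
            float d = ds4_metal_half_bits_to_float(scale_bits);
            const int8_t *q = (const int8_t *)(blk + 2);
            for (size_t i = 0; i < Q8_0_BLOCK_QUANTS; i++)
                dst[r * n_cols + b * Q8_0_BLOCK_QUANTS + i] = d * (float)q[i];
        }
    }
}
```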
Force-pushed from 68547f8 to b703636
…8_0 fixes) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Forward-compat heads-up while you're rebasing: there's a latent compressor-APE interaction between this PR and Q8_0 ape ingestion (e.g. for stock-recipe GGUFs like cyberneurova's) on M5 Max.
The bug only manifests when both your MPP work and the Q8_0 ape branches are present, so there is nothing to fix in this PR in isolation. Just flagging it in case #60 (the stock-recipe loader PR) lands — the fix is in the compressor-APE commit described above. |
|
Thanks for flagging this. Currently I'm facing logprob drift in this PR compared to the standard Metal kernel and CPU; it's not big, but I'm trying to lower it as much as possible. I will then rebase onto main and start testing with some coding harness. |
Force-pushed from 0fe868d to 645bc1a
|
Logprob drift fixed 🎉 Now I'm more confident from a mathematical perspective! |
Pulls in ivanfioravanti's PR antirez#15 head 645bc1a (logprob drift fix) plus the audreyt-side resolution work that keeps the cyberneurova stock-recipe Q8_0 path working end-to-end:

b3e601a Merge updated PR antirez#15 (logprob drift fix) into m5 branch
267b200 fix(metal/moe): take ivan's moe.metal verbatim after PR antirez#15 merge
cb50855 fix(metal): polymorphic dispatch for compressor KV/gate matmul

Validated on M5 Max:
* cyberneurova Q2_K (stock-recipe Q8_0 compressors): coherent generation
* antirez Q2_K (F16 compressors): ds4_test --metal-kernels, --long-context, --logprob-vectors all OK.

# Conflicts:
# README.md
Force-pushed from 645bc1a to c264f46
|
@audreyt I will port the MPP additions here to the new naming on the main branch. I will test this PR like crazy after that and then rebase. |
Force-pushed from c264f46 to ff2d499
|
We are getting there. I've used ds4-bench to make things easier to compare. I don't like the MPP name (Metal Performance Primitives); let's use something more related to the new Tensor API, like Metal Tensor (mt). There is a difference in logprob between the standard Metal kernel and this one, so I will now run evals on both to measure the impact. If this is OK, I will ask super @antirez how to proceed.
|
Brings in Ivan's PR antirez#15 follow-ups (Tune Metal MPP defaults / Improve MPP prefill throughput / Low-power Q8 profile) plus the ds4_test rename fix, on top of the swival/m5 work that main already absorbed.

Conflict resolution:
- ds4_metal.m: drop main's older ds4_gpu_mpp_q8_0_partial_tiles_enabled in favor of m5's newer version from ff2d499 (handles low-power mode).
- README.md: keep HEAD (the Abliterated fork narrative); m5's README is the original DwarfStar 4 readme with Swival's M5 narrative appended.
The cyberneurova abliterated weights have been re-quantised into the IQ2XXS-w2Q2K imatrix recipe with F16 token embedding, so the stock ds4 main loader takes them directly — no support-q8_0-token-embd workaround needed. PR #60 has been closed accordingly.

The §3.1 "audreyt/ds4 main carries (a)+(b)+(c)" list collapses: (a) the support-q8_0-token-embd loader work and (c) the cyber compressor compat fix that paired with it are no longer fork differentiators. Only (b) ivanfioravanti's M5 prefill optimisation (antirez/ds4#15) remains, and that itself converges once #15 lands upstream.

Also trim the DS4_SUPPORT_BRANCH note that pitched support-q8_0-token-embd as a fallback branch — that branch now offers strictly less than main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
I did a full AIME2025 evaluation round of the standard Metal kernel vs Tensor Metal with max context 16K, and here is what I got:
I'll keep this PR aligned with main and keep reducing the logprob diff while ensuring a performance boost at the same time. 💪 |
Force-pushed from 8664f51 to 1a93973
Raise the default Metal prefill chunk to 4096 and reuse the range-capable layer-major prefill graph for chunked ranges. Enable the guarded Q8_0 attn_q_b MPP route for <=2048-token prompt batches, dynamic Q8_0 tile width, the routed-MoE fast layout from layer 0, and the RB16 indexed decode path.

M5 Max post-patch ds4-bench profile with 64 generated tokens:
- 0.5k: prompt 443 t/s, generation 38.6 t/s
- 1k: prompt 459 t/s, generation 38.2 t/s
- 2k: prompt 522 t/s, generation 37.6 t/s
- 4k: prompt 486 t/s, generation 34.0 t/s
- 8k: prompt 465 t/s, generation 33.6 t/s

Tests: make all ds4_test; make test; git diff --check.
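For illustration, a minimal sketch of the chunking arithmetic described above; run_range and DS4_PREFILL_CHUNK_DEFAULT are hypothetical stand-ins for the real range-capable prefill entry point, which lives in the Metal encoder:

```c
/* Sketch of chunked prefill dispatch under assumed names; the real graph
 * reuse happens on the Metal side, this only shows the [start, end) split. */
#include <stddef.h>

#define DS4_PREFILL_CHUNK_DEFAULT 4096   /* raised default chunk size */

typedef int (*prefill_range_fn)(void *ctx, size_t pos_start, size_t pos_end);

static int prefill_prompt_chunked(void *ctx, size_t n_prompt_tokens,
                                  size_t chunk, prefill_range_fn run_range) {
    if (chunk == 0) chunk = DS4_PREFILL_CHUNK_DEFAULT;
    for (size_t start = 0; start < n_prompt_tokens; start += chunk) {
        size_t end = start + chunk;
        if (end > n_prompt_tokens) end = n_prompt_tokens;
        /* The same layer-major prefill graph is re-encoded per range. */
        int rc = run_range(ctx, start, end);
        if (rc != 0) return rc;
    }
    return 0;
}
```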
Detect macOS Low Power Mode and widen the Q8_0 prefill MPP route only under that condition, while preserving the guarded default for normal-power runs and explicit Q8_0 filters.

Low-power M5 Max baseline vs patched auto with 128 generated tokens:
- 0.5k: prefill 133.46 -> 196.89 t/s, gen 13.53 -> 15.08 t/s
- 1k: prefill 118.65 -> 188.91 t/s, gen 12.23 -> 14.93 t/s
- 2k: prefill 130.90 -> 220.33 t/s, gen 11.02 -> 14.65 t/s
- 4k: prefill 118.09 -> 212.81 t/s, gen 13.25 -> 14.00 t/s
- 8k: prefill 185.52 -> 206.49 t/s, gen 12.94 -> 13.84 t/s

Tests: make all ds4_test; make test; DS4_METAL_MPP_LOW_POWER_DISABLE=1 ./ds4_test --metal-mpp-equivalence; git diff --check.
Carries forward the pending "MPP -> Metal Tensor" naming refactor and adds:
- --dump-logits FILE CLI flag and run_logits_dump() so prefill-time logits can be captured for A/B drift comparison.
- bench/compare_logit_drift.py + bench/compare_bench.py + run helper.
- Macro plumbing in ds4_metal.m's library compile step for five env-gated drift flags (DS4_METAL_HC_STABLE default-on, DS4_METAL_NORM_RSQRT_DISABLE default-on, DS4_METAL_KV_RAW_F32 default-off, DS4_METAL_ROPE_EXP2_LOG2 default-off, DS4_METAL_TENSOR_MATMUL_DISABLE default-off).
- Logs the active flag set on first device init so test runs are self-documenting.

Per-kernel changes that consume each macro land in follow-up commits so they can be reverted independently if a drift measurement regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
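A rough sketch of how such env-gated flags can be resolved on the host side before being baked into the Metal library compile as preprocessor macros; the flag names are the ones listed above, while ds4_env_flag and the logging helper are illustrative, not the actual ds4_metal.m code:

```c
/* Illustrative env-flag resolution; only the flag names come from the commit. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Returns def when the variable is unset; "" and "0" count as off, anything else as on. */
static int ds4_env_flag(const char *name, int def) {
    const char *v = getenv(name);
    if (v == NULL) return def;
    if (v[0] == '\0' || strcmp(v, "0") == 0) return 0;
    return 1;
}

static void ds4_log_drift_flags(void) {
    /* Logged once on device init so test runs are self-documenting. */
    printf("drift flags: HC_STABLE=%d NORM_RSQRT_DISABLE=%d KV_RAW_F32=%d "
           "ROPE_EXP2_LOG2=%d TENSOR_MATMUL_DISABLE=%d\n",
           ds4_env_flag("DS4_METAL_HC_STABLE", 1),
           ds4_env_flag("DS4_METAL_NORM_RSQRT_DISABLE", 1),
           ds4_env_flag("DS4_METAL_KV_RAW_F32", 0),
           ds4_env_flag("DS4_METAL_ROPE_EXP2_LOG2", 0),
           ds4_env_flag("DS4_METAL_TENSOR_MATMUL_DISABLE", 0));
}

int main(void) { ds4_log_drift_flags(); return 0; }
```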
The HC=4 and scalar Sinkhorn split paths use 1/(1+exp(-z)) directly, which overflows when z is sufficiently negative (exp(-z) explodes). M5 Max's faster ALU is more likely than M3/M4 to push HC mixer inputs into that regime upstream, so the latent fragility may surface as logprob drift on M5 only. Replaces 1/(1+exp(-z)) with the identity 0.5*tanh(0.5*z) + 0.5 and 2/(1+exp(-z)) with 1 + tanh(0.5*z). Bounded across the full float range. The iter-0 vs iter-1+ epsilon application difference is left intact -- it is mirrored identically in the scalar reference path and appears to be an intentional Sinkhorn warm-up. Gated by DS4_METAL_HC_STABLE so the historical form can be A/B'd. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
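The identity is easy to sanity-check in plain C (this is only a numeric illustration of the rewrite, not the Metal kernel code):

```c
/* sigma(z) = 1/(1+exp(-z)) equals 0.5*tanh(0.5*z) + 0.5, and
 * 2*sigma(z) equals 1 + tanh(0.5*z). tanh is bounded in [-1, 1], so the
 * rewritten form stays finite even where exp(-z) overflows to infinity. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const float zs[] = { -100.0f, -20.0f, -1.0f, 0.0f, 1.0f, 20.0f, 100.0f };
    for (int i = 0; i < 7; i++) {
        float z = zs[i];
        float naive  = 1.0f / (1.0f + expf(-z));      /* expf(-z) overflows for very negative z */
        float stable = 0.5f * tanhf(0.5f * z) + 0.5f; /* bounded for all finite z */
        printf("z=%8.1f  naive=%.9g  stable=%.9g\n", z, naive, stable);
    }
    return 0;
}
```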
…ult on) kernel_rms_norm_fuse_impl uses 1.0f/sqrt(mean+eps); the fused kernel_dsv4_qkv_rms_norm_f32_4 was using rsqrt(...) for the same value. Apple Silicon's hardware rsqrt has implementation-defined precision and can differ from 1.0f/sqrt by ~1 ULP. Across the 43 layers of DeepSeek V4 Flash that per-layer ULP drift compounds visibly, and the rounding gap between rsqrt and div+sqrt isn't guaranteed to match between M3/M4 and M5 hardware families. Switch the fused QKV norm to 1.0f/sqrt(...) so both norm kernels share a single formula. Gated by DS4_METAL_NORM_RSQRT_DISABLE so the rsqrt path can be A/B'd. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kernel_dsv4_kv_fp8_store_f32 deliberately writes the raw cache row as (float)((half)q) so its precision matches the half-typed FlashAttention KV buffer the indexer references. With DS4_METAL_KV_RAW_F32 set, the half cast is skipped and the FP8-dequantized FP32 value is written verbatim. This is diagnostic only: enabling it makes the indexer see higher- precision values than FlashAttention, which is a deliberate mismatch that reveals how much drift the FP16 quantization contributes but is not safe to ship. Default off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
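A plain-C illustration of the precision gap between the two modes, assuming a compiler with _Float16 support (clang on Apple Silicon qualifies); the value of q here is arbitrary:

```c
/* Default path stores (float)((half)q) so the raw cache matches the
 * half-typed FlashAttention buffer; DS4_METAL_KV_RAW_F32 stores q verbatim. */
#include <stdio.h>

int main(void) {
    float q = 0.1234567f;                 /* stand-in for an FP8-dequantized FP32 value */
    float matched = (float)(_Float16)q;   /* default: round-trip through half */
    float raw     = q;                    /* diagnostic: verbatim FP32 */
    printf("raw=%.9g  half-matched=%.9g  delta=%.3g\n", raw, matched, raw - matched);
    return 0;
}
```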
Metal's pow(freq_base, k) is not IEEE-754 strict and the rounding can differ between GPU families. With DS4_METAL_ROPE_EXP2_LOG2 set, the RoPE angle is computed as exp2(k * log2(freq_base)) instead, using two primitives with tighter precision specifications. The change touches both the NeoX and default RoPE branches of kernel_dsv4_rope_tail_f32. Default off -- this is a diagnostic to quantify how much RoPE pow precision contributes to logprob drift on M5 Max relative to M3/M4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
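The rewrite relies on the identity pow(b, k) = exp2(k * log2(b)) for b > 0; here is a plain-C sanity check (the exponent shape below is just a typical RoPE-style example, not the kernel's exact formula):

```c
/* pow vs exp2/log2: mathematically identical for positive bases; the Metal
 * change matters because MSL gives exp2/log2 tighter precision specs. */
#include <math.h>
#include <stdio.h>

int main(void) {
    float freq_base = 10000.0f;
    for (int k = 0; k < 4; k++) {
        float t = -2.0f * (float)k / 128.0f;        /* illustrative RoPE-style exponent */
        float via_pow  = powf(freq_base, t);
        float via_exp2 = exp2f(t * log2f(freq_base));
        printf("k=%d  pow=%.9g  exp2log2=%.9g\n", k, via_pow, via_exp2);
    }
    return 0;
}
```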
When the macro un-defines DS4_METAL_HAS_TENSOR at library compile time the cooperative-tensor _mpp kernel templates are no longer in the library, but g_metal4_tensor_api_enabled was still truthy so the host dispatch layer kept attempting to fetch them. The result was a flood of "Metal kernel kernel_mul_mm_*_mpp_* function not found" warnings on the legacy fallback path. Flip g_metal4_tensor_api_enabled = 0 inside the same branch so the host code's ds4_gpu_use_mpp_*() and ds4_gpu_*_mpp_tensor() guards see the disabled state and skip _mpp lookups entirely. Measured on M5 Max with the short reasoning prompt: drift between -mt off and DS4_METAL_TENSOR_MATMUL_DISABLE=1 -mt auto is now exactly zero (rms=0, max_abs=0, max_rank_delta=0), confirming that the M5 Max logprob drift is sourced entirely in the Metal 4 cooperative-tensor matmul codepath and not in HC, norm, RoPE, or KV. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bisecting the M5 Max logprob drift on -mt auto:
- -mt off baseline: reference
- -mt auto (all routes): rms=0.150, max_abs=0.750, top20=0.263
- -mt auto + DS4_METAL_MPP_Q8_0_DISABLE=1: rms=0, max_abs=0 (exact)
- -mt auto + DS4_METAL_MPP_F16_DISABLE=1: still rms=0.150 (no help)
- -mt auto + DS4_METAL_MPP_ATTN_OUT_DISABLE=1: still rms=0.150
- -mt auto + DS4_METAL_MPP_MOE_{GATE,UP,DOWN}_DISABLE=1: still rms=0.150
The Metal 4 cooperative-tensor Q8_0 matmul (kernel_mul_mm_q8_0_f32_mpp
and direct_rhs variants in dense.metal) is the *sole* drift source on
M5 Max vs the legacy simdgroup_multiply_accumulate path. The other
Tensor routes (F16 compressor, attention-output low projection, routed
MoE gate/up/down) are bit-clean against -mt off.
Flip ds4_gpu_mpp_q8_0_default_target() to return 0 when the device
name contains "M5". Other Tensor routes continue to default on, so the
Q8_0 carve-out preserves the bulk of the Metal Tensor speedup (F16
compressor at layers 0-19, MoE at layers 20+, attn-out at layers 32-42).
Users who care more about prefill throughput than bit-equivalence can
opt back in with DS4_METAL_MPP_Q8_0_ENABLE=1.
Verified on M5 Max with default flags only: -mt auto now produces
exactly the -mt off logits (rms=0, max_abs=0, max_rank_delta=0,
same_top1=yes, top5_overlap=5/5, top20_overlap=20/20).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
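A sketch of what the carve-out amounts to; ds4_gpu_mpp_q8_0_default_target and DS4_METAL_MPP_Q8_0_ENABLE are the names used in this commit, while the device-name accessor, the env-override placement, and the body are illustrative assumptions rather than the real ds4 source:

```c
/* Illustrative device-name carve-out for the Q8_0 cooperative-tensor route. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Stand-in for the Metal device name (e.g. "Apple M5 Max"). */
static const char *ds4_gpu_device_name(void) { return "Apple M5 Max"; }

static int ds4_gpu_mpp_q8_0_default_target(void) {
    /* Explicit opt-in wins over the carve-out. */
    if (getenv("DS4_METAL_MPP_Q8_0_ENABLE") != NULL) return 1;
    const char *name = ds4_gpu_device_name();
    /* Default the Q8_0 Tensor route off on M5-family devices, where it is
     * the sole source of logprob drift vs the legacy simdgroup path. */
    if (name != NULL && strstr(name, "M5") != NULL) return 0;
    return 1;   /* other devices keep the default-on behavior */
}

int main(void) {
    printf("q8_0 mpp default: %d\n", ds4_gpu_mpp_q8_0_default_target());
    return 0;
}
```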
MTLCompileOptions.fastMathEnabled defaults to YES and Apple's headers explicitly note this "may violate the IEEE 754 standard". With safe math forced via MTLMathModeSafe (macOS 15+) or fastMathEnabled=NO (deprecated fallback), drift between -mt off and -mt auto on M5 Max shrinks ~4x (rms 0.150 -> 0.037, max_abs 0.75 -> 0.19) -- showing that fast-math optimizations applied differently across the two hardware paths were amplifying the underlying matmul2d divergence. Default OFF: enabling safe math also moves -mt off away from the fast-math production reference (rms=0.63 vs original fast-math baseline) so it isn't a drop-in fix. Useful as a diagnostic to localize remaining drift sources and as an option for users who prefer strict IEEE-754 semantics over fast-math speed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit (75f0930) added the M5 carve-out by editing ds4_gpu_mpp_q8_0_default_target(), but that helper was also being reused as the default-target for ds4_gpu_use_mpp_f16_compressor_matmul (line 1363) and for the verbose memory-report banner that prints mpp_f16 (line 2102). That coupled F16 compressor default-on/off to the Q8 carve-out, which is wrong: the per-route bisection showed F16 is bit-clean on M5; only Q8 needed to flip default-off. Introduce a dedicated ds4_gpu_mpp_f16_default_target() that always returns 1 and use it at the two F16 call sites. The Q8 helper keeps its M5 carve-out unchanged. Verified on M5 Max with default flags: -mt auto still produces zero drift vs -mt off (rms=0, max_abs=0, max_rank_delta=0), and the F16 compressor Tensor route is now back to default-on on M5 as intended. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_metal_q8_0_mpp_matmul_case() built the reference output by calling ds4_gpu_matmul_q8_0_tensor() after ds4_gpu_set_quality(false). The set_quality(false) call enables MPP routing, and the dispatcher at ds4_metal.m:6277 then routes to ds4_gpu_matmul_q8_0_mpp_tensor() when the MPP can_use gate passes. So on M5 with Metal 4 tensor API enabled, the "reference" was actually the MPP output, and the test compared the MPP kernel to itself -- the max_abs/rms numbers were always near zero and any divergence in the MPP kernel itself would not have been caught. Force ds4_gpu_set_quality(true) around the reference call so the dispatcher takes the legacy simdgroup_multiply_accumulate path, then restore set_quality(false) before invoking ds4_gpu_matmul_q8_0_mpp_tensor() directly for the candidate. The reference and candidate now exercise the two different code paths the test was originally meant to compare. Verified on M5 Max: ./ds4_test --metal-kernels still passes, meaning the M5 cooperative-tensor Q8 matmul agrees with the legacy path within the 0.10 max-abs kernel target on the test shapes. The systemic drift in -mt auto comes from many small matmul deltas compounding through 43 layers, not from any single kernel exceeding the per-call threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
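In outline, the fix looks like the sketch below; the three function names come from this commit, but the signatures and stub bodies are placeholders rather than the real ds4_test code:

```c
/* Stand-ins for the entry points named in the commit, stubbed so the
 * sketch stays self-contained; only the call ordering is the point. */
static int g_quality = 0;
static void ds4_gpu_set_quality(int quality) { g_quality = quality; }
static void ds4_gpu_matmul_q8_0_tensor(float *out) { (void)out; /* dispatcher-routed */ }
static void ds4_gpu_matmul_q8_0_mpp_tensor(float *out) { (void)out; /* MPP kernel, direct */ }

static void build_reference_and_candidate(float *ref, float *cand) {
    /* Reference: force the legacy simdgroup path by enabling quality mode,
     * so the dispatcher cannot silently route to the MPP kernel. */
    ds4_gpu_set_quality(1);
    ds4_gpu_matmul_q8_0_tensor(ref);

    /* Candidate: restore MPP routing and invoke the MPP kernel directly. */
    ds4_gpu_set_quality(0);
    ds4_gpu_matmul_q8_0_mpp_tensor(cand);
}
```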
…bers
Two corrections triggered by another reviewer's audit:
1. The auto-suite description claimed "auto enables Q8_0 prefill ...";
on M5 that is no longer true now that 75f0930 defaults Q8_0 Tensor
off on M5. Reword the section so it lists F16 compressor, attn-out,
and MoE as the auto-enabled routes, then call out the M5 carve-out
for Q8_0 explicitly with the env-var opt-in.
2. Refresh worst-case suite numbers measured on the current branch
(codex/metal4-m5-drift-patches after the F16-coupling fix 78fa48f
and the test-self-reference fix 580e896) on M5 Max:
worst_rms = 0.169 (was documented ~= 0.170)
worst_top20_max_abs = 0.306 (was documented ~= 0.342)
worst_max_abs = 0.922
min_top5_overlap = 5/5
min_top20_overlap = 20/20 (was 19/20)
worst_rank_delta = 1
Three short fixtures (short_italian_fact, short_code_completion,
short_reasoning_plain) are now bit-exact (rms=0); the residual
drift is concentrated on the two long-context fixtures and comes
from the F16 compressor, attention-output, and routed-MoE Tensor
routes still being default-on, compounding small per-matmul
deltas through 43 layers.
The Q8_0 isolation paragraph also picks up the M5 default-off note
so the env-var docs stay consistent with the runtime behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 1a93973 to 49c1137
Status update

Quality perspective: Now the Tensor Metal Kernel and the Standard Metal Kernel are very similar from a mathematical perspective and generation-equivalent. Here is the comparison of each vs --quality:

Tensor Metal:

Default Metal (main branch):

Here is the test against the other M5 implementation, Swival ds4-m5.

From an eval perspective, the new slower Tensor Metal Kernel has improved AIME2025, reaching 90% in Q2 (unbelievable!)

Performance: Here the situation is clearly different from the previous version; there is still a benefit, but more in the range of ~15%.
@antirez I'll leave this as a draft and keep rebasing on main; I'll remove the drift-prone Q8_0 Tensor route and all related tests, docs, etc. When you get your M5 Max gifted by super @audreyt, I'm sure you'll be able to do even more. |
Summary
DS4_METAL_MPP_DISABLE=1, DS4_METAL_MOE_SUM6_DISABLE=1, staged routed MoE envs)

Benchmarks

Prompt source: README.md; command shape: ./ds4 --prompt-file <prompt> -n 1 --nothink --ctx 32768; 3 repeats per target.

8192-token routed MoE stage profile, current M5 default:
Disabling routed MoE MPP on the same profile drops prefill from 442.33 tok/s to 359.17 tok/s and raises gate/up/down to about 18.4-18.7 ms per layer.
Validation
make ds4 ds4_test
./ds4_test --metal-kernels
./ds4_test --long-context
./ds4_test --logprob-vectors

Notes

- up boundaries with gate=13 were tested but failed long-context; gate/up stay paired at layer 13.
- DS4_METAL_MOE_MID_F32=1 looked slightly faster in a noisy local check, but the result is too small to promote without a broader clean sweep.