feat(loader): support stock-recipe (Q8_0/F32) abliterated GGUFs end-to-end on Metal #60

Closed

audreyt wants to merge 8 commits into antirez:main from audreyt:support-q8_0-token-embd

Conversation

@audreyt audreyt commented May 10, 2026

What this changes

DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter without
per-tensor type overrides ship most of the small projections at Q8_0 (and the
routed-expert router at F32) where the antirez recipe keeps them at F16.

Examples: the cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF
models. On stock ds4 main these fail to load at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, …), and even
after the validators are relaxed, several Metal kernel paths read weight bytes
directly via offset arithmetic that hard-codes F16/F32 strides, producing
silently wrong output for Q8_0.

This PR makes the embed / HC / compressor / indexer / router validators and
the corresponding Metal kernel paths polymorphic, so the same GGUF loads and
runs on Metal end-to-end on audreyt/pi-ds4.

Validators (ds4.c)

  • New tensor_expect_dispatch_layout helper (sketched after this list)
    accepts F16, F32, or Q8_0 and is applied to every projection that flows
    through a type-dispatching matvec/matmul: output_hc_fn, hc_attn_fn,
    hc_ffn_fn, attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
    indexer_compressor_{ape,gate,kv}, ffn_gate_inp.
  • token_embd keeps its own inline F16/Q8_0 check because its CPU embed
    kernel doesn't go through matvec_any.
  • The two compressor decode-time guards (attn_compressor and
    indexer_compressor pair-projection paths) are relaxed from "F16 only" to
    "F16 or Q8_0, paired type must match".

CPU paths (ds4.c)

  • Refactor embed_token_f16 into an embed_token dispatcher; add
    embed_token_q8_0 (block-wise dequant of block_q8_0; see the sketch
    after this list).
  • Replace the remaining direct matvec_f16 / matvec_f16_serial callers
    (HC fn, output_hc_fn, ffn_gate_inp) with the existing matvec_any
    dispatcher; add matvec_any_serial for the HC pre/post path.
  • Add polymorphic Metal-side dispatch helpers metal_graph_matmul_plain_tensor
    and metal_graph_matmul_pair_plain_tensor (extended for Q8_0; the pair
    variant fuses with the existing F16-pair kernel when both tensors are
    F16, otherwise dispatches to two single matmuls). All 22 hardcoded
    ds4_metal_matmul_f16{,_pair}_tensor call sites in ds4.c (HC mix,
    attn/indexer compressors, indexer projections, output head, router)
    are converted to use these wrappers.
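
The Q8_0 embed path is small enough to sketch in full. This is a minimal
illustration assuming ggml's block_q8_0 layout (34 bytes per block: one fp16
scale followed by 32 int8 quants); all names are illustrative rather than the
actual ds4.c signatures:

```c
#include <stdint.h>
#include <string.h>

/* Decode raw fp16 bits to float (the role ds4_metal_half_bits_to_float
 * is described as playing below); handles normals, subnormals, signed
 * zero, and Inf/NaN. */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {                       /* Inf / NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {                    /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man != 0) {                    /* subnormal: renormalize */
        int e = -1;
        do { man <<= 1; e++; } while (!(man & 0x400u));
        bits = sign | ((uint32_t)(112 - e) << 23) | ((man & 0x3FFu) << 13);
    } else {
        bits = sign;                          /* signed zero */
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dequantize one Q8_0 embedding row (n must be a multiple of 32).
 * The embed_token dispatcher would switch on token_embd->type and call
 * this for type 8, the existing F16 copy for type 1. */
static void embed_token_q8_0(float *dst, const uint8_t *row, int n) {
    for (int b = 0; b < n / 32; b++) {
        const uint8_t *blk = row + (size_t)b * 34;   /* 34-byte block */
        uint16_t d_bits;
        memcpy(&d_bits, blk, 2);
        const float d = half_bits_to_float(d_bits);  /* fp16 scale */
        const int8_t *q = (const int8_t *)(blk + 2); /* 32 int8 quants */
        for (int j = 0; j < 32; j++)
            dst[b * 32 + j] = d * (float)q[j];
    }
}
```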

Metal kernels

  • metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
    dequantizes its source block on the fly; a CPU reference of the
    per-thread math follows this list).
  • metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for the
    multi-token F32 weight matmul that the F32 router path needs in prefill
    (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
  • metal/dsv4_kv.metal: a Q8_0 branch added to
    kernel_dsv4_compressor_store_one. Without this, the decode-time
    single-row compressor store treats Q8_0 ape as F32 and reads garbage.

Metal wiring (ds4_metal.m)

  • Register g_get_rows_q8_0_pipeline at init; clear at cleanup.
  • Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
    ds4_metal_encode_get_rows helper take a new weight_type parameter
    (GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
    weights->token_embd->type unchanged. ds4_metal_embed_row_layout picks
    the right per-row stride and pipeline.
  • ds4_metal_matmul_f32_tensor extended with a multi-token branch that
    dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing n_tok = 1
    path unchanged.
  • ds4_metal_encode_compressor_score_with_ape and the equivalent loop in
    ds4_metal_compressor_store_batch_tensor: for Q8_0 ape, dequantize on the
    CPU into a per-call private MTLBuffer and feed that into the existing
    add_f32_1d. Two new helpers (ds4_metal_half_bits_to_float,
    ds4_metal_cpu_dequant_q8_0_rows) implement the conversion; the CPU
    dequant matches gguf-py's dequantize reference byte-for-byte (verified
    in a standalone numeric check). A per-call buffer is required because
    multiple CPU writes to the previously-shared
    g_compressor_store_ape_buffer within one command buffer collapse to the
    last write at execute time (Metal kernels run in encode order, but CPU
    writes don't participate in that ordering when the same scratch is
    reused). The per-call buffer is retained until cb completion via
    addCompletedHandler because Metal does not strongly retain buffers
    bound to encoders.
  • Six ape_type validators relaxed to also accept 8 (Q8_0).
  • Six ape_bytes calculations centralized through a new
    ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the correct
    stride for F16 / F32 / Q8_0 (sketched after this list).
  • metal_graph_matmul_plain_tensor extended with a Q8_0 branch.
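
The stride helper is worth spelling out, since Q8_0 is the only non-obvious
case. A sketch assumed to mirror the real ds4_metal_ape_bytes (type codes as
above; ds4_die as before):

```c
/* Bytes occupied by n_elems elements of an APE tensor. For Q8_0, n_elems
 * is assumed to be a multiple of 32: each 34-byte block_q8_0 holds a
 * 2-byte fp16 scale plus 32 int8 quants. */
static size_t ds4_metal_ape_bytes(int ape_type, size_t n_elems) {
    switch (ape_type) {
    case 0:  return n_elems * 4;            /* F32 */
    case 1:  return n_elems * 2;            /* F16 */
    case 8:  return (n_elems / 32) * 34;    /* Q8_0 */
    default: ds4_die("ape_bytes: unsupported type %d", ape_type);
    }
    return 0;
}
```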

Why CPU dequant for Q8_0 ape (and not a Metal kernel)

I first wrote a kernel_cpy_q8_0_f32 Metal kernel using the same
block_q8_0 indexing pattern that the working dense Q8_0 matvec/matmul
kernels in metal/dense.metal use. It compiled cleanly but produced
silently wrong values for the actual compressor APE shapes (4 rows × 1024
cols of block_q8_0). I confirmed this by a side-by-side numeric check
against gguf-py's dequantize reference: my CPU dequant matches
byte-for-byte; the Metal kernel does not. I left kernel_cpy_q8_0_f32 in
metal/cpy.metal (its registration in ds4_metal.m is harmless) so that a
future debug session can pick it up; the compressor paths use the CPU
dequant as the active route.
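
For reference, the per-call buffer pattern from the wiring section above
looks roughly like this in Objective-C (ARC assumed; every name here is
hypothetical, and the CPU dequant helper is the one sketched earlier, not
the literal ds4_metal.m code):

```objc
#import <Metal/Metal.h>
#include <stdlib.h>

/* Dequantize a Q8_0 APE region on the CPU into a buffer created for this
 * one encode, then keep the buffer alive until the command buffer retires. */
static void encode_ape_as_f32(id<MTLDevice> dev,
                              id<MTLCommandBuffer> cb,
                              id<MTLComputeCommandEncoder> enc,
                              const uint8_t *ape_q8_0,
                              int rows, int cols, int buf_index) {
    size_t n = (size_t)rows * (size_t)cols;
    float *tmp = malloc(n * sizeof(float));
    cpu_dequant_q8_0_rows(ape_q8_0, tmp, rows, cols);  /* hypothetical */

    /* A fresh buffer per call: with one shared scratch, several CPU writes
     * inside the same command buffer would collapse to the last write by
     * the time the GPU executes the earlier encodes. */
    id<MTLBuffer> buf = [dev newBufferWithBytes:tmp
                                         length:n * sizeof(float)
                                        options:MTLResourceStorageModeShared];
    free(tmp);
    [enc setBuffer:buf offset:0 atIndex:buf_index];

    /* Capturing buf in the completion block retains it until the command
     * buffer finishes, covering the lifetime gap described above. */
    [cb addCompletedHandler:^(id<MTLCommandBuffer> done) { (void)buf; }];
}
```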

What this PR does not cover

  • The CPU MoE path (ds4.c:5198, 5291) still hardcodes
    if (gate_w->type != IQ2_XXS) ds4_die(...). That path is reference/debug
    per AGENT.md and the production Metal flow doesn't touch it. If
    something forces CPU fallback (Metal disabled, MTP without Metal, certain
    trace modes) on a stock-recipe Q8_0 GGUF, you'll see "expected IQ2_XXS
    expert tensors" until a Q8_0 dispatch is added there too. Out of scope
    for this PR.
  • No quantization changes, no recipe changes, no new GGUF formats. This is
    a loader/dispatcher change so existing GGUFs that happen to use the
    stock recipe become loadable.

Test matrix (macOS / M5 / Metal)

  • make ds4-server builds clean (one pre-existing -Wpointer-sign warning
    from the unrelated MoE path, not introduced by this PR).
  • Cyberneurova Q2_K GGUF entirely unmodified, default flags:
    21-token prompt → coherent generation
    ("An LLM, or Large Language Model, is a type of artificial intelligence").
    Without the compressor APE fix, this prompt generated a few coherent
    tokens then <BOS> token spam.
  • Pre-harmonized variant (token_embd / HC / compressor / indexer all
    F16): still works byte-for-byte the same as before, no F16/F32 path
    regressions.
  • make ds4-server builds clean across both branches.

audreyt added 2 commits May 10, 2026 06:22
DeepSeek-V4-Flash GGUFs produced by the upstream llama.cpp converter
without per-tensor type overrides ship most of the small projections at
Q8_0 (and routed-expert router weights at F32) where the antirez recipe
keeps them at F16. Examples include the cyberneurova abliterated GGUFs.
On stock ds4 main these fail loudly at the first F16-strict
validator (token_embd, then output_hc_fn, then hc_attn_fn, ...), and
even after the validators are relaxed, several Metal kernel paths read
weight bytes directly via offset arithmetic that hard-codes F16/F32
strides.

This change makes the embed/HC/compressor/indexer/router validators
*and* the corresponding Metal kernel paths polymorphic, so the same
GGUF loads and runs with no harmonizer step.

Validators (ds4.c):

  * New tensor_expect_dispatch_layout helper accepts F16, F32, or Q8_0
    and is applied to every projection that flows through a
    type-dispatching matvec/matmul: output_hc_fn, hc_attn_fn,
    hc_ffn_fn, attn_compressor_{ape,gate,kv}, indexer.{attn_q_b,proj},
    indexer_compressor_{ape,gate,kv}, ffn_gate_inp.
  * token_embd keeps its own inline F16/Q8_0 check because its CPU
    embed kernel doesn't go through matvec_any.
  * Two compressor decode-time guards (attn_compressor and
    indexer_compressor pair-projection paths) are relaxed from "F16 only"
    to "F16 or Q8_0, paired type must match".

CPU paths (ds4.c):

  * Refactor embed_token_f16 into an embed_token dispatcher; add
    embed_token_q8_0 (block-wise dequant of block_q8_0).
  * Replace the remaining direct matvec_f16 / matvec_f16_serial
    callers (HC fn, output_hc_fn, ffn_gate_inp) with the existing
    matvec_any dispatcher; add matvec_any_serial for the HC pre/post
    path.
  * Add polymorphic Metal-side dispatch helpers
    metal_graph_matmul_plain_tensor and metal_graph_matmul_pair_plain_tensor
    (extended for Q8_0; the pair variant fuses with the existing F16-pair
    kernel when both tensors are F16, otherwise dispatches to two single
    matmuls). All 22 hardcoded ds4_metal_matmul_f16{,_pair}_tensor call
    sites in ds4.c (HC mix, attn/indexer compressors, indexer projections,
    output head, router) are converted to use these wrappers.

Metal kernels:

  * metal/get_rows.metal: kernel_get_rows_q8_0 (one float per thread,
    dequantizes its source block on the fly).
  * metal/dense.metal: kernel_mul_mm_f32_f32 template instantiation for
    the multi-token F32 weight matmul that the F32 router path needs in
    prefill (mirrors the existing F16/Q8_0 mul_mm_t instantiations).
  * metal/cpy.metal: kernel_cpy_q8_0_f32 (dequantizing 1D copy used by
    the compressor APE byte-strided reader).

Metal wiring (ds4_metal.m):

  * Register g_get_rows_q8_0_pipeline and g_cpy_q8_0_f32_pipeline at
    init; clear them at cleanup.
  * Both ds4_metal_embed_{token,tokens}_hc_tensor and the shared
    ds4_metal_encode_get_rows helper take a new weight_type parameter
    (GGUF type code: 1=F16, 8=Q8_0). 8 callers in ds4.c forward
    weights->token_embd->type unchanged. ds4_metal_embed_row_layout
    picks the right per-row stride and pipeline.
  * ds4_metal_matmul_f32_tensor extended with a multi-token branch
    that dispatches to kernel_mul_mm_f32_f32 (n_tok > 1); existing
    n_tok = 1 path unchanged.
  * ds4_metal_encode_compressor_score_with_ape and the equivalent loop
    in ds4_metal_compressor_prefill_tensor add a Q8_0 branch
    (ds4_metal_encode_cpy_q8_0_f32_1d) and use a per-row stride that
    accounts for the block_q8_0 layout.
  * Six ape_type validators relaxed to also accept 8 (Q8_0).
  * Six ape_bytes calculations centralized through a new
    ds4_metal_ape_bytes(ape_type, n_elems) helper that returns the
    correct stride for F16/F32/Q8_0.
  * metal_graph_matmul_plain_tensor extended with a Q8_0 branch.

Tested on macOS / M-series / Metal:

  * make ds4-server builds clean (no new warnings).
  * Cyberneurova Q2_K GGUF entirely unmodified: loads, prefill +
    decode through to coherent generation ("PASS" returned for the
    "reply with the single word PASS" prompt).
  * Pre-harmonized variant (token_embd / hc / compressor / indexer all
    F16, ffn_gate_inp F16): still works byte-for-byte the same as
    before this change, no F16 path regressions.

Caveat for reviewers running ivanfioravanti's M5 PR (antirez#15) on top of
this: the unmodified cyberneurova file generates garbage (BOS spam)
when MPP F16 prefill is engaged, but produces coherent output with
DS4_METAL_MPP_F16_DISABLE=1. The garbage is reproducible from antirez#15's
MPP path alone and is independent of the changes here; it surfaces only
because this PR makes the Q8_0 file loadable in the first place.
This PR's loader changes accept Q8_0 `*compressor_ape*` weights at the
validator level, but two follow-on Metal paths still treat them as F16
(or fall through to F32) and produce silently wrong output, which shows
up as <BOS>-token spam in generation for any prompt long enough to
exercise the multi-token compressor path on M-series hardware.

1. `kernel_cpy_q8_0_f32` (added in this PR for the prefill APE
   byte-strided dequant) compiles cleanly and follows the same
   block_q8_0 indexing pattern used by other working Q8_0 kernels in
   dense.metal, but emits silently wrong values for the actual ape
   shapes (4 rows x 1024 cols of block_q8_0).  Confirmed by isolating
   the kernel: a CPU-side dequant of the same byte region matches
   gguf-py's `dequantize` reference byte-for-byte, while the Metal
   kernel's output is wrong.

2. `kernel_dsv4_compressor_store_one` (decode-time single-row store
   in metal/dsv4_kv.metal): only handled `ape_type == 1` (F16) and
   fell through to F32 for everything else, so Q8_0 ape was reading
   garbage at decode time.

Fix:

* Replace the prefill APE Q8_0 path in
  `ds4_metal_encode_compressor_score_with_ape` and
  `ds4_metal_compressor_store_batch_tensor` with a CPU-side dequant
  via two new helpers (`ds4_metal_half_bits_to_float` and
  `ds4_metal_cpu_dequant_q8_0_rows`) into a *per-call* private
  MTLBuffer.  A per-call buffer is required because multiple CPU writes
  to the previously-shared `g_compressor_store_ape_buffer` within one
  command buffer collapse to the last write at execute time (Metal
  kernels run in encode order, but CPU writes don't participate in that
  ordering when the same scratch is reused).  The per-call buffer is
  retained until cb completion via `addCompletedHandler` because Metal
  does not strongly retain buffers bound to encoders.
* Add a Q8_0 branch to `kernel_dsv4_compressor_store_one` that walks
  block_q8_0 layout (uint16_t scale + 32 int8 quants per 34-byte block)
  inline.

The buggy `kernel_cpy_q8_0_f32` Metal kernel is left in place but is
no longer reached from the compressor paths; its registration in
ds4_metal.m is harmless and a future debug session can either fix it
or drop it.

Tested on macOS / M-series / Metal:

* make ds4-server builds clean (one pre-existing -Wpointer-sign warning
  from the unrelated MoE path).
* Cyberneurova Q2_K GGUF entirely unmodified, default flags:
  21-token prompt -> coherent generation
  ("An LLM, or Large Language Model, is a type of artificial intelligence").
  Previously this prompt generated a few coherent tokens then <BOS>
  token spam.
* Pre-harmonized variant (token_embd / hc / compressor / indexer all
  F16): still works byte-for-byte the same as before this fix; no F16
  / F32 path regressions.
@audreyt audreyt marked this pull request as draft May 10, 2026 12:36
@audreyt audreyt marked this pull request as ready for review May 10, 2026 18:13
The decode-time indexer code at metal_graph_encode_decode_layer (ds4.c:9082-9095)
still has two F16-only validators on indexer_attn_q_b and indexer_proj that I
missed in the initial loader pass.

These validators only fire after `g->layer_n_comp[il] > decode_top_k` — i.e.
once the compressor has accumulated more rows than the decode-time top-k.
For short generations the path isn't reached; for ~400+ token generations
on stock-recipe (Q8_0) GGUFs the validator trips and the request finishes
with finish_reason="error" / "Metal decode failed".

The downstream calls already use metal_graph_matmul_plain_tensor (which
dispatches to ds4_metal_matmul_q8_0_tensor for Q8_0). The loader-time
validator at line 2211-2212 already uses tensor_expect_dispatch_layout,
which accepts F16/F32/Q8_0. Only these runtime guards were stuck on F16.

Reproducer (cyberneurova Q2_K, default flags): a "write a long story"
prompt that generates ~800 tokens hits the validator after ~400 tokens
and the request errors out. After this fix, the same prompt streams 800+
tokens cleanly.
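
The shape of the fix is just swapping the stale guard for the loader's
dispatch check; illustratively (names assumed, not the literal ds4.c diff):

```c
/* Before: decode-time guard stuck on F16. */
if (w_indexer_q_b->type != 1 /* F16 */)
    ds4_die("Metal graph indexer q projection expects F16 weights");

/* After: defer to the same check the loader applies at line 2211, since
 * the downstream call is metal_graph_matmul_plain_tensor either way. */
tensor_expect_dispatch_layout("indexer_attn_q_b", w_indexer_q_b->type);
```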

audreyt commented May 10, 2026

ds4: Metal graph indexer q projection expects F16 weights

Fixed in c2144e5!

@audreyt audreyt changed the title from "feat(loader): support stock-recipe (Q8_0/F32) GGUFs end-to-end on Metal" to "feat(loader): support stock-recipe (Q8_0/F32) abliterated GGUFs end-to-end on Metal" May 10, 2026
audreyt and others added 2 commits May 10, 2026 14:44
The two callers of ds4_metal_encode_cpy_q8_0_f32_1d were removed in 79b08bb
(switched to CPU-side dequant to avoid an encode-time race on the shared
compressor scratch buffer), leaving the function unused and tripping
-Wunused-function on stock Make builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

audreyt commented May 10, 2026

There is a small, likely cosmetic warning while compiling:

Fixed and synced from main. Ready for review from @antirez.

audreyt commented May 11, 2026

I cannot rebase this PR branch cleanly against current main (ae302c2).

It works for me — would you like to try audreyt:main?

audreyt commented May 11, 2026

Apologies for the confusion — yes, my main is current, but gh pr checkout 60 && git rebase main isn't quite the right path to it, and that's on me for not being more explicit.

The PR branch head (147b0d7) already merges current origin/main (ae302c2, "Project renamed to DwarfStar 4."); the three merge commits in the PR history are where I resolved the conflicts from the recent backend refactor (ds4_metal.h → ds4_gpu.h, CUDA support landing). git rebase main discards those merges and replays the original 24022c2 (written against the pre-refactor layout) onto today's main, which reintroduces every conflict I already resolved.

So the branch, as gh pr checkout 60 leaves it, should just build:

gh pr checkout 60
make ds4-server

Thank you for taking the time to test this — really appreciate the careful repro.

fry69 commented May 11, 2026

Ah, sorry for this confusion.

Merge commits always get me, I forget about them, sorry again.

@antirez antirez added the "weights loading or handling" label May 11, 2026
@audreyt audreyt force-pushed the support-q8_0-token-embd branch from 63cb942 to e8f5bdc May 12, 2026 12:48

audreyt commented May 12, 2026

I do not know if it is a good idea to widen this PR, e.g. incorporating unrelated things like the Responses API or the count_tokens endpoint.

Indeed, thank you for the catch. Those are now factored out into PRs #90 and #91.

audreyt commented May 13, 2026

Closing — the motivating case (loading the cyberneurova abliterated stock-recipe Q2_K GGUF on main Metal) is now resolved at the model layer instead of the loader layer.

I've published https://huggingface.co/audreyt/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF#audrey-tang-ds4-re-quants — antirez-style q2-imatrix and q4-imatrix re-quants of the cyberneurova abliterated weights, with F16 token embedding so the stock ds4 loader takes them directly, no support-q8_0-token-embd workaround needed. The recipe matches ds4flash.gguf byte-for-byte except for the abliterated routed experts.

This keeps the loader narrow (one canonical recipe, matching the README's "intentionally narrow" framing) and pushes behavioral surface — abliteration character, hedge register, etc. — onto the model and runtime layers (re-quants for the model side, dir-steering for the runtime side). Closing #60 makes that separation explicit and avoids carrying a 600+-line loader/Metal patch through every backend refactor (compressor APE, M5 simdgroup, MTL4) when the use case it addressed has evaporated.

audreyt/pi-ds4 has switched its download_model.sh to the new IQ2XXS-w2Q2K imatrix re-quant (~87 GB) and now runs end-to-end on stock main minus only ivanfioravanti's PR #15.

Big thanks to @fry69 for the careful reproductions on both the long-prompt indexer crash and the rebase-conflict reports — you kept this honest.

@audreyt audreyt closed this May 13, 2026

fry69 commented May 13, 2026

I can confirm that this model works perfectly with the current main branch:

$ cd gguf
$ hf download audreyt/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
  cyberneurova-DeepSeek-V4-Flash-abliterated-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
  --local-dir .
$ cd ..
$ rm -f tmp/ds4-kv/* && ./ds4-server --ctx 100000 --kv-disk-dir ./tmp/ds4-kv --kv-disk-space-mb 8192 --port 28000 -m ./gguf/cyberneurova-DeepSeek-V4-Flash-abliterated-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf 
ds4: Metal device Apple M5 Max, 128.00 GiB RAM
ds4: requesting Metal residency (may take tens of seconds)... done
ds4: warming Metal model views... done
ds4: Metal model views created in 2.206 ms, residency requested in 8366.213 ms, warmup 3.422 ms (mapped 82697.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: metal backend initialized for graph diagnostics
0513 07:18:45 ds4-server: context buffers 1896.58 MiB (ctx=100000, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=25002)
0513 07:18:45 ds4-server: KV disk cache ./tmp/ds4-kv (budget=8192 MiB, cross-quant=accept, min=512, cold_max=30000, continued=10000, trim=32, align=2048)
0513 07:18:45 ds4-server: listening on http://127.0.0.1:28000
0513 07:19:12 ds4-server: chat ctx=0..1205:1205 prompt start
0513 07:19:16 ds4-server: chat ctx=0..1205:1205 prompt done 4.160s
0513 07:19:16 ds4-server: kv cache stored tokens=1205 trimmed=0 reason=cold size=38.68 MiB save=16.6 ms
0513 07:19:17 ds4-server: chat ctx=0..1205:1205 gen=50 THINKING decoding chunk=36.45 t/s avg=36.45 t/s 1.372s
[...]
0513 07:19:27 ds4-server: chat ctx=0..1205:1205 gen=400 THINKING decoding chunk=37.40 t/s avg=37.23 t/s 10.744s
0513 07:19:28 ds4-server: chat ctx=0..1205:1205 gen=450 decoding chunk=37.22 t/s avg=37.23 t/s 12.088s
[...]
0513 07:21:40 ds4-server: chat ctx=0..1205:1205 gen=4551 decoding chunk=28.88 t/s avg=31.52 t/s 144.368s
0513 07:21:40 ds4-server: thinking checkpoint canonicalization needs rebuild ctx=0..1205:1205 common=1204 live=5756 canonical=5345 reason="rewrite needs rebuild: common=1204 live=5756 canonical=5345"
0513 07:21:57 ds4-server: thinking checkpoint canonicalized ctx=0..1205:1205 common=1204 live=5756 canonical=5345 via=rebuild
0513 07:21:57 ds4-server: chat ctx=0..1205:1205 gen=4551 finish=stop 165.557s
