
Pooling VRAM between Macs #83

Open
leedrake5 wants to merge 7 commits into antirez:main from leedrake5:feat/rpc-pipeline-parallel

Conversation


@leedrake5 commented May 12, 2026

I added a pipeline-parallel RPC: split DS-V4-Flash across two Macs

Performance on the measured 2-Mac setup (M1 Ultra head + M3 Max tail, TB4 bridge, Q4):

no MTP:               short ~11.5 t/s,  long ~11.5 t/s   -- baseline
batched MTP verify:   short  12.1 t/s,  long  10.1 t/s   -- a wash on long
adaptive prefetch:    short  13.2 t/s,  long   9.4 t/s   -- +15% short, -18% long

This adds an opt-in second-machine path for inference: a head process owns layers [0, L_mid) and ships the residual stream over TCP to a tail worker that owns [L_mid, DS4_N_LAYER) and runs the output head. The motivating use case is two M-series Macs with 128 GB each linked by Thunderbolt 4, so the Q4 quant (needs ~256 GB) can run pooled across both unified-memory pools.

Single-host operation is unchanged. With neither --rpc-peer nor a partial n_layer_end set, every code path falls through to the original single-engine logic, and existing tests produce bit-identical results.

Design

  • One TCP connection per head/tail pair (1:1).
  • Frame format: u32 length, u8 op, u16 reserved, payload; little-endian throughout.
  • Handshake validates model fingerprint (file size + first 32 bytes), DS4 shape constants, routed-quant bits, ctx_size, and the split point. A mismatch fails the connection with a human-readable reason rather than producing wrong output silently.
  • Decode hot path: head's local Metal encode runs head's layer slice and stops at cur_hc; that tensor (DS4_N_HC * DS4_N_EMBD * 4 bytes = 64 KB/token) goes to the tail. The tail imports it, skips embed, runs its slice, runs the output projection, and ships back DS4_N_VOCAB logits.
  • Prefill under RPC uses a batched OP_PREFILL_REQ that ships an entire chunk's batch_cur_hc per call (Phase 5a), ~19x faster than the original per-token fallback on long prompts. Single OP_REWIND op (Phase 5b) lets the head rewind the tail's KV by N positions without a full RESET.
  • MTP is supported (Phase 5c/d): the worker can load --mtp and produce drafts that ride along on every DECODE_REPLY. Verification is all-or-nothing in one batched VERIFY_BATCH RPC. Whether this is a perf win depends on hit rate; see "MTP / prefetch" below.
  • Disk KV writes are skipped under RPC (cross-host coordination is a follow-up).
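The frame layout above can be sketched as a pair of helpers. This is a minimal illustration of the stated header (u32 length, u8 op, u16 reserved, little-endian), not the PR's actual code; details such as whether the length field covers the header are assumptions.

```c
#include <stdint.h>

/* Illustrative 7-byte frame header: u32 length, u8 op, u16 reserved,
 * all little-endian, followed by the payload.  Whether "length" counts
 * the header itself is an assumption here. */
#define DS4_RPC_HDR_LEN 7

static void frame_hdr_put(uint8_t hdr[DS4_RPC_HDR_LEN],
                          uint32_t payload_len, uint8_t op) {
    /* explicit byte-by-byte little-endian serialization, so the host's
     * endianness never matters */
    hdr[0] = (uint8_t)(payload_len & 0xff);
    hdr[1] = (uint8_t)((payload_len >> 8) & 0xff);
    hdr[2] = (uint8_t)((payload_len >> 16) & 0xff);
    hdr[3] = (uint8_t)((payload_len >> 24) & 0xff);
    hdr[4] = op;
    hdr[5] = 0;   /* reserved */
    hdr[6] = 0;
}

static uint32_t frame_hdr_len(const uint8_t hdr[DS4_RPC_HDR_LEN]) {
    return (uint32_t)hdr[0] | ((uint32_t)hdr[1] << 8) |
           ((uint32_t)hdr[2] << 16) | ((uint32_t)hdr[3] << 24);
}
```

Explicit shifts rather than a struct cast avoid any padding/endianness surprises across the two machines.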

Files

ds4.h: public surface for layer range, RPC peer config, residual export/import, batch residual export, shape accessors, logits accessor, model_path/mtp_path resolvers, MTP draft accessors.
ds4.c:
  • engine + weights structs gain layer range fields; weights_bind and weights_validate_layout honor the range so a tail worker doesn't need head's token_embd and vice versa (tail still binds token_embd when --mtp is set, since MTP-head reads it).
  • all five forward sites (Metal decode, Metal prefill in both branches, CPU decode, CPU prefill) gate their loop and the embed/output bookends on the range.
  • weights_compute_byte_clusters maps only the relevant ranges of the GGUF into Metal (head's range is non-contiguous when token_embd lives at file end, so the head gets ~71 GiB + ~4 GiB as two disjoint clusters instead of mapping the whole 164 GB Q4 file); a hard guard refuses to start if total mapped > 93.75% of physical RAM.
  • instance lock is refcounted so a daisy-chain test can hold it twice in one process.
  • ds4_engine_open dials and handshakes a peer when configured and captures the peer's MTP capability.
  • ds4_session_eval_internal ships the residual and receives logits when a peer is attached, with an optional head-side prefetch path (see "MTP / prefetch").
  • ds4_session_{invalidate,rewind,free} drain any in-flight speculation before issuing the network op, so request boundaries don't leak a pending DECODE_REPLY into the next request's RESET stream.
ds4_rpc.{h,c}: ~1500 lines of new wire code. Plain POSIX sockets, no threads. Ops: HELLO_CLIENT/SERVER (handshake), DECODE_REQ/REPLY, PREFILL_REQ/REPLY, RESET, REWIND, MTP_TRIM, VERIFY_BATCH, SHUTDOWN. Op dispatch on the tail uses MSG_PEEK so each *_recv helper can read its own header. SO_RCVTIMEO is 300s by default (override via DS4_RPC_RECV_TIMEOUT_SECS); the head closes its peer and NULLs rpc_peer on any RPC error so a tail crash surfaces as a clean error rather than indefinite hang.
ds4_rpc_worker.c: ~500 lines. ds4-rpc-worker binary: parses --layer-start/-end/-listen/--port/--ctx/--routed-quant-bits/--mtp [FILE]/--mtp-draft N/--mtp-margin F/--quant q2|q4; opens a partial-range engine; accepts one head connection; runs the serve loop with all opcodes dispatched. Exits cleanly on SHUTDOWN or when the head disconnects.
ds4_cli.c, ds4_server.c: --rpc-peer host:port and --rpc-split L on both binaries; --quant q2|q4 to pick the canonical model file in ./gguf/. Default port 46434.
tests/ds4_test.c: new --pipeline-daisy-chain test. Opens three engines back-to-back in one process under the refcounted lock: full [0, 43), head [0, 21), tail [21, 43). Tokenizes "Hello", runs one decode on each, daisy-chains residual from head to tail, compares top logprobs. Tolerates small float drift from different Metal command-buffer split points; asserts top-3 token agreement and max logit diff < 0.1.
Makefile: ds4_rpc.o is part of CORE_OBJS; new ds4-rpc-worker target.
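The MSG_PEEK dispatch mentioned for ds4_rpc.c can be sketched roughly as below. This is an illustration of the technique, not the PR's code: the helper name and the op byte's offset (header is u32 length, then u8 op) are assumptions from the frame description above.

```c
#include <stdint.h>
#include <sys/socket.h>

/* Peek the 7-byte frame header without consuming it, so the serve loop
 * can switch on the op byte and then hand the socket to the matching
 * *_recv helper, which re-reads header + payload itself. */
static int peek_op(int fd, uint8_t *op_out) {
    uint8_t hdr[7];
    for (;;) {
        ssize_t n = recv(fd, hdr, sizeof hdr, MSG_PEEK);
        if (n <= 0) return -1;               /* error or peer closed */
        if ((size_t)n == sizeof hdr) break;  /* full header is visible */
        /* partial header buffered so far; real code would poll/select
         * here instead of spinning */
    }
    *op_out = hdr[4];   /* op byte follows the u32 length field */
    return 0;
}
```

The appeal of MSG_PEEK here is that each opcode handler stays self-contained: dispatch never consumes bytes the handler expects to read.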

Single-host invariant

When opt->rpc_peer_host is unset and opt->n_layer_end normalizes to DS4_N_LAYER, every gated branch falls into the original code path:
- weights_bind requires every layer's tensors and the full output stack, same as before
- the per-layer forward loops iterate [0, DS4_N_LAYER)
- the embed step runs, the output head runs, logits are read
- ds4_session_eval_internal skips the RPC block
- ds4_session_sync takes the original prefill paths
./ds4_test --server and --metal-kernels pass.
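The invariant above boils down to a few range predicates. A minimal sketch, with field names taken from the text (n_layer_end, DS4_N_LAYER) but a struct shape that is an assumption, not ds4.c's actual layout:

```c
/* Sketch of the layer-range gating: the forward loop iterates
 * [n_layer_begin, n_layer_end), the embed step runs only when the
 * range starts at 0, and the output head runs only when it ends at
 * DS4_N_LAYER.  A full-range engine therefore behaves exactly like
 * the original single-host path. */
#define DS4_N_LAYER 43

typedef struct {
    int n_layer_begin;  /* 0 on the head and on a single host */
    int n_layer_end;    /* normalizes to DS4_N_LAYER when unset */
} engine_range;

static int runs_embed(const engine_range *e)       { return e->n_layer_begin == 0; }
static int runs_output_head(const engine_range *e) { return e->n_layer_end == DS4_N_LAYER; }
static int is_single_host(const engine_range *e)   { return runs_embed(e) && runs_output_head(e); }
```

Under this framing, the full engine [0, 43), head [0, 21), and tail [21, 43) from the daisy-chain test are just three values of the same struct.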

MTP / prefetch (opt-in, env-gated)

The head's eval_internal has an optional Phase 6 path: speculatively run head's L0-L20 on the previous reply's drafts[0] and ship the spec request eagerly so the tail starts processing while the head is doing its sample-and-loop bookkeeping. On hit (~67% in measured data) the in-flight spec was right and head skips the real L0-L20 for the next token. On miss the head drains the spec reply, sends REWIND to roll the tail's KV back by one position, restores its own KV from a snapshot, and falls through to the synchronous path. A 32-cycle sliding window of hit/miss outcomes (Phase 6.7 adaptive) triggers a 32-cycle cooldown when hit rate drops below 50%, so worst-case workloads don't keep paying the miss tax.
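The Phase 6.7 adaptive gate described above (32-cycle hit/miss window, 32-cycle cooldown below 50% hit rate) can be sketched as follows. The bitmask representation and function names are assumptions for illustration; ds4.c may track the window differently.

```c
#include <stdint.h>

/* 32-cycle sliding window of prefetch outcomes (1 = hit).  When a full
 * window's hit rate drops below 50%, speculation pauses for a 32-cycle
 * cooldown, then restarts with a fresh window. */
typedef struct {
    uint32_t window;    /* last 32 outcomes, newest in bit 0 */
    int      filled;    /* outcomes recorded so far, capped at 32 */
    int      cooldown;  /* cycles left before speculation resumes */
} spec_gate;

/* Called once per decode cycle: may we issue a speculative request? */
static int spec_gate_allowed(spec_gate *g) {
    if (g->cooldown > 0) { g->cooldown--; return 0; }
    return 1;
}

/* Called after each speculative cycle resolves. */
static void spec_gate_record(spec_gate *g, int hit) {
    g->window = (g->window << 1) | (hit ? 1u : 0u);
    if (g->filled < 32) g->filled++;
    if (g->filled == 32 && __builtin_popcount(g->window) * 2 < 32) {
        g->cooldown = 32;       /* hit rate below 50%: back off */
        g->filled = 0;          /* fresh window after the cooldown */
        g->window = 0;
    }
}
```

The point of the cooldown is bounding the miss tax: on a persistently unfriendly workload the head only pays for one bad window per 64 cycles rather than every cycle.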

Here's the performance on the measured 2-Mac setup (M1 Ultra head + M3 Max tail, TB4 bridge, Q4):

phase               prompt    prefill t/s    decode t/s    wall (s)
------------------  ------    -----------    ----------    --------
per-token prefill       35            9.3          19.7        16.7
                      7497            9.7           9.1       795.9
batched prefill         35           10.3          19.6        16.4
                      7497          181.8          15.7        48.7
batched MTP verify      35            9.9          12.1        24.6
                      7497          152.9          10.7        68.4
prefetch                35            9.7          11.9        25.1
                      7497          145.8           9.5        63.4
adaptive prefetch       35            9.7          13.2        23.0
                      7497          127.8           9.4        70.9

Short = 35-token canned chat prompt, 256-token completion. Long = 7497-token prompt, 115-209 tokens of completion. Temperature 0; outputs are deterministic across runs. Decode t/s is the only knob the engine controls per token; prefill variance on the long row reflects warmup state (cold map + Metal residency), not algorithmic differences -- 5a's batched prefill landed the ~19x prefill win and that holds across 5d-6.

Not a fan of the MTP results. In general, prefetch wins when MTP's recursive predictions are accurate (chat, casual generation) and loses when they're not (code, technical text). Adaptive cooldown caps the damage but can't fully recover. An async-reader-thread variant (depth-1 with 2-step lookahead via drafts[1]) was attempted and rolled back -- the per-RPC threading overhead on macOS ate the theoretical 10% gain and added a 40% regression on short prompts. Maybe someone smarter than me can figure it out.

Operational rule:
- Interactive chat / short prompts: export DS4_RPC_PREFETCH=1
- Long-context analysis: unset DS4_RPC_PREFETCH
Same binaries either way.

--quant flag: pass q2 or q4 to specify the model. Default auto-detects the canonical file in ./gguf/ (prefers Q2 if both present).

Usage examples

Two-Mac setup (head on this box, tail on the MBP, TB4 bridge):

# On the tail (MBP), in ds4 repo root, with ./gguf/ populated:
./ds4-rpc-worker --layer-start 21 --layer-end 43 --port 7400 \
                 --quant q4 --ctx 65536 \
                 --mtp gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf --mtp-draft 2

# On the head, also in ds4 repo root:
./ds4-server --quant q4 --rpc-peer 169.254.48.177:7400 --rpc-split 21 \
             --mtp --ctx 65536
# (optionally prefix with DS4_RPC_PREFETCH=1 for short-prompt speedup)

Single-prompt run (file in, file out):

./ds4 --quant q4 --rpc-peer 169.254.48.177:7400 --rpc-split 21 \
      --ctx 65536 --prompt-file my_prompt.txt > my_output.txt

Or via the OpenAI-compatible server:

PROMPT=$(jq -Rs . < my_prompt.txt)
curl -sN http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"deepseek-v4-flash\",\"messages\":[{\"role\":\"user\",\"content\":$PROMPT}],\"max_tokens\":8192,\"temperature\":0,\"thinking\":{\"type\":\"disabled\"}}" \
  | jq -r '.choices[0].message.content' > my_output.txt

Benchmarking:

python3 bench-ds4.py --long /tmp/long-bench-prompt.txt
# Prints prefill and decode t/s for a short canned prompt and one long prompt of choice.

Env vars (all opt-in):

DS4_RPC_PREFETCH=1          enable Phase 6 head speculation
DS4_RPC_SPEC_DEBUG=1        verbose per-cycle hit/miss/cooldown logging (stderr)
DS4_MTP_SPEC_LOG=1          per-decision MTP accept/miss log lines
DS4_RPC_RECV_TIMEOUT_SECS=N override the 300s socket recv timeout

leedrake5 added 7 commits May 11, 2026 09:54
This adds an opt-in second-machine path for inference: a head process owns layers [0, L_mid) and ships the residual stream over TCP to a tail worker that owns [L_mid, DS4_N_LAYER) and runs the output head. The motivating use case is two M-series Macs with 128 GB each linked by Thunderbolt 4, so the Q4 quant (needs ~256 GB) can run pooled across both unified-memory pools.

Single-host operation is unchanged. With neither --rpc-peer nor a partial n_layer_end set, every code path falls through the original single-engine logic and existing tests are bit-identical.

Design

  - One TCP connection per head/tail pair (1:1).
  - Frame format: u32 length, u8 op, u16 reserved, payload; little-endian throughout.
  - Handshake validates model fingerprint (file size + first 32 bytes), DS4 shape constants, routed-quant bits, and the split point.  A mismatch fails the connection with a human-readable reason rather than producing wrong output silently.
  - Decode hot path: head's local Metal encode runs head's layer slice and stops at cur_hc; that tensor (DS4_N_HC * DS4_N_EMBD * 4 bytes = 64 KB/token) goes to the tail.  The tail imports it, skips embed, runs its slice, runs the output projection, and ships back DS4_N_VOCAB logits.
  - Prefill under RPC uses a per-token decode loop fallback (correct but O(n_tokens) round trips; ~0.5 s extra latency for ~1000-token prompts on TB4).  A real batch-prefill op is a Phase 5 follow-up.
  - MTP draft and disk KV writes are disabled under RPC for now; same reason -- they need their own wire ops.

Files

  ds4.h: public surface for layer range, RPC peer config, residual export/import, shape accessors, logits accessor.
  ds4.c: engine + weights structs gain layer range fields; weights_bind and weights_validate_layout honor the range so a tail worker doesn't need head's token_embd and vice versa; all five forward sites (Metal decode, Metal prefill in both branches, CPU decode, CPU prefill) gate their loop and the embed/output bookends on the range; instance lock is refcounted so a daisy-chain test (or the head+MTP draft pair) can hold it twice in one process; ds4_engine_open dials and handshakes a peer when configured; ds4_session_eval_internal ships the residual and receives logits when a peer is attached; ds4_session_{invalidate,rewind} propagate RESET; ds4_session_sync falls back to per-token under RPC for cold and warm paths.
  ds4_rpc.{h,c}: new files, ~550 lines.  Plain POSIX sockets, no threads.  Magic preamble, length-prefixed frames, config handshake, decode req/reply, reset req/reply, shutdown.  Op dispatch on the tail uses MSG_PEEK so each *_recv helper can read its own header.
  ds4_rpc_worker.c: new file, ~330 lines.  ds4-rpc-worker binary: parses --layer-start/-end/--listen/--port/--ctx/--routed-quant-bits, opens a partial-range engine, accepts one head connection, runs the decode/reset/shutdown serve loop, exits cleanly on SHUTDOWN.
  ds4_cli.c, ds4_server.c: --rpc-peer host:port and --rpc-split L on both binaries.  Default port 46434.
  tests/ds4_test.c: new --pipeline-daisy-chain test.  Opens three engines back-to-back in one process under the refcounted lock: full [0, 43), head [0, 21), tail [21, 43).  Tokenizes "Hello", runs one decode on each, daisy-chains residual from head to tail, compares top logprobs.  Tolerates small float drift from different Metal command-buffer split points; asserts top-3 token agreement and max logit diff < 0.1.
  Makefile: ds4_rpc.o is part of CORE_OBJS; new ds4-rpc-worker target.

Single-host invariant

  When opt->rpc_peer_host is unset and opt->n_layer_end normalizes
  to DS4_N_LAYER, every gated branch falls into the original code
  path:
    - weights_bind requires every layer's tensors and the full output stack, same as before
    - the per-layer forward loops iterate [0, DS4_N_LAYER)
    - the embed step runs, the output head runs, logits are read
    - ds4_session_eval_internal skips the RPC block
    - ds4_session_sync takes the original prefill paths
  ./ds4_test --server and --metal-kernels pass.  --logprob-vectors and --tool-call-quality are good additional gates once the global ds4 instance lock is available (an external ds4-server holds it on the dev box right now).

--quant flag: pass q2 or q4 to specify the model
Adding --quant flags to the rpc worker
Can pool share across a thunderbolt bridge
Works, but not the best time savings.
@antirez (Owner)

antirez commented May 12, 2026

I find this interesting but will have to consider it carefully, thanks for providing the work.
However: did you see the benchmarks in the FP4 issue? It looks like 2 bit provides a lot of the value of the 4 bit quants.

@leedrake5 (Author)

> I find this interesting but will have to consider it carefully, thanks for providing the work. However: did you see the benchmarks in the FP4 issue? It looks like 2 bit provides a lot of the value of the 4 bit quants.

Any tests you recommend I run to evaluate Q2 vs. Q4 quality? I agree the Q2 is great and quick. My hope was that Q4 would be a sizable jump in quality worth the effort of RPC implementation - running a literary criticism test now on a doc to evaluate for my own purposes. But can easily throw some riddles or other logic questions to try and shake out the difference between the two if you have any in mind.

@leedrake5 (Author)

Going through some tests, Q4 > Q2 on logic, but so far the two are quite comparable for coding and more qualitative work. Here's the logic question:

Six paintings labeled A, B, C, D, E, F are hung in a single row, positions 1 (leftmost) through 6 (rightmost). Each painting occupies exactly one position; positions are not repeated.

Constraints:

  1. A is strictly to the right of D but strictly to the left of E.
  2. F is strictly to the right of B.
  3. The painting immediately to the right of C is either A or E (so C cannot be in position 6).
  4. D and A are not adjacent (their positions differ by more than 1).
  5. B is at one end of the row (position 1 or position 6).

Determine the exact left-to-right ordering. Reason one position at a time: each step must cite which numbered constraint forces that placement. Where a step requires case analysis, walk through every case and show where the contradiction arises. Do not guess; do not skip steps. Conclude with the final ordering as a single line, e.g. "Final ordering: X Y Z W V U".

Both correctly deduced B=1 from constraints 2+5 and started enumerating D/A/E triples. But that's where the similarity ends.

  • Q2 found one valid ordering (B,D,C,A,E,F)
  • Q4 systematically enumerated all four valid (D,A,E) triples, found multiple valid orderings (B,D,C,A,F,E and B,D,F,A,C,E from the (2,4,6) triple alone), and was still going when it hit the limit. Q4 correctly identified that the puzzle is underspecified — it has 5 valid solutions, not 1.

So my conclusion is that I'm as smart as Q2, but I'm dumber than Q4. Faced with an ambiguous problem, Q2 stopped at "I found one that works" while Q4 kept verifying uniqueness. So there is value in pushing for Q4, even with half the throughput due to thunderbolt constraints.
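The "5 valid solutions" claim is easy to machine-check. A quick brute-force over all 720 orderings (not part of the PR; added here for verification, with illustrative names):

```c
/* Enumerate all placements of six paintings and count those satisfying
 * the five constraints.  perm[pos] holds the painting at 0-based
 * position pos; paintings are encoded A=0..F=5. */
enum { A, B, C, D, E, F, NP = 6 };

static int pos_of(const int *perm, int p) {
    for (int i = 0; i < NP; i++) if (perm[i] == p) return i;
    return -1;
}

static int valid(const int *perm) {
    int a = pos_of(perm, A), b = pos_of(perm, B), c = pos_of(perm, C);
    int d = pos_of(perm, D), e = pos_of(perm, E), f = pos_of(perm, F);
    if (!(d < a && a < e)) return 0;                    /* 1: D left of A left of E */
    if (!(b < f)) return 0;                             /* 2: F right of B          */
    if (c == NP - 1) return 0;                          /* 3: C needs a right neighbor */
    if (perm[c + 1] != A && perm[c + 1] != E) return 0; /* 3: that neighbor is A or E  */
    if (a - d <= 1) return 0;                           /* 4: D and A not adjacent  */
    if (b != 0 && b != NP - 1) return 0;                /* 5: B at an end           */
    return 1;
}

static int count_rec(int *perm, int *used, int depth) {
    if (depth == NP) return valid(perm);
    int n = 0;
    for (int p = 0; p < NP; p++) {
        if (used[p]) continue;
        used[p] = 1; perm[depth] = p;
        n += count_rec(perm, used, depth + 1);
        used[p] = 0;
    }
    return n;
}

static int count_solutions(void) {
    int perm[NP], used[NP] = {0};
    return count_rec(perm, used, 0);
}
```

Running this confirms the count: exactly 5 orderings satisfy all five constraints, including Q2's B D C A E F and both (2,4,6)-triple orderings Q4 listed.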
