Pooling VRAM between Macs #83
This adds an opt-in second-machine path for inference: a head process owns layers [0, L_mid) and ships the residual stream over TCP to a tail worker that owns [L_mid, DS4_N_LAYER) and runs the output head. The motivating use case is two M-series Macs with 128 GB each linked by Thunderbolt 4, so the Q4 quant (needs ~256 GB) can run pooled across both unified-memory pools.
Single-host operation is unchanged. With neither --rpc-peer nor a partial n_layer_end set, every code path falls through the original single-engine logic and existing tests are bit-identical.
Design
- One TCP connection per head/tail pair (1:1).
- Frame format: u32 length, u8 op, u16 reserved, payload; little-endian throughout.
- Handshake validates model fingerprint (file size + first 32 bytes), DS4 shape constants, routed-quant bits, and the split point. A mismatch fails the connection with a human-readable reason rather than producing wrong output silently.
- Decode hot path: the head's local Metal encode runs its layer slice and stops at cur_hc; that tensor (DS4_N_HC * DS4_N_EMBD values) is shipped to the tail. The tail imports it, skips embed, runs its slice, runs the output projection, and ships back DS4_N_VOCAB logits.
- Prefill under RPC uses a per-token decode loop fallback (correct but O(n_tokens) round trips; ~0.5 s extra latency for ~1000-token prompts on TB4). A real batch-prefill op is a Phase 5 follow-up.
- MTP draft and disk KV writes are disabled under RPC for now; same reason -- they need their own wire ops.
Files
ds4.h: public surface for layer range, RPC peer config, residual export/import, shape accessors, logits accessor.
ds4.c: engine + weights structs gain layer range fields; weights_bind and weights_validate_layout honor the range so a tail worker doesn't need head's token_embd and vice versa; all five forward sites (Metal decode, Metal prefill in both branches, CPU decode, CPU prefill) gate their loop and the embed/output bookends on the range; instance lock is refcounted so a daisy-chain test (or the head+MTP draft pair) can hold it twice in one process; ds4_engine_open dials and handshakes a peer when configured; ds4_session_eval_internal ships the residual and receives logits when a peer is attached; ds4_session_{invalidate,rewind} propagate RESET; ds4_session_sync falls back to per-token under RPC for cold and warm paths.
ds4_rpc.{h,c}: new files, ~550 lines. Plain POSIX sockets, no threads. Magic preamble, length-prefixed frames, config handshake, decode req/reply, reset req/reply, shutdown. Op dispatch on the tail uses MSG_PEEK so each *_recv helper can read its own header.
ds4_rpc_worker.c: new file, ~330 lines. ds4-rpc-worker binary: parses --layer-start/--layer-end/--listen/--port/--ctx/--routed-quant-bits, opens a partial-range engine, accepts one head connection, runs the decode/reset/shutdown serve loop, and exits cleanly on SHUTDOWN.
ds4_cli.c, ds4_server.c: --rpc-peer host:port and --rpc-split L on both binaries. Default port 46434.
tests/ds4_test.c: new --pipeline-daisy-chain test. Opens three engines back-to-back in one process under the refcounted lock: full [0, 43), head [0, 21), tail [21, 43). Tokenizes "Hello", runs one decode on each, daisy-chains residual from head to tail, compares top logprobs. Tolerates small float drift from different Metal command-buffer split points; asserts top-3 token agreement and max logit diff < 0.1.
Makefile: ds4_rpc.o is part of CORE_OBJS; new ds4-rpc-worker target.
Single-host invariant
When opt->rpc_peer_host is unset and opt->n_layer_end normalizes
to DS4_N_LAYER, every gated branch falls into the original code
path:
- weights_bind requires every layer's tensors and the full output stack, same as before
- the per-layer forward loops iterate [0, DS4_N_LAYER)
- the embed step runs, the output head runs, logits are read
- ds4_session_eval_internal skips the RPC block
- ds4_session_sync takes the original prefill paths
./ds4_test --server and --metal-kernels pass. --logprob-vectors and --tool-call-quality are good additional gates once the global ds4 instance lock is available (an external ds4-server holds it on the dev box right now).
--quant flag: pass q2 or q4 to specify the model
Can the pool share across a Thunderbolt bridge?
Works, but not the best time savings.
I find this interesting but will have to consider it carefully, thanks for providing the work.
Any tests you recommend I run to evaluate Q2 vs. Q4 quality? I agree the Q2 is great and quick. My hope was that Q4 would be a sizable jump in quality worth the effort of RPC implementation - running a literary criticism test now on a doc to evaluate for my own purposes. But can easily throw some riddles or other logic questions to try and shake out the difference between the two if you have any in mind.
Going through some tests, Q4 > Q2 on logic, but so far the two are quite comparable for coding and more qualitative work. Here's the logic question:
Both correctly deduced B=1 from constraints 2+5 and started enumerating D/A/E triples. But that's where the similarity ends.
So my conclusion is that I'm as smart as Q2, but I'm dumber than Q4. Faced with an ambiguous problem, Q2 stopped at "I found one that works" while Q4 kept verifying uniqueness. So there is value in pushing for Q4, even with half the throughput due to Thunderbolt constraints.
I added a pipeline-parallel RPC: split DS-V4-Flash across two Macs
This adds an opt-in second-machine path for inference: a head process owns layers [0, L_mid) and ships the residual stream over TCP to a tail worker that owns [L_mid, DS4_N_LAYER) and runs the output head. The motivating use case is two M-series Macs with 128 GB each linked by Thunderbolt 4, so the Q4 quant (needs ~256 GB) can run pooled across both unified-memory pools.
Single-host operation is unchanged. With neither --rpc-peer nor a partial n_layer_end set, every code path falls through the original single-engine logic and existing tests are bit-identical.
Design
Files
ds4.h: public surface for layer range, RPC peer config, residual export/import, batch residual export, shape accessors, logits accessor, model_path/mtp_path resolvers, MTP draft accessors.
ds4.c: engine + weights structs gain layer range fields; weights_bind and weights_validate_layout honor the range so a tail worker doesn't need head's token_embd and vice versa (tail still binds token_embd when --mtp is set, since MTP-head reads it); all five forward sites (Metal decode, Metal prefill in both branches, CPU decode, CPU prefill) gate their loop and the embed/output bookends on the range; weights_compute_byte_clusters maps only the relevant ranges of the GGUF into Metal (head's range is non-contiguous when token_embd lives at file end, so the head gets ~71 GiB + ~4 GiB as two disjoint clusters instead of mapping the whole 164 GB Q4 file); hard guard refuses to start if total mapped > 93.75% of physical RAM; instance lock is refcounted so a daisy-chain test can hold it twice in one process; ds4_engine_open dials and handshakes a peer when configured and captures the peer's MTP capability; ds4_session_eval_internal ships the residual and receives logits when a peer is attached, with an optional head-side prefetch path (see "MTP / prefetch"); ds4_session_{invalidate,rewind,free} drain any in-flight speculation before issuing the network op so request boundaries don't leak a pending DECODE_REPLY into the next request's RESET stream.
ds4_rpc.{h,c}: ~1500 lines of new wire code. Plain POSIX sockets, no threads. Ops: HELLO_CLIENT/SERVER (handshake), DECODE_REQ/REPLY, PREFILL_REQ/REPLY, RESET, REWIND, MTP_TRIM, VERIFY_BATCH, SHUTDOWN. Op dispatch on the tail uses MSG_PEEK so each *_recv helper can read its own header. SO_RCVTIMEO is 300s by default (override via DS4_RPC_RECV_TIMEOUT_SECS); the head closes its peer and NULLs rpc_peer on any RPC error so a tail crash surfaces as a clean error rather than indefinite hang.
ds4_rpc_worker.c: ~500 lines. ds4-rpc-worker binary: parses --layer-start/--layer-end/--listen/--port/--ctx/--routed-quant-bits/--mtp [FILE]/--mtp-draft N/--mtp-margin F/--quant q2|q4; opens a partial-range engine; accepts one head connection; runs the serve loop with all opcodes dispatched. Exits cleanly on SHUTDOWN or when the head disconnects.
ds4_cli.c, ds4_server.c: --rpc-peer host:port and --rpc-split L on both binaries; --quant q2|q4 to pick the canonical model file in ./gguf/. Default port 46434.
tests/ds4_test.c: new --pipeline-daisy-chain test. Opens three engines back-to-back in one process under the refcounted lock: full [0, 43), head [0, 21), tail [21, 43). Tokenizes "Hello", runs one decode on each, daisy-chains residual from head to tail, compares top logprobs. Tolerates small float drift from different Metal command-buffer split points; asserts top-3 token agreement and max logit diff < 0.1.
Makefile: ds4_rpc.o is part of CORE_OBJS; new ds4-rpc-worker target.
Single-host invariant
When opt->rpc_peer_host is unset and opt->n_layer_end normalizes to DS4_N_LAYER, every gated branch falls into the original code path:
- weights_bind requires every layer's tensors and the full output stack, same as before
- the per-layer forward loops iterate [0, DS4_N_LAYER)
- the embed step runs, the output head runs, logits are read
- ds4_session_eval_internal skips the RPC block
- ds4_session_sync takes the original prefill paths
./ds4_test --server and --metal-kernels pass.
MTP / prefetch (opt-in, env-gated)
The head's eval_internal has an optional Phase 6 path: speculatively run head's L0-L20 on the previous reply's drafts[0] and ship the spec request eagerly so the tail starts processing while the head is doing its sample-and-loop bookkeeping. On hit (~67% in measured data) the in-flight spec was right and head skips the real L0-L20 for the next token. On miss the head drains the spec reply, sends REWIND to roll the tail's KV back by one position, restores its own KV from a snapshot, and falls through to the synchronous path. A 32-cycle sliding window of hit/miss outcomes (Phase 6.7 adaptive) triggers a 32-cycle cooldown when hit rate drops below 50%, so worst-case workloads don't keep paying the miss tax.
Here's the measured performance on the 2-Mac setup (M1 Ultra head + M3 Max tail, TB4 bridge, Q4):
Short = 35-token canned chat prompt, 256-token completion. Long = 7497-token prompt, 115-209 tokens of completion. Temperature 0; outputs are deterministic across runs. Decode t/s is the only knob the engine controls per token; prefill variance on the long row reflects warmup state (cold map + Metal residency), not algorithmic differences -- 5a's batched prefill landed the ~19x prefill win and that holds across 5d-6.
Not a fan of the MTP results. In general, prefetch wins when MTP's recursive predictions are accurate (chat, casual generation) and loses when they're not (code, technical text). Adaptive cooldown caps the damage but can't fully recover. An async-reader-thread variant (depth-1 with 2-step lookahead via drafts[1]) was attempted and rolled back -- the per-RPC threading overhead on macOS ate the theoretical 10% gain and added a 40% regression on short prompts. Maybe someone smarter than me can figure it out.
Operational rule:
- Interactive chat / short prompts: export DS4_RPC_PREFETCH=1
- Long-context analysis: unset DS4_RPC_PREFETCH
Same binaries either way.
--quant flag: pass q2 or q4 to specify the model. Default auto-detects the canonical file in ./gguf/ (prefers Q2 if both present).
Usage examples
Two-Mac setup (head on this box, tail on the MBP, TB4 bridge):
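Assembled from the flags documented above, a plausible shape for the two-Mac invocation is the following; the host name is a placeholder and the exact command lines are a sketch, not tested output:

```shell
# Tail (on the MBP): serve layers [21, 43) of the Q4 model on the default port 46434.
./ds4-rpc-worker --layer-start 21 --layer-end 43 --quant q4 --port 46434

# Head (on this box): run layers [0, 21) and ship the residual over the TB4 bridge.
./ds4-server --quant q4 --rpc-split 21 --rpc-peer tail-mac.local:46434
```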
Single-prompt run (file in, file out):
Benchmarking:
Env vars (all opt-in):