fix(ds4): Implement MoE low-memory streaming to work around macOS's kernel bug.#73
Open
HomeroRR wants to merge 1 commit into
Conversation
Owner
Hi @HomeroRR, thank you for the PR. I would keep /v1/models as it is, without additional fields beyond the openrouter / openai stuff, and instead add our ds4-specific things in /props. That way we extend /props with our own fields the same way llama.cpp does, and we keep the "standard" endpoint as clean as possible.
## CPU `--low-mem` Streaming Implementation Summary

### What Was Implemented
CPU streaming feature for DeepSeek V4 inference on 8GB RAM systems, complementing the already-working Metal GPU streaming path.
### Architecture
**Two-Phase Per-Layer Processing:**

**Memory Model:**
### Code Changes
- Removed the `--cpu --low-mem` rejection guard
- Added `stream_selected[]`, `stream_router_weights[]`, `layer_buf`, `expert_buf` to the struct
- Added `streaming_phase`, `expert_model`, `scratch` params to `layer_routed_moe_one_prealloc()` with phase logic
- Updated `layer_ffn_one_decode_scratch()` and `layer_forward_raw_swa_one()`
- Added `forward_token_raw_swa_cpu_streaming()` function (new)
- Updated `generate_raw_swa_cpu()` with buffer allocation/deallocation
- Existing call sites pass `0, NULL` (non-streaming) or the appropriate phase values

### Reused Infrastructure
- `stream_load_layer()`: loads all 32 non-expert tensors per layer
- `stream_layer_build_temp()`: creates a temporary model pointing to the stream buffer
- `stream_load_experts()`: packs the 6 selected experts into a contiguous layout

### Two-Phase Logic
### Key Design Decisions
- Non-streaming callers default to `streaming_phase=0`, `expert_model=NULL`

### Testing
A comprehensive test plan has been created: `CPU_STREAMING_TEST_PLAN.md`
#### Quick Verification (Non-Building)

#### Build & Test
### Known Limitations
- `lm_head` not streamed (~536 MB always resident)
- Token-by-token prefill (not batched)
- Metal path requires macOS
### Related Work
This implementation completes a 15-bug fix series:
All 15 bugs are now fixed across both Metal and CPU paths.
### What's NOT Included (Deferred)