kv-cache: SWA checkpoints store only non-masked cells (cherry-pick #23981) #27
Open
lalalune wants to merge 2 commits into
…eRT/MLX scaffolds (M4/M5)
Lets the one streaming-LLM FFI pipe (eliza_inference_llm_stream_*) be served
by more than one in-process runtime, selected per-_open, without touching the
default llama.cpp path. Realizes M3 of the Gemma 4 cutover and lands the
device-gated M4/M5 backends on top.
M3 seam (always compiled, inert by default):
- src/llm-backend.h — LlmBackendSession / LlmBackendFactory pure-virtual
interfaces mirroring the FFI 1:1, plus llm_backend_context_bundle_dir(ctx),
the one accessor a backend uses to read the bundle root from the otherwise
opaque EliInferenceContext (no can_serve->open bundle-dir caching).
- src/llm-backend-selector.cpp — idempotent registry + selection: ELIZA_LLM_BACKEND
env hard-select, else highest preference_rank among available()+can_serve();
nullptr+no-error => keep in-tree llama.cpp. With no -DELIZA_ENABLE_* gate, no
backend registers, so select() always returns nullptr.
- eliza-inference-ffi.cpp — one `if (stream->backend) return stream->backend->X()`
branch inserted ABOVE each existing llama.cpp/MTP branch in open/prefill/next/
cancel/reset/reset_keep/save_slot/restore_slot/close. Device-critical path
untouched, just guarded.
M4 LiteRT-LM (gate -DELIZA_ENABLE_LITERT, OFF): src/backends/litert-backend.{h,cpp}
— Engine/Session against the researched LiteRT-LM C++ API, NPU->GPU->CPU ladder,
text/*.litertlm probe; no-SDK stub when OFF.
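The NPU->GPU->CPU ladder amounts to trying accelerators in a fixed order and keeping the first one that opens. A minimal sketch follows, assuming a hypothetical `try_open` callback that reports whether an engine could be created on a given accelerator. It does not use the LiteRT-LM API, and `Accel` and `open_with_ladder` are illustrative names.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <optional>

// Illustrative accelerator tags; not LiteRT-LM types.
enum class Accel { NPU, GPU, CPU };

// Walk the ladder in preference order and return the first accelerator that
// try_open accepts, or std::nullopt if every rung fails.
inline std::optional<Accel> open_with_ladder(const std::function<bool(Accel)> & try_open) {
    constexpr std::array<Accel, 3> ladder = {Accel::NPU, Accel::GPU, Accel::CPU};
    for (Accel a : ladder) {
        if (try_open(a)) return a;
    }
    return std::nullopt;
}
```

Returning the accelerator that actually opened lets the caller log which rung was used, which helps when checking the DEVICE-VERIFY assumptions on real hardware.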
M5 CoreML/MLX (gate -DELIZA_ENABLE_MLX + __APPLE__, OFF; FATALs on non-Apple):
src/backends/mlx-coreml-backend.{h,mm} — MLX-primary (mlx-c decode graph) +
CoreML-alternate (stateful MLState KV); no-SDK stub when OFF.
CMake: selector folded into OMNIVOICE_FFI_SOURCES (always built); the two
accelerator backends gated with SDK include/link knobs. Default fused build
verified on Linux: libelizainference.so links, the FFI pipe stays exported,
litert_/mlx_coreml_backend_factory absent (gates OFF) — byte-for-byte the prior
llama.cpp path. Every hardware assumption tagged DEVICE-VERIFY.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cherry-picks upstream ggml-org/llama.cpp#23981 (commit 2365315) into the eliza fork.
Why: Gemma 4 is notorious for KV-checkpoint RAM blow-up (upstream #21690 OOM). This makes `llama_kv_cache::state_write` skip SWA-masked cells, shrinking every spec-decode rollback checkpoint (our FFI `common_prompt_checkpoint` → `llama_state_seq_get_data_ext` → `state_write` path). Directly relevant to the eliza-1 Gemma 4 cutover (#9033 in elizaOS/eliza).
Verified: cherry-pick clean; CPU rebuild green (build 10027); Gemma 4 E2B (Q8_0) still runs (llama-bench pp64/tg32 nominal). Deps (`is_masked_swa`/`n_swa`/`swa_type`) already present in the fork.
🤖 Generated with Claude Code
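The idea of storing only non-masked cells can be illustrated with a toy model. This is a sketch under stated assumptions, not the upstream patch: `Cell`, `masked_by_window`, and `cells_to_checkpoint` are hypothetical names, only a standard (non-chunked) sliding window is modelled, and masking is evaluated against the highest stored position.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy KV cell: only the token position. Real cells also carry sequence ids and K/V data.
struct Cell {
    int32_t pos;
};

// Standard sliding window (assumed): a key at p0 is invisible to a query at p1
// once it is n_swa or more positions behind.
inline bool masked_by_window(uint32_t n_swa, int32_t p0, int32_t p1) {
    return p1 - p0 >= static_cast<int32_t>(n_swa);
}

// Return only the cells a checkpoint would need to store: those still inside
// the window relative to the highest position present.
inline std::vector<Cell> cells_to_checkpoint(const std::vector<Cell> & cells, uint32_t n_swa) {
    std::vector<Cell> kept;
    if (cells.empty()) return kept;
    int32_t pos_max = cells.front().pos;
    for (const Cell & c : cells) pos_max = std::max(pos_max, c.pos);
    for (const Cell & c : cells) {
        if (!masked_by_window(n_swa, c.pos, pos_max)) kept.push_back(c);
    }
    return kept;
}
```

In this toy model, ten cells at positions 0 through 9 with a window of 4 leave four cells (positions 6 through 9) to serialize, so the checkpoint size follows the window rather than the full context length.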