Conversation
SecretSettler
left a comment
Overall looks good to me. Is it possible to improve the installation so that users have options other than building from source, like vLLM CPU?
I have improved the installation: `pip install contextpilot` now works on both Mac/CPU and GPU machines, and `pip install contextpilot[gpu]` adds `cupy-cuda12x`.
SecretSettler
left a comment
Can you also update the main README so that users know how to install with your new modifications?
docs/guides/mac_llama_cpp.md
Outdated
```bash
pip install -r requirements-mac.txt && pip install -e . --no-deps
```
Can you also modify this?
docs/guides/mac_llama_cpp.md
Outdated
```bash
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
python contextpilot_edge/proxy_server.py
```

The proxy listens on `:8890`, forwards completions to llama-server on `:8889`, and logs cache and GPU metrics for every request.

**Terminal 3 — ContextPilot HTTP server:**

```bash
python -m contextpilot.server.http_server --port 8765 \
    --infer-api-url http://localhost:8890
```

ContextPilot points at the proxy (not directly at llama-server) so that all traffic is metered.
Why do we need two servers for this? From my perspective, you could merge them into a single one, as we did for SGLang and vLLM.
I have updated the related docs and merged the proxy server into the original llama.cpp server.
SecretSettler
left a comment
Please resolve conflicts and add a llama.cpp hook to avoid manual patches.
**Hook architecture (`contextpilot/_llamacpp_hook.py`)**
- Embeds a C++ shared library source that interposes `llama_memory_seq_rm()` via `DYLD_INSERT_LIBRARIES` (macOS) / `LD_PRELOAD` (Linux) — the same symbol llama.cpp calls when a KV-cache slot is discarded
- Uses `RTLD_NEXT` to call through to the real function; fires `POST /evict_slot` instantly on full-slot clears (`p0 < 0, p1 < 0`) via a raw POSIX socket — no polling, zero latency, no external dependencies
- `build()` compiles the library once and caches it in `/tmp`; `launch()` and `preload_env()` provide a Python API for injection

**Aligned activation pattern**
- Adds a `contextpilot-llama-server` console script (registered via `[project.scripts]`): a drop-in for `llama-server` that injects the hook when `CONTEXTPILOT_INDEX_URL` is set and exec's `llama-server` directly when it is not
- All three engines now share the same one-liner pattern:

```bash
CONTEXTPILOT_INDEX_URL=http://localhost:8765 sglang serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 vllm serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 contextpilot-llama-server ...
```

**Server changes (`http_server.py`)**
- Added `POST /register_slot` — maps a llama.cpp slot ID to a request ID
- Added `POST /evict_slot` — evicts a request by slot ID using the registry
- Added slot registry reset in `POST /reset`

**Removed**
- `patches/llama_cpp/` — `eviction_proxy.py`, `apply_patch.sh`, `README.md`

**Tests & docs**
- New `tests/test_llamacpp_hook.py` — 26 tests covering compilation, symbol export (C linkage, no mangling), `preload_env`, `launch`, and CLI modes
- Updated `tests/test_mac_contextpilot.sh` to use `contextpilot-llama-server`
- Updated all guides (`mac_llama_cpp.md`, `online_usage.md`, `offline_usage.md`, `docs/README.md`) to reflect the new architecture
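The env-injection half of this can be sketched in a few lines of Python. This is a minimal illustration of the mechanism, assuming a `preload_env()`-style helper; the function body below is not the actual `_llamacpp_hook.py` implementation:

```python
import os
import sys

def preload_env(lib_path, base_env=None):
    """Return an env dict that makes the dynamic loader inject lib_path
    into a child process: DYLD_INSERT_LIBRARIES on macOS, LD_PRELOAD on
    Linux. Illustrative sketch of the mechanism only."""
    env = dict(os.environ if base_env is None else base_env)
    var = "DYLD_INSERT_LIBRARIES" if sys.platform == "darwin" else "LD_PRELOAD"
    existing = env.get(var)
    # Prepend the hook, preserving any libraries already being preloaded.
    env[var] = lib_path if not existing else f"{lib_path}:{existing}"
    return env
```

A child started with such an environment loads the hook library before `llama-server`'s own symbols resolve, which is what lets the interposed `llama_memory_seq_rm()` take precedence over the real one.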
contextpilot/server/http_server.py
Outdated
```python
@app.post("/evict_slot")
async def evict_slot(request: EvictSlotRequest):
```
contextpilot/server/http_server.py
Outdated
```python
@app.post("/register_slot")
async def register_slot(request: RegisterSlotRequest):
    """
```
Why do we need new endpoints?
contextpilot/install_standalone.py
Outdated
```python
print("  CONTEXTPILOT_INDEX_URL=http://host:8765 \\")
print("  CONTEXTPILOT_LLAMACPP_URL=http://localhost:8889 \\")
```
Why do we need `CONTEXTPILOT_LLAMACPP_URL`? Is it possible to make the setup the same as for vLLM and SGLang?
contextpilot/standalone_hook.py
Outdated
```python
logger = logging.getLogger("contextpilot_hook")

CONTEXTPILOT_INDEX_URL = os.environ.get("CONTEXTPILOT_INDEX_URL")
CONTEXTPILOT_LLAMACPP_URL = os.environ.get("CONTEXTPILOT_LLAMACPP_URL")
```
docs/guides/online_usage.md
Outdated
| `/register_slot` | POST | Map a llama.cpp slot_id → request_id (native hook integration) |
| `/evict_slot` | POST | Evict by slot_id — called automatically by the native C++ hook |
Is it possible to not add these two endpoints?
**Remove `/register_slot` and `/evict_slot` endpoints**

The two new endpoints introduced for llama.cpp eviction tracking have been replaced with a naming convention: the C++ hook now posts `POST /evict` with body `{"request_ids": ["slot_N"]}`, reusing the existing `/evict` endpoint that SGLang and vLLM already call. This eliminates the slot registry (`_slot_registry`), `RegisterSlotRequest`, `EvictSlotRequest`, and the two dedicated endpoints from `http_server.py`, and removes the slot assignment machinery (`_discover_n_slots`, `_next_slot`, `_register_slot`) from `proxy_server.py`.
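The convention is small enough to illustrate directly. A sketch in Python, assuming a hypothetical helper name (`evict_payload` is not in the PR; the payload shape is the one described above):

```python
import json

def evict_payload(slot_id):
    """Body the hook POSTs to the existing /evict endpoint.

    The request_id for a llama.cpp slot is "slot_<id>" by convention,
    so the server needs no slot registry to resolve it.
    """
    return json.dumps({"request_ids": [f"slot_{slot_id}"]})

print(evict_payload(3))  # → {"request_ids": ["slot_3"]}
```

Because the mapping is purely lexical, both sides can compute it independently and no registration round-trip is needed.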
**Remove `CONTEXTPILOT_LLAMACPP_URL` environment variable**

The standalone polling watcher (`standalone_hook.py`) previously auto-started when both `CONTEXTPILOT_INDEX_URL` and `CONTEXTPILOT_LLAMACPP_URL` were set. The second env var is gone; the llama-server URL is now a positional CLI argument:

```bash
# before
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
CONTEXTPILOT_LLAMACPP_URL=http://localhost:8889 \
python contextpilot_hook.py

# after — same as running any other script
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
python contextpilot_hook.py http://localhost:8889
```

SGLang and vLLM were never affected (they only need `CONTEXTPILOT_INDEX_URL`).
**Add `contextpilot-llama-server` console script**

Registered via `[project.scripts]` in `pyproject.toml`. After `pip install contextpilot`, all three engines share the same one-env-var activation pattern:

```bash
CONTEXTPILOT_INDEX_URL=http://localhost:8765 sglang serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 vllm serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 contextpilot-llama-server ...
```
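For reference, a `[project.scripts]` entry of this shape registers the command; the module path and entry function below are assumptions for illustration, not taken from this PR:

```toml
[project.scripts]
# Hypothetical entry point -- point this at wherever the hook's CLI
# main() actually lives in the package.
contextpilot-llama-server = "contextpilot._llamacpp_hook:main"
```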
**Add Mac benchmark results to README**

Llama-3.2-1B-Instruct-Q4_K_M on an Apple M3 MacBook Air (16 GB), MultihopRAG 100 queries: ContextPilot reduces average end-to-end inference latency from 3,315 ms → 1,378 ms (2.4×).
ContextPilot now runs fully on Apple Silicon using llama.cpp as the inference backend — no CUDA, no cloud, and no external services required.
This makes it a strong fit for private OpenClaw deployments and fully local LLM setups.
More details are available in the guide document.