
Mac + llama.cpp Support #16

Merged

SecretSettler merged 11 commits into main from contextpilot_mac on Mar 4, 2026

Conversation

@davendw49 (Contributor)

ContextPilot now runs fully on Apple Silicon using llama.cpp as the inference backend — no CUDA, no cloud, and no external services required.

This makes it a strong fit for private OpenClaw deployments and fully local LLM setups.

More details are available in the guide document.

@SecretSettler (Member) left a comment

Overall looks good to me. Is it possible to improve the installation so that users have an option that does not require building from source, similar to vLLM's CPU install?

@davendw49 (Contributor, Author)

I have improved the installation: `pip install contextpilot` now works on both Mac/CPU and GPU machines, and `pip install contextpilot[gpu]` adds `cupy-cuda12x`.

@SecretSettler SecretSettler self-requested a review February 28, 2026 14:08
@SecretSettler (Member) left a comment

Can you also update the main README so that users know how to install with your new scheme?

Comment on lines +44 to +45

```bash
pip install -r requirements-mac.txt && pip install -e . --no-deps
```
@SecretSettler (Member)

Can you also modify this?

Comment on lines +68 to +81
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
python contextpilot_edge/proxy_server.py
```

The proxy listens on `:8890`, forwards completions to llama-server on `:8889`, and logs cache and GPU metrics for every request.

**Terminal 3 — ContextPilot HTTP server:**

```bash
python -m contextpilot.server.http_server --port 8765 \
--infer-api-url http://localhost:8890
```

ContextPilot points at the proxy (not directly at llama-server) so that all traffic is metered.
@SecretSettler (Member) commented on Feb 28, 2026

Why do we need two servers for this? Could you merge them into a single server, like we did for SGLang and vLLM?

@davendw49 (Contributor, Author)

I have updated the related docs and merged the proxy server into the original llama.cpp server.

@SecretSettler SecretSettler self-requested a review March 1, 2026 20:14
@SecretSettler (Member) left a comment

Please resolve conflicts and add a llama.cpp hook to avoid manual patches.

Hook architecture (`contextpilot/_llamacpp_hook.py`)

- Embeds a C++ shared-library source that interposes `llama_memory_seq_rm()` via `DYLD_INSERT_LIBRARIES` (macOS) / `LD_PRELOAD` (Linux), the same symbol llama.cpp calls when a KV-cache slot is discarded
- Uses `RTLD_NEXT` to call through to the real function; fires `POST /evict_slot` instantly on full-slot clears (`p0 < 0`, `p1 < 0`) via a raw POSIX socket: no polling, zero latency, no external dependencies
- `build()` compiles the library once and caches it in `/tmp`; `launch()` and `preload_env()` provide a Python API for injection
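The preload mechanics can be sketched in Python. This is a minimal, hypothetical version of a `preload_env()` helper; the names and signature are illustrative, not the actual `contextpilot/_llamacpp_hook.py` API:

```python
import sys

def preload_env(lib_path: str, index_url: str) -> dict:
    """Return the environment variables that make the dynamic loader
    inject the compiled interpose library into llama-server.

    lib_path would come from build(); index_url is the ContextPilot
    index server the hook posts evictions to.
    """
    env = {"CONTEXTPILOT_INDEX_URL": index_url}
    if sys.platform == "darwin":
        # macOS: dyld reads DYLD_INSERT_LIBRARIES at process start
        env["DYLD_INSERT_LIBRARIES"] = lib_path
    else:
        # Linux: ld.so reads LD_PRELOAD at process start
        env["LD_PRELOAD"] = lib_path
    return env
```

Either way, the variables are merged into the child process environment before spawning `llama-server`, so no llama.cpp source patch is needed.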

Aligned activation pattern

- Adds a `contextpilot-llama-server` console script (registered via `[project.scripts]`): a drop-in for `llama-server` that injects the hook when `CONTEXTPILOT_INDEX_URL` is set and exec's `llama-server` directly when it is not
- All three engines now share the same one-liner pattern:

```bash
CONTEXTPILOT_INDEX_URL=http://localhost:8765 sglang serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 vllm serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 contextpilot-llama-server ...
```
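The launcher's decision logic can be sketched as follows; `build_launch` is a hypothetical helper name, and the real console script would finish by calling `os.execvpe(cmd[0], cmd, env)`:

```python
import sys

def build_launch(argv: list, environ: dict) -> tuple:
    """Decide how contextpilot-llama-server execs llama-server.

    If CONTEXTPILOT_INDEX_URL is set, merge in the preload variable so
    the eviction hook is injected; otherwise pass the environment
    through untouched (plain llama-server behavior).
    """
    env = dict(environ)
    if env.get("CONTEXTPILOT_INDEX_URL"):
        lib_path = "/tmp/contextpilot_hook.so"  # placeholder for build() output
        key = "DYLD_INSERT_LIBRARIES" if sys.platform == "darwin" else "LD_PRELOAD"
        env[key] = lib_path
    return ["llama-server", *argv], env
```

With the env var unset the wrapper is transparent, which is what makes the one-liner pattern above safe as a drop-in.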

Server changes (`http_server.py`)

- Added `POST /register_slot`: maps a llama.cpp slot ID to a request ID
- Added `POST /evict_slot`: evicts a request by slot ID using the registry
- Added slot-registry reset in `POST /reset`

Removed

- `patches/llama_cpp/`: `eviction_proxy.py`, `apply_patch.sh`, `README.md`

Tests & docs

- New `tests/test_llamacpp_hook.py`: 26 tests covering compilation, symbol export (C linkage, no mangling), `preload_env`, `launch`, and CLI modes
- Updated `tests/test_mac_contextpilot.sh` to use `contextpilot-llama-server`
- Updated all guides (`mac_llama_cpp.md`, `online_usage.md`, `offline_usage.md`, `docs/README.md`) to reflect the new architecture
Comment on lines +729 to +730

```python
@app.post("/evict_slot")
async def evict_slot(request: EvictSlotRequest):
```
@SecretSettler (Member)

The same as above.

Comment on lines +702 to +704

```python
@app.post("/register_slot")
async def register_slot(request: RegisterSlotRequest):
    """
```
@SecretSettler (Member)

Why do we need new endpoints?

Comment on lines +63 to +64

```python
print("  CONTEXTPILOT_INDEX_URL=http://host:8765 \\")
print("  CONTEXTPILOT_LLAMACPP_URL=http://localhost:8889 \\")
```
@SecretSettler (Member)

Why do we need `CONTEXTPILOT_LLAMACPP_URL`? Is it possible to make it the same as for vLLM and SGLang?

```python
logger = logging.getLogger("contextpilot_hook")

CONTEXTPILOT_INDEX_URL = os.environ.get("CONTEXTPILOT_INDEX_URL")
CONTEXTPILOT_LLAMACPP_URL = os.environ.get("CONTEXTPILOT_LLAMACPP_URL")
```
@SecretSettler (Member)

Why?

Comment on lines +323 to +324
| `/register_slot` | POST | Map a llama.cpp slot_id → request_id (native hook integration) |
| `/evict_slot` | POST | Evict by slot_id — called automatically by the native C++ hook |
@SecretSettler (Member)

Is it possible to not add these two endpoints?

Remove `/register_slot` and `/evict_slot` endpoints

The two new endpoints introduced for llama.cpp eviction tracking have been replaced with a naming convention: the C++ hook now posts `POST /evict {"request_ids": ["slot_N"]}`, reusing the existing `/evict` endpoint that SGLang and vLLM already call. This eliminates the slot registry (`_slot_registry`), `RegisterSlotRequest`, `EvictSlotRequest`, and the two dedicated endpoints from `http_server.py`, and removes the slot-assignment machinery (`_discover_n_slots`, `_next_slot`, `_register_slot`) from `proxy_server.py`.
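The convention is small enough to sketch. `evict_payload` below is an illustrative helper, not code from the PR; it shows the JSON body the hook posts to the existing `/evict` endpoint:

```python
import json

def evict_payload(slot_id: int) -> str:
    # llama.cpp slot N is tracked under the synthetic request id "slot_N",
    # so the hook can reuse POST /evict exactly as SGLang and vLLM do;
    # no server-side slot registry is needed.
    return json.dumps({"request_ids": [f"slot_{slot_id}"]})
```

Because the request-id namespace encodes the slot, eviction stays a single stateless call for all three engines.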

Remove `CONTEXTPILOT_LLAMACPP_URL` environment variable

The standalone polling watcher (`standalone_hook.py`) previously auto-started when both `CONTEXTPILOT_INDEX_URL` and `CONTEXTPILOT_LLAMACPP_URL` were set. The second env var is gone; the llama-server URL is now a positional CLI argument:

```bash
# before
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
CONTEXTPILOT_LLAMACPP_URL=http://localhost:8889 \
python contextpilot_hook.py

# after: same as running any other script
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
python contextpilot_hook.py http://localhost:8889
```

SGLang and vLLM were never affected (they only need `CONTEXTPILOT_INDEX_URL`).
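The new CLI shape can be sketched with `argparse`; the argument name `llama_server_url` is illustrative, not necessarily what the script uses:

```python
import argparse

# Sketch of the standalone watcher's CLI: the llama-server URL is a
# positional argument rather than the CONTEXTPILOT_LLAMACPP_URL env var.
parser = argparse.ArgumentParser(prog="contextpilot_hook.py")
parser.add_argument("llama_server_url",
                    help="base URL of the running llama-server")
args = parser.parse_args(["http://localhost:8889"])
print(args.llama_server_url)
```

Passing the URL positionally keeps `CONTEXTPILOT_INDEX_URL` as the single activation switch shared by every engine.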

Add `contextpilot-llama-server` console script

Registered via `[project.scripts]` in `pyproject.toml`. After `pip install contextpilot`, all three engines share the same one-env-var activation pattern:

```bash
CONTEXTPILOT_INDEX_URL=http://localhost:8765 sglang serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 vllm serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 contextpilot-llama-server ...
```

Add Mac benchmark results to README

Llama-3.2-1B-Instruct-Q4_K_M on an Apple M3 MacBook Air (16 GB), MultihopRAG, 100 queries: ContextPilot reduces average end-to-end inference latency from 3,315 ms to 1,378 ms (2.4×).
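A quick check that the reported figures and the speedup are consistent:

```python
# Reported end-to-end latencies from the Mac benchmark above.
baseline_ms = 3315
contextpilot_ms = 1378

speedup = baseline_ms / contextpilot_ms
print(f"{speedup:.1f}x")  # 2.4x, matching the README claim
```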
@SecretSettler (Member) left a comment

LGTM

@SecretSettler SecretSettler merged commit 547ba8f into main Mar 4, 2026
2 checks passed