Conversation
SecretSettler
left a comment
Overall looks good to me. Is it possible to improve the installation so that users have options other than building from source, like vLLM CPU?
I have improved the installation: `pip install contextpilot` now works on both Mac/CPU and GPU machines, and `pip install contextpilot[gpu]` adds `cupy-cuda12x`.
SecretSettler
left a comment
Can you also update the main README so that users know how to install with your new modifications?
docs/guides/mac_llama_cpp.md
Outdated
```bash
pip install -r requirements-mac.txt && pip install -e . --no-deps
```
Can you also modify this?
docs/guides/mac_llama_cpp.md
Outdated
```bash
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
python contextpilot_edge/proxy_server.py
```

The proxy listens on `:8890`, forwards completions to llama-server on `:8889`, and logs cache and GPU metrics for every request.

**Terminal 3 — ContextPilot HTTP server:**

```bash
python -m contextpilot.server.http_server --port 8765 \
    --infer-api-url http://localhost:8890
```

ContextPilot points at the proxy (not directly at llama-server) so that all traffic is metered.
Why do we need two servers for this? From my perspective, you could merge them into a single one, as we did for SGLang and vLLM.
I have updated the related docs and merged the proxy server into the original llama.cpp server.
SecretSettler
left a comment
Please resolve conflicts and add a llama.cpp hook to avoid manual patches.
**Hook architecture (`contextpilot/_llamacpp_hook.py`)**
- Embeds a C++ shared library source that interposes `llama_memory_seq_rm()` via `DYLD_INSERT_LIBRARIES` (macOS) / `LD_PRELOAD` (Linux) — the same symbol llama.cpp calls when a KV-cache slot is discarded
- Uses `RTLD_NEXT` to call through to the real function; fires `POST /evict_slot` instantly on full-slot clears (`p0 < 0, p1 < 0`) via a raw POSIX socket — no polling, zero latency, no external dependencies
- `build()` compiles the library once and caches it in `/tmp`; `launch()` and `preload_env()` provide a Python API for injection

**Aligned activation pattern**
- Adds a `contextpilot-llama-server` console script (registered via `[project.scripts]`): a drop-in for `llama-server` that injects the hook when `CONTEXTPILOT_INDEX_URL` is set and exec's `llama-server` directly when it is not
- All three engines now share the same one-liner pattern:

```bash
CONTEXTPILOT_INDEX_URL=http://localhost:8765 sglang serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 vllm serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 contextpilot-llama-server ...
```

**Server changes (`http_server.py`)**
- Added `POST /register_slot` — maps a llama.cpp slot ID to a request ID
- Added `POST /evict_slot` — evicts a request by slot ID using the registry
- Added slot registry reset in `POST /reset`

**Removed**
- `patches/llama_cpp/` — `eviction_proxy.py`, `apply_patch.sh`, `README.md`

**Tests & docs**
- New `tests/test_llamacpp_hook.py` — 26 tests covering compilation, symbol export (C linkage, no mangling), `preload_env`, `launch`, and CLI modes
- Updated `tests/test_mac_contextpilot.sh` to use `contextpilot-llama-server`
- Updated all guides (`mac_llama_cpp.md`, `online_usage.md`, `offline_usage.md`, `docs/README.md`) to reflect the new architecture
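The env-injection half of this can be sketched in a few lines of Python. This is a minimal illustration of the mechanism, assuming a `preload_env()`-style helper; the function body below is not the actual `_llamacpp_hook.py` implementation:

```python
import os
import sys

def preload_env(lib_path, base_env=None):
    """Return an env dict that makes the dynamic loader inject lib_path
    into a child process: DYLD_INSERT_LIBRARIES on macOS, LD_PRELOAD on
    Linux. Illustrative sketch of the mechanism only."""
    env = dict(os.environ if base_env is None else base_env)
    var = "DYLD_INSERT_LIBRARIES" if sys.platform == "darwin" else "LD_PRELOAD"
    existing = env.get(var)
    # Prepend the hook, preserving any libraries already being preloaded.
    env[var] = lib_path if not existing else f"{lib_path}:{existing}"
    return env
```

A child started with such an environment loads the hook library before `llama-server`'s own symbols resolve, which is what lets the interposed `llama_memory_seq_rm()` take precedence over the real one.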
contextpilot/server/http_server.py
Outdated
```python
@app.post("/evict_slot")
async def evict_slot(request: EvictSlotRequest):
```
contextpilot/server/http_server.py
Outdated
```python
@app.post("/register_slot")
async def register_slot(request: RegisterSlotRequest):
    """
```
Why do we need new endpoints?
contextpilot/install_standalone.py
Outdated
```python
print("  CONTEXTPILOT_INDEX_URL=http://host:8765 \\")
print("  CONTEXTPILOT_LLAMACPP_URL=http://localhost:8889 \\")
```
Why do we need `CONTEXTPILOT_LLAMACPP_URL`? Is it possible to make the setup the same as for vLLM and SGLang?
contextpilot/standalone_hook.py
Outdated
```python
logger = logging.getLogger("contextpilot_hook")

CONTEXTPILOT_INDEX_URL = os.environ.get("CONTEXTPILOT_INDEX_URL")
CONTEXTPILOT_LLAMACPP_URL = os.environ.get("CONTEXTPILOT_LLAMACPP_URL")
```
docs/guides/online_usage.md
Outdated
| `/register_slot` | POST | Map a llama.cpp slot_id → request_id (native hook integration) |
| `/evict_slot` | POST | Evict by slot_id — called automatically by the native C++ hook |
Is it possible to not add these two endpoints?
**Remove `/register_slot` and `/evict_slot` endpoints**

The two new endpoints introduced for llama.cpp eviction tracking have been replaced with a naming convention: the C++ hook now posts `POST /evict` with body `{"request_ids": ["slot_N"]}`, reusing the existing `/evict` endpoint that SGLang and vLLM already call. This eliminates the slot registry (`_slot_registry`), `RegisterSlotRequest`, `EvictSlotRequest`, and the two dedicated endpoints from `http_server.py`, and removes the slot assignment machinery (`_discover_n_slots`, `_next_slot`, `_register_slot`) from `proxy_server.py`.
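The convention is small enough to illustrate directly. A sketch in Python, assuming a hypothetical helper name (`evict_payload` is not in the PR; the payload shape is the one described above):

```python
import json

def evict_payload(slot_id):
    """Body the hook POSTs to the existing /evict endpoint.

    The request_id for a llama.cpp slot is "slot_<id>" by convention,
    so the server needs no slot registry to resolve it.
    """
    return json.dumps({"request_ids": [f"slot_{slot_id}"]})

print(evict_payload(3))  # → {"request_ids": ["slot_3"]}
```

Because the mapping is purely lexical, both sides can compute it independently and no registration round-trip is needed.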
**Remove `CONTEXTPILOT_LLAMACPP_URL` environment variable**

The standalone polling watcher (`standalone_hook.py`) previously auto-started when both `CONTEXTPILOT_INDEX_URL` and `CONTEXTPILOT_LLAMACPP_URL` were set. The second env var is gone; the llama-server URL is now a positional CLI argument:

```bash
# before
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
CONTEXTPILOT_LLAMACPP_URL=http://localhost:8889 \
python contextpilot_hook.py

# after — same as running any other script
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
python contextpilot_hook.py http://localhost:8889
```

SGLang and vLLM were never affected (they only need `CONTEXTPILOT_INDEX_URL`).
**Add `contextpilot-llama-server` console script**

Registered via `[project.scripts]` in `pyproject.toml`. After `pip install contextpilot`, all three engines share the same one-env-var activation pattern:

```bash
CONTEXTPILOT_INDEX_URL=http://localhost:8765 sglang serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 vllm serve ...
CONTEXTPILOT_INDEX_URL=http://localhost:8765 contextpilot-llama-server ...
```
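For reference, a `[project.scripts]` entry of this shape registers the command; the module path and entry function below are assumptions for illustration, not taken from this PR:

```toml
[project.scripts]
# Hypothetical entry point -- point this at wherever the hook's CLI
# main() actually lives in the package.
contextpilot-llama-server = "contextpilot._llamacpp_hook:main"
```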
**Add Mac benchmark results to README**

Llama-3.2-1B-Instruct-Q4_K_M on an Apple M3 MacBook Air (16 GB), MultihopRAG 100 queries: ContextPilot reduces average end-to-end inference latency from 3,315 ms → 1,378 ms (2.4×).
ContextPilot now runs fully on Apple Silicon using llama.cpp as the inference backend — no CUDA, no cloud, and no external services required.
This makes it a strong fit for private OpenClaw deployments and fully local LLM setups.
More details are available in the guide document.