## Bug

`embed()` hangs sporadically on Apple M5 Max with the Metal backend. The same query succeeds in ~0.7s on one call and hangs indefinitely (99% CPU on one core) on the next. The behavior is not query-dependent: identical inputs produce different outcomes. CPU-only mode (`gpu: false`) is 100% stable but ~50x slower.
## Reproduction

Minimal test: run the same embedding call 10 times sequentially. Expect 1-3 hangs per 10 runs.
```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama(); // Metal auto-detected
const model = await llama.loadModel({
  modelPath: "nomic-embed-text-v2-moe.Q8_0.gguf" // 488 MB MoE embedding model
});
const ctx = await model.createEmbeddingContext();

for (let i = 0; i < 10; i++) {
  console.time(`run ${i}`);
  const vec = await ctx.getEmbeddingFor("test query about delegation");
  console.timeEnd(`run ${i}`); // ~0.7s when it works, never completes when it hangs
}
```
In practice, observed via the qmd CLI, which calls `embed()` for vector search:
```sh
# Run 10 sequential vsearch calls, 8s timeout each:
export GGML_METAL_NO_RESIDENCY=1
for i in $(seq 1 10); do
  bun dist/cli/qmd.js vsearch "Delegieren" -n 3 &
  pid=$!; sleep 8
  kill -0 $pid 2>/dev/null && { kill $pid; echo "Run $i: HANG"; } || echo "Run $i: OK"
done
# Typical result: 9 OK, 1 HANG (without GGML_METAL_NO_RESIDENCY: 5-7 OK, 3-5 HANG)
```
## Workaround

`GGML_METAL_NO_RESIDENCY=1` reduces the hang rate from ~30-50% to ~10% but does not eliminate it. This env var disables Metal residency sets for buffer allocation (related to llama.cpp autorelease/buffer management; see ggml-org/llama.cpp#18568).
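Since the hang is transient and identical inputs usually succeed on retry, a hard timeout plus retry is a workable mitigation at the CLI level. A sketch (the `qmd vsearch` invocation below is the hypothetical command from the reproduction above; `perl`'s `alarm` is used because macOS ships no coreutils `timeout` by default):

```sh
# Kill a command if it runs longer than the given number of seconds.
# alarm() survives exec(), so the child is killed by SIGALRM on timeout.
run_with_timeout() {
  perl -e 'alarm shift; exec @ARGV' "$@"
}

# Retry a hanging-prone command up to 3 times with an 8s hard timeout each.
retry_embed() {
  local tries=3 i
  for i in $(seq 1 "$tries"); do
    run_with_timeout 8 "$@" && return 0
    echo "attempt $i hung or failed, retrying" >&2
  done
  return 1
}

# Usage (hypothetical, from the reproduction above):
# retry_embed bun dist/cli/qmd.js vsearch "Delegieren" -n 3
```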
## What doesn't help

- Upgrading llama.cpp: rebuilt with b8783 (from b8390 in 3.18.1); the hang rate actually increased to ~40%
- `GGML_METAL_TENSOR_DISABLE=1`: no improvement, possibly worse
- Both flags combined: worse than `GGML_METAL_NO_RESIDENCY` alone
- `AbortSignal`: the `signal` parameter on `prompt()` is not respected during native Metal compute; the call blocks the event loop
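Because `AbortSignal` is ignored once native Metal compute starts and the call blocks the event loop, the only reliable cancellation I found is process-level: run the embedding in a child process and SIGKILL it on timeout. A minimal sketch; `embed-worker.ts` is a hypothetical script that calls `getEmbeddingFor()` and prints the vector as JSON on stdout:

```ts
import { spawn } from "node:child_process";

// Run a command, collect stdout, and hard-kill it if it exceeds timeoutMs.
function runWithTimeout(cmd: string, args: string[], timeoutMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { stdio: ["ignore", "pipe", "inherit"] });
    const timer = setTimeout(() => {
      child.kill("SIGKILL"); // the hung Metal loop never yields, so no graceful signal works
      reject(new Error(`command hung for ${timeoutMs} ms, killed`));
    }, timeoutMs);
    let out = "";
    child.stdout.on("data", (chunk) => (out += chunk));
    child.on("exit", (code) => {
      clearTimeout(timer);
      if (code === 0) resolve(out);
      else reject(new Error(`worker exited with code ${code}`));
    });
  });
}

// Hypothetical usage:
// const vec = JSON.parse(await runWithTimeout("bun", ["embed-worker.ts", query], 8000));
```

Paying model-load cost per call is wasteful, but it bounds the worst case and composes with the retry behavior noted above.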
## Observations

- The hang occurs in the native Metal compute pipeline, not in JS
- The process shows 99-100% CPU on one core during the hang
- No crash, no error, no output; just infinite computation
- The model (nomic-embed-text-v2-moe) is a Mixture-of-Experts architecture; MoE routing may trigger different Metal kernel paths
- The first call after process start seems slightly more likely to hang (cold start)
- Concurrent `embed()` calls guarantee a deadlock (separate known issue)
## Related issues
## Environment
- node-llama-cpp: 3.18.1
- llama.cpp: b8390 (prebuilt)
- Hardware: Apple M5 Max, 128 GB RAM
- OS: macOS 26.4.1 (Tahoe)
- Runtime: Bun 1.3.12 (also tested with Node 25.9.0 — same behavior)
- Model: nomic-embed-text-v2-moe Q8_0 (488 MB, 768 dimensions, MoE)