Idle model unload for mediapipe LLM graphs (#4141) #4332
2e5439b to e9940c6
Hardening + verification update
Since the initial version, this PR underwent a deep correctness/concurrency pass. Summary of what changed and how it was verified.
Robustness improvements added
Testing
- Unit (model-free): state-machine transitions ( …
- Functional (real …
- End-to-end on an Intel Arc B70 GPU: …
- gRPC/KFS path: the lazy-wake lives in …
- Regression: ran the full …
Concurrency review
Reviewed adversarially across four passes; the final pass is a GO with no CRITICAL/HIGH issues. Lock ordering is one-directional ( …
Known limitations / follow-ups (non-blocking)
Adds an opt-in idle timeout that unloads an LLM graph's heavy resources (freeing GPU/CPU memory) after a period with no inference, and lazily reloads on the next request, so the GPU can be shared with other workloads. Follows llama.cpp's --sleep-idle-seconds model.

- Config: idle_unload_timeout_seconds on the mediapipe config entry (0 = disabled, default). Added to the JSON schema. Phase-1 scope: restricted to LLM continuous-batching graphs (HttpLLMCalculator); rejected for graphs with Python nodes or non-LLM calculators at config validation.
- State machine: new UNLOADED state + UnloadEvent. AVAILABLE->UNLOADED on idle; UNLOADED wakes via the existing reload path. isAvailable() is false for UNLOADED; convertToModelStatus() maps it to AVAILABLE so health/readiness still see the servable (it auto-reloads).
- Concurrency: a per-definition recursive lifecycleMtx serializes reload/retire/unload/wakeUp so the watcher and config threads never race on graph state or side packets. unload() is non-blocking, tears down only after confirming the AVAILABLE->UNLOADED transition, and skips while requests or inferences are in flight. The idle timeout is cached in an atomic for lock-free watcher reads.
- In-flight guard: an RAII ActiveInferenceGuard (held for the executor's lifetime) keeps a shared_ptr inference counter so a generation that outlives the idle timeout is never unloaded mid-stream; completing an inference refreshes the activity timestamp. lastActivityTimeNs and the counter are shared_ptr so they outlive the definition if an executor is still running during retire.
- Wake-up failure is retryable: if the lazy reload fails, the graph reverts to UNLOADED (not a wedged failed state) so the next request re-attempts the wake and self-heals once the underlying issue is resolved; the current request gets a clean error.
- Metrics: ovms_graph_loaded gauge (1 loaded / 0 unloaded) per graph.
- Composes with --cache_dir so wake-up is a cache import, not a recompile.

Tested: state-machine + schema unit tests; in-flight-guard and wake-failure unit tests; functional unload/reload/idle-reset/disabled-default/concurrency with a real model; and end-to-end on an Intel Arc GPU (idle unload frees resources, long generations are not unloaded mid-stream, wake-up reloads and serves, soak shows no leak, failed wake self-heals). No regressions vs main.

Implements openvinotoolkit#4141

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
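For readers skimming the commit message, the status transitions it describes can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `State` and `Event` enums, `next()`, and `reportedAvailable()` are hypothetical names, and the real PipelineDefinitionStatus has more states and events than shown here.

```cpp
// Hypothetical, simplified model of the transitions described above.
// Names are illustrative; the real PipelineDefinitionStatus is richer.
enum class State { BEGIN, AVAILABLE, RELOADING, UNLOADED, RETIRED };
enum class Event { Unload, Reload, ReloadSucceeded, Retire };

// Returns the next state. Events that do not apply leave the state unchanged.
inline State next(State s, Event e) {
    switch (e) {
    case Event::Unload:
        // Only AVAILABLE may go idle; Unload is a no-op on every other state.
        return s == State::AVAILABLE ? State::UNLOADED : s;
    case Event::Reload:
        // Wake-up from UNLOADED reuses the ordinary reload path.
        return (s == State::AVAILABLE || s == State::UNLOADED) ? State::RELOADING : s;
    case Event::ReloadSucceeded:
        return s == State::RELOADING ? State::AVAILABLE : s;
    case Event::Retire:
        return State::RETIRED;
    }
    return s;
}

// Request-path view: an UNLOADED graph cannot serve until it is woken.
inline bool isAvailable(State s) { return s == State::AVAILABLE; }

// Health/readiness view: UNLOADED is still reported as available,
// because the graph reloads itself on the next request.
inline bool reportedAvailable(State s) {
    return s == State::AVAILABLE || s == State::UNLOADED;
}
```

Keeping the request-path check and the health-facing check as two separate functions is what lets an idle graph stay visible to orchestration while still forcing a wake-up before inference.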
e9940c6 to 65f2bab
Summary
Closes #4141: opt-in idle unload of LLM graphs to free GPU/CPU memory while idle, with transparent lazy reload on the next request. This lets the GPU be shared with other workloads instead of pinning VRAM for idle models. The design follows the model llama.cpp established with --sleep-idle-seconds (unload on idle, lazy reload on next request), adapted to the OVMS model-manager lifecycle.
Behavior
- Config: idle_unload_timeout_seconds on the mediapipe config entry. 0 = disabled (default, no behavior change). Added to the JSON schema.
- After timeout seconds with no inference, the model-manager watcher unloads the graph: it frees the graph queue and the side-packet resources (the GenAiServable / ContinuousBatchingPipeline holding VRAM).
- … AVAILABLE.
Design
- State machine: new UNLOADED state + UnloadEvent in PipelineDefinitionStatus. Only AVAILABLE transitions on unload; wake-up reuses the existing reload path (ReloadEvent → RELOADING → AVAILABLE). isAvailable() is false for UNLOADED, but convertToModelStatus() maps it to AVAILABLE so health/readiness and orchestration still treat the servable as available (it auto-reloads on demand), mirroring how llama.cpp keeps is_sleeping out of /health.
- … MediapipeGraphDefinition::create()); status/health/metrics endpoints never go through that path, so health probes can't keep a model awake.
- Concurrency: a per-definition recursive lifecycleMtx serializes reload() / retire() / unload() / wakeUpIfUnloaded(), so the watcher thread and the config thread can never race on the graph's state machine or side packets. unload() is non-blocking: it skips graphs with in-flight requests and tears down only after confirming that the AVAILABLE → UNLOADED transition actually happened. createPipeline() wakes an UNLOADED graph with a bounded retry, re-fetching the definition on each iteration to avoid stale pointers. The idle timeout is cached in an atomic, so the watcher reads it lock-free and never has to touch the config. In-flight requests stay safe via the existing GraphIdGuard shared_ptr mechanism and the requestsHandlesCounter drain protocol.
- Metrics: ovms_graph_loaded gauge (1 loaded / 0 unloaded) per graph.
- Composes with --cache_dir (#4230, "--cache_dir not propagated to LLM continuous batching pipeline (regression vs 2025.4 promise)") so wake-up is a cache import rather than a full recompile.
Scope
Phase 1 targets LLM continuous-batching graphs; graphs containing Python nodes are rejected at config validation when the timeout is set (Python teardown has GIL/threading concerns to address separately).
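The in-flight protection and idle check described in the design can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's implementation: `IdleTracker`, `steadyNowNs()`, and `shouldUnload()` are hypothetical names, and only the idea of an RAII guard holding shared_ptr copies of an inference counter and an activity timestamp is taken from the description.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <memory>

// Hypothetical helper: monotonic "now" in nanoseconds.
inline std::int64_t steadyNowNs() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

// Hypothetical shared idle-tracking state. Both members are shared_ptr so a
// guard can outlive the object that created them.
struct IdleTracker {
    std::shared_ptr<std::atomic<int>> activeInferences =
        std::make_shared<std::atomic<int>>(0);
    std::shared_ptr<std::atomic<std::int64_t>> lastActivityNs =
        std::make_shared<std::atomic<std::int64_t>>(steadyNowNs());
};

// RAII guard held for the duration of one inference.
class ActiveInferenceGuard {
public:
    explicit ActiveInferenceGuard(const IdleTracker& t)
        : counter(t.activeInferences), lastActivity(t.lastActivityNs) {
        counter->fetch_add(1);
    }
    ~ActiveInferenceGuard() {
        // Refresh the timestamp before decrementing, so a watcher that
        // observes a zero count also observes the fresh timestamp.
        lastActivity->store(steadyNowNs());
        counter->fetch_sub(1);
    }
    ActiveInferenceGuard(const ActiveInferenceGuard&) = delete;
    ActiveInferenceGuard& operator=(const ActiveInferenceGuard&) = delete;

private:
    std::shared_ptr<std::atomic<int>> counter;
    std::shared_ptr<std::atomic<std::int64_t>> lastActivity;
};

// Watcher-side check: a timeout of 0 disables unloading, and any in-flight
// inference blocks it regardless of how long ago the last activity was.
inline bool shouldUnload(const IdleTracker& t, std::int64_t timeoutSeconds,
                         std::int64_t nowNs) {
    if (timeoutSeconds <= 0) return false;
    if (t.activeInferences->load() > 0) return false;
    return nowNs - t.lastActivityNs->load() >= timeoutSeconds * 1000000000LL;
}
```

Passing the current time into `shouldUnload()` rather than reading the clock inside it keeps the check deterministic in unit tests, which matches the model-free testing approach described under Testing.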
Testing
Built with MSVC and run locally (Windows, --config=win_mp_on_py_off).
Unit tests (model-free):
- PipelineDefinitionStatus transitions: AVAILABLE → UNLOADED, UNLOADED → RELOADING → AVAILABLE, UNLOADED → RETIRED, UnloadEvent is a no-op on every other state, isAvailable() false for UNLOADED, convertToModelStatus() → AVAILABLE.
- Schema: idle_unload_timeout_seconds accepted; negative / wrong-type values rejected.
- unload() is a no-op (no teardown) when state ≠ AVAILABLE (e.g. RELOADING / BEGIN), and skips when requests are in flight.
Functional tests (real facebook/opt-125m):
- … create() resets the idle timer; disabled-by-default never unloads; Python-node rejection (py-on builds); concurrent wake-ups all end in AVAILABLE; concurrent unload/reload/retire stress ends in a consistent state with no crash.
End-to-end on an Intel Arc B70 GPU:
- … idle_unload_timeout_seconds: 8; the watcher unloaded it after the idle period (logged "freed GPU/CPU resources"), and the next request transparently woke it (wake-up completed in 831 ms) and returned a valid completion.
Regression:
- Ran the full ModelManager* and *Mediapipe* suites and compared against a clean-main baseline build: the failure set is identical. The only failures are pre-existing and are caused by test model assets missing from this local environment, so this change introduces zero new failures.
Concurrency review
The concurrency design was put through three independent adversarial review passes. The first surfaced a critical unconditional-teardown bug (the watcher could tear down a graph mid-reload) and several races; the fixes were re-reviewed, which caught a new data race introduced by relocating the idle sweep out of configMtx; that was fixed by the recursive lifecycleMtx serialization above. The final pass verified: lifecycle mutators are mutually exclusive, lock ordering is one-directional (configMtx → lifecycleMtx, never the reverse, so no deadlock), the drain-under-lock cannot deadlock (the only counter-decrementing path is a lock-free atomic released before inference and never touches lifecycleMtx), and the remaining items are LOW/benign, with no CRITICAL/HIGH issues outstanding.
Notes