Idle model unload for mediapipe LLM graphs (#4141) #4332

Open
exzile wants to merge 1 commit into openvinotoolkit:main from exzile:feature/idle-model-unload

exzile commented Jun 26, 2026

Summary

Closes #4141 — opt-in idle unload of LLM graphs to free GPU/CPU memory while idle, with transparent lazy reload on the next request. This lets the GPU be shared with other workloads instead of pinning VRAM for idle models. Design follows the model llama.cpp established with --sleep-idle-seconds (unload on idle, lazy reload on next request), adapted to the OVMS model-manager lifecycle.

Behavior

  • New config field idle_unload_timeout_seconds on the mediapipe config entry. 0 = disabled (default, no behavior change). Added to the JSON schema.
  • After idle_unload_timeout_seconds seconds with no inference, the model-manager watcher unloads the graph: it frees the graph queue and the side-packet resources (the GenAiServable/ContinuousBatchingPipeline holding VRAM).
  • The next inference request transparently triggers a wake-up reload and is served once the graph is back to AVAILABLE.
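
The idle decision described above can be sketched as a small pure function. This is only an illustration under stated assumptions: `IdleSnapshot` and `shouldUnload` are hypothetical names, not identifiers from this PR; only the meaning of idle_unload_timeout_seconds (0 = disabled) and the rule that busy graphs are skipped come from the description.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Hypothetical snapshot of the per-graph values a watcher would inspect.
struct IdleSnapshot {
    std::chrono::seconds timeout;    // idle_unload_timeout_seconds; 0 = disabled
    Clock::time_point lastActivity;  // refreshed only when inference is acquired
    std::uint64_t inFlight;          // requests currently using the graph
};

// Returns true when the graph has been idle for at least the configured timeout.
inline bool shouldUnload(const IdleSnapshot& s, Clock::time_point now) {
    assert(s.timeout >= std::chrono::seconds::zero());            // the schema rejects negatives
    if (s.timeout == std::chrono::seconds::zero()) return false;  // feature disabled
    if (s.inFlight != 0) return false;                            // busy graphs are skipped
    return now - s.lastActivity >= s.timeout;
}
```

Because health and metrics endpoints never update `lastActivity` in this model, polling them cannot postpone the unload.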

Design

  • New UNLOADED state + UnloadEvent in PipelineDefinitionStatus. Only AVAILABLE transitions on unload; wake-up reuses the existing reload path (ReloadEvent → RELOADING → AVAILABLE). isAvailable() is false for UNLOADED, but convertToModelStatus() maps it to AVAILABLE so health/readiness and orchestration still treat the graph as servable (it auto-reloads on demand), mirroring how llama.cpp keeps is_sleeping out of /health.
  • Endpoint exemption is automatic: the last-activity timestamp is updated only on inference acquisition (MediapipeGraphDefinition::create()); status/health/metrics endpoints never go through that path, so health probes can't keep a model awake.
  • Concurrency (hardened; see the Concurrency review section): a per-definition recursive lifecycleMtx serializes reload() / retire() / unload() / wakeUpIfUnloaded(), so the watcher thread and the config thread can never race on the graph's state machine or side packets. unload() is non-blocking: it skips graphs with in-flight requests and tears down only after confirming that the AVAILABLE → UNLOADED transition actually happened. createPipeline() wakes an UNLOADED graph with a bounded retry, re-fetching the definition on each iteration to avoid stale pointers. The idle timeout is cached in an atomic, so the watcher reads it lock-free and never reads the config unsynchronized. In-flight requests stay safe via the existing GraphIdGuard shared_ptr mechanism and the requestsHandlesCounter drain protocol.
  • Metrics: new ovms_graph_loaded gauge (1 loaded / 0 unloaded) per graph.
  • Composes with --cache_dir (see #4230, "--cache_dir not propagated to LLM continuous batching pipeline (regression vs 2025.4 promise)") so wake-up is a cache import rather than a full recompile.
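
The state-machine rules in the Design list can be sketched as follows. This is a reduced model for illustration, not the PR's code: `StatusSketch`, the `Reported` enum, and the `on...` method names are hypothetical, and the real PipelineDefinitionStatus has more states and events than are shown here.

```cpp
#include <cassert>

// Reduced stand-in for the states named in this PR.
enum class State { AVAILABLE, RELOADING, UNLOADED, RETIRED };

// Hypothetical externally reported status, as seen by health/readiness.
enum class Reported { AVAILABLE, LOADING, END };

class StatusSketch {
public:
    explicit StatusSketch(State initial) : state_(initial) {}

    // UnloadEvent: a no-op unless the graph is AVAILABLE.
    // The return value tells the caller whether teardown may proceed.
    bool onUnload() {
        if (state_ != State::AVAILABLE) return false;
        state_ = State::UNLOADED;
        return true;
    }

    // ReloadEvent: also serves as the wake-up path out of UNLOADED.
    void onReload() {
        if (state_ == State::AVAILABLE || state_ == State::UNLOADED) state_ = State::RELOADING;
    }

    // A successful (re)load completes RELOADING -> AVAILABLE.
    void onLoaded() {
        assert(state_ == State::RELOADING);
        state_ = State::AVAILABLE;
    }

    void onRetire() { state_ = State::RETIRED; }

    State state() const { return state_; }
    bool isAvailable() const { return state_ == State::AVAILABLE; }

    // UNLOADED is reported as AVAILABLE because the graph reloads on demand.
    Reported convertToModelStatus() const {
        switch (state_) {
            case State::AVAILABLE:
            case State::UNLOADED: return Reported::AVAILABLE;
            case State::RELOADING: return Reported::LOADING;
            case State::RETIRED: return Reported::END;
        }
        return Reported::END;
    }

private:
    State state_;
};
```

Returning a bool from `onUnload()` is one way to let the caller confirm the transition before freeing anything, which is the property the concurrency bullet relies on.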

Scope

Phase 1 targets LLM continuous-batching graphs; graphs containing Python nodes are rejected at config validation when the timeout is set (Python teardown has GIL/threading concerns to address separately).
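
A minimal sketch of that validation rule, under assumptions: `idleUnloadAllowed` is a hypothetical helper, and the graph is represented simply as a list of calculator names. It uses the stricter form stated in the hardening comment later in this thread (only HttpLLMCalculator nodes are accepted when the timeout is set), which also excludes Python nodes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true if the configured timeout is acceptable for this graph.
// `calculators` holds the calculator name of every node in the graph.
inline bool idleUnloadAllowed(unsigned timeoutSeconds,
                              const std::vector<std::string>& calculators) {
    if (timeoutSeconds == 0) return true;  // disabled: nothing to restrict
    if (calculators.empty()) return false; // no LLM node to unload
    for (const std::string& name : calculators) {
        // Any non-LLM node (including Python nodes) rejects the config.
        if (name != "HttpLLMCalculator") return false;
    }
    return true;
}
```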

Testing

Built with MSVC and run locally (Windows, --config=win_mp_on_py_off).

Unit tests (model-free):

  • PipelineDefinitionStatus transitions: AVAILABLE→UNLOADED, UNLOADED→RELOADING→AVAILABLE, UNLOADED→RETIRED, UnloadEvent is a no-op on every other state, isAvailable() false for UNLOADED, convertToModelStatus() → AVAILABLE.
  • Schema: idle_unload_timeout_seconds accepted; negative / wrong-type rejected.
  • Unload guard behavior: unload() is a no-op (no teardown) when state ≠ AVAILABLE (e.g. RELOADING/BEGIN), and skips when requests are in flight.

Functional tests (real facebook/opt-125m):

  • Unload-after-idle frees resources; wake-up reloads and serves; create() resets the idle timer; disabled-by-default never unloads; Python-node rejection (py-on builds); concurrent wake-up all end AVAILABLE; concurrent unload/reload/retire stress ends in a consistent state with no crash.

End-to-end on an Intel Arc B70 GPU:

  • Served an LLM with idle_unload_timeout_seconds: 8; the watcher unloaded it after the idle period (logged "freed GPU/CPU resources"), and the next request transparently woke it (wake-up completed in 831ms) and returned a valid completion.

Regression:

  • Ran the full ModelManager* and *Mediapipe* suites and compared against a clean-main baseline build: identical failure set (the only failures are pre-existing, caused by test model assets not present in this local environment) — zero new failures introduced by this change.

Concurrency review

The concurrency design went through three independent adversarial review passes. The first surfaced a critical unconditional-teardown bug (the watcher could tear down a graph mid-reload) and several races. Re-reviewing the fixes caught a new data race introduced by relocating the idle sweep out of configMtx; that was fixed by the recursive lifecycleMtx serialization described above. The final pass verified that lifecycle mutators are mutually exclusive, that lock ordering is one-directional (configMtx → lifecycleMtx, never the reverse, so no deadlock), and that the drain-under-lock cannot deadlock (the only counter-decrementing path is a lock-free atomic released before inference and never touches lifecycleMtx). The remaining items are LOW/benign, with no CRITICAL/HIGH issues outstanding.
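
The locking discipline verified in the review can be illustrated with a simplified model. Everything below is a sketch under assumptions: `DefinitionSketch` and `ManagerSketch` are hypothetical, the state is collapsed to two booleans, and only the names lifecycleMtx and configMtx mirror the PR.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Simplified graph definition; starts in an unloaded-like state for brevity.
class DefinitionSketch {
public:
    // Config-driven (re)load. Takes the same mutex as every other mutator.
    void reload() {
        std::lock_guard<std::recursive_mutex> lock(lifecycleMtx_);
        resourcesHeld_ = true;  // stands in for rebuilding queue + side packets
        available_ = true;
    }

    // Idle-driven unload. Never waits: returns false if it had to skip.
    bool unload() {
        std::lock_guard<std::recursive_mutex> lock(lifecycleMtx_);
        if (inFlight_.load() != 0) return false;  // requests in flight: skip
        if (!available_) return false;            // not AVAILABLE: no-op
        available_ = false;                       // transition confirmed first...
        resourcesHeld_ = false;                   // ...then tear down
        return true;
    }

    // Lazy wake. Re-enters lifecycleMtx_ through reload(), hence recursive.
    bool wakeUpIfUnloaded() {
        std::lock_guard<std::recursive_mutex> lock(lifecycleMtx_);
        if (available_) return false;
        reload();
        return true;
    }

    void beginRequest() { inFlight_.fetch_add(1); }
    void endRequest() {
        const unsigned previous = inFlight_.fetch_sub(1);  // lock-free, takes no mutex
        assert(previous > 0);
        (void)previous;
    }

    bool loaded() const {
        std::lock_guard<std::recursive_mutex> lock(lifecycleMtx_);
        return available_ && resourcesHeld_;
    }

private:
    mutable std::recursive_mutex lifecycleMtx_;
    std::atomic<unsigned> inFlight_{0};
    bool available_ = false;
    bool resourcesHeld_ = false;
};

// Lock order is always configMtx -> lifecycleMtx, never the reverse.
struct ManagerSketch {
    std::mutex configMtx;
    DefinitionSketch def;
    void reloadConfig() {
        std::lock_guard<std::mutex> lock(configMtx);
        def.reload();  // acquires lifecycleMtx while configMtx is held
    }
};
```

Since `endRequest()` touches only the atomic, a thread finishing a request can never block on a lifecycle mutator, which is why a drain performed under the lock cannot deadlock in this model.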

Notes

  • No automated CI runs on fork branches here (upstream uses Jenkins); the above was verified by local builds/tests.

exzile changed the title from "Idle model unload for mediapipe LLM graphs (#4141)" to "Idle model unload for mediapipe LLM graphs (#4141) (WIP)" on Jun 26, 2026
exzile force-pushed the feature/idle-model-unload branch 2 times, most recently from 2e5439b to e9940c6, on June 26, 2026 23:01
exzile commented Jun 26, 2026

Hardening + verification update

Since the initial version this PR underwent a deep correctness/concurrency pass. Summary of what changed and how it was verified.

Robustness improvements added

  • In-flight inference guard. A generation that outlives idle_unload_timeout_seconds is no longer unloaded mid-stream. An RAII guard held for the executor's lifetime keeps a shared_ptr inference counter; unload() skips while it is non-zero, and completing an inference refreshes the activity timestamp. The counter/timestamp are shared_ptr so they outlive the definition if an executor is still running during retire.
  • Wake-up failure is retryable. If a lazy reload fails (e.g. model files transiently unavailable, GPU OOM), the graph reverts to UNLOADED instead of a wedged failed state; the current request gets a clean error and the next request re-attempts the wake, self-healing once the issue is resolved.
  • Phase-1 scope restriction. idle_unload_timeout_seconds is accepted only for LLM continuous-batching graphs (HttpLLMCalculator); graphs with Python nodes or non-LLM calculators are rejected at config validation.
  • Concurrency model. A per-definition recursive lifecycleMtx serializes reload/retire/unload/wakeUp; unload() is non-blocking and tears down only after confirming the AVAILABLE→UNLOADED transition; the idle timeout is cached in an atomic for lock-free watcher reads.
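
The in-flight guard can be sketched as a small RAII type. This is an assumption-laden illustration, not the PR's ActiveInferenceGuard: `ActivityState`, `nowNs`, and `ActiveInferenceGuardSketch` are hypothetical names, and the counter and timestamp are bundled into one shared object for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <memory>
#include <utility>

// Shared between the definition and every running executor, so it
// outlives the definition if an executor is still running during retire.
struct ActivityState {
    std::atomic<std::uint64_t> activeInferences{0};
    std::atomic<std::int64_t> lastActivityNs{0};
};

inline std::int64_t nowNs() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

class ActiveInferenceGuardSketch {
public:
    explicit ActiveInferenceGuardSketch(std::shared_ptr<ActivityState> state)
        : state_(std::move(state)) {
        state_->activeInferences.fetch_add(1);
    }

    ~ActiveInferenceGuardSketch() {
        // Refresh the timestamp before decrementing, so a watcher that
        // observes a zero count also observes the fresh activity time.
        state_->lastActivityNs.store(nowNs());
        const std::uint64_t previous = state_->activeInferences.fetch_sub(1);
        assert(previous > 0);
        (void)previous;
    }

    ActiveInferenceGuardSketch(const ActiveInferenceGuardSketch&) = delete;
    ActiveInferenceGuardSketch& operator=(const ActiveInferenceGuardSketch&) = delete;

private:
    std::shared_ptr<ActivityState> state_;
};
```

Because the decrement happens in the destructor, the count is restored even if the inference exits by exception.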

Testing

Unit (model-free): state-machine transitions (AVAILABLE→UNLOADED, UNLOADED→RELOADING→AVAILABLE, UNLOADED→RETIRED, LOADING_PRECONDITION_FAILED→UNLOADED on failed wake); schema accept/reject; unload() no-op off the AVAILABLE state and skip while in-flight; ActiveInferenceGuard increment/decrement and exception-safety.

Functional (real facebook/opt-125m): unload-after-idle frees resources; wake-up reloads and serves; idle-timer reset; disabled-by-default; concurrent wake; concurrent unload/reload/retire stress; failed-wake-is-retryable; Python-node rejection (py-on build). Full suite green (116 model-free/functional + the py-on guard test).

End-to-end on an Intel Arc B70 GPU:

  • Idle unload after the timeout (logs "freed GPU/CPU resources"); wake-up on next request (~0.7s) returns a valid completion.
  • Unload-during-generation: a 4.3s generation with a 2s timeout stayed loaded throughout (ovms_graph_loaded=1) and unloaded only ~2s after completion — the in-flight guard holds.
  • Soak (10 wake/unload cycles): GPU VRAM stayed bounded at 355–370 MB and the host working set stayed bounded, so no leak was observed. A loaded model adds ~102 MB of GPU memory; on unload OVMS frees the model (metric → 0, confirmed in logs) while the Intel driver pools the VRAM and reuses it across cycles. The dedicated-usage counter therefore does not drop on unload (known Intel/Windows behavior), but it also does not grow.
  • Metric ovms_graph_loaded observed via /metrics: 1 loaded / 0 unloaded.
  • Config reload + retire while UNLOADED: config reload leaves it servable; removing it from config retires it cleanly (UNLOADED→RETIRED, request → 404, server healthy).
  • Wake-failure self-heal: broke the model → request returns a clean 500 ("Reverted to UNLOADED; next request will retry"); restored the model → next request returns 200 with no config reload.
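
The wake-failure self-heal observed above can be modelled in a few lines. This is a sketch under assumptions: `WakeSketch` and its `loader` callback are hypothetical stand-ins for the real reload path, and the 500/200 return values only mirror the status codes reported in the test.

```cpp
#include <cassert>
#include <functional>
#include <utility>

enum class GraphState { UNLOADED, RELOADING, AVAILABLE };

class WakeSketch {
public:
    // `loader` returns false when the model cannot be loaded
    // (for example missing files or GPU out-of-memory).
    explicit WakeSketch(std::function<bool()> loader) : loader_(std::move(loader)) {}

    // Returns an HTTP-like status for a single request.
    int handleRequest() {
        if (state_ == GraphState::UNLOADED) {
            state_ = GraphState::RELOADING;
            const bool ok = loader_();
            // On failure, revert to UNLOADED instead of a terminal failed state.
            state_ = ok ? GraphState::AVAILABLE : GraphState::UNLOADED;
            if (!ok) return 500;  // clean error for this request only
        }
        assert(state_ == GraphState::AVAILABLE);
        return 200;
    }

    GraphState state() const { return state_; }

private:
    std::function<bool()> loader_;
    GraphState state_ = GraphState::UNLOADED;
};
```

Reverting to UNLOADED means the next request retries the load on its own, so recovery needs no config reload once the underlying problem is fixed.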

gRPC/KFS path: the lazy-wake lives in ModelManager::createPipeline(), the single chokepoint called by both the HTTP /v3 handler and the KFS gRPC inference service, so the wake is transport-agnostic.

Regression: ran the full ModelManager* and *Mediapipe* suites and diffed against a clean-main baseline build — identical failure set (only pre-existing failures from test model assets absent in the local env); zero new failures.

Concurrency review. Reviewed adversarially across four passes; the final pass is a GO with no CRITICAL/HIGH issues. Lock ordering is one-directional (configMtx → lifecycleMtx), the drain loops cannot deadlock (the inference-completion path takes no lock), and the in-flight counter is provably ≥1 for the entire span from executor construction through inference completion (no zero-gap allowing mid-inference teardown).

Known limitations / follow-ups (non-blocking)

  • retire()/reload() (config-driven) do not wait for an in-flight inference to finish — this is pre-existing behavior (running inferences stay safe via the executor's own shared_ptr/value-copy of resources); the new in-flight guard intentionally protects only the idle-unload path.
  • Data-race fixes were verified by static reasoning + concurrency stress tests locally; ThreadSanitizer is the gold standard for the atomics/lock interactions and runs only on the Linux CI — recommended there.
  • No automated CI runs on fork branches here; the above was verified via local builds/tests (Windows/MSVC + Intel Arc GPU).

exzile changed the title from "Idle model unload for mediapipe LLM graphs (#4141) (WIP)" to "Idle model unload for mediapipe LLM graphs (#4141)" on Jun 27, 2026

Adds an opt-in idle timeout that unloads an LLM graph's heavy resources
(freeing GPU/CPU memory) after a period with no inference, and lazily
reloads on the next request, so the GPU can be shared with other
workloads. Follows llama.cpp's --sleep-idle-seconds model.

- Config: idle_unload_timeout_seconds on the mediapipe config entry
  (0 = disabled, default). Added to the JSON schema. Phase-1 scope:
  restricted to LLM continuous-batching graphs (HttpLLMCalculator);
  rejected for graphs with Python nodes or non-LLM calculators at config
  validation.
- State machine: new UNLOADED state + UnloadEvent. AVAILABLE->UNLOADED on
  idle; UNLOADED wakes via the existing reload path. isAvailable() is false
  for UNLOADED, convertToModelStatus() maps it to AVAILABLE so health/
  readiness still see the servable (it auto-reloads).
- Concurrency: a per-definition recursive lifecycleMtx serializes
  reload/retire/unload/wakeUp so the watcher and config threads never race
  on graph state or side packets. unload() is non-blocking, tears down only
  after confirming the AVAILABLE->UNLOADED transition, and skips while
  requests or inferences are in flight. The idle timeout is cached in an
  atomic for lock-free watcher reads.
- In-flight guard: an RAII ActiveInferenceGuard (held for the executor's
  lifetime) keeps a shared_ptr inference counter so a generation that
  outlives the idle timeout is never unloaded mid-stream; completing an
  inference refreshes the activity timestamp. lastActivityTimeNs and the
  counter are shared_ptr so they outlive the definition if an executor is
  still running during retire.
- Wake-up failure is retryable: if the lazy reload fails, the graph reverts
  to UNLOADED (not a wedged failed state) so the next request re-attempts
  the wake and self-heals once the underlying issue is resolved; the
  current request gets a clean error.
- Metrics: ovms_graph_loaded gauge (1 loaded / 0 unloaded) per graph.
- Composes with --cache_dir so wake-up is a cache import, not a recompile.

Tested: state-machine + schema unit tests; in-flight-guard and wake-failure
unit tests; functional unload/reload/idle-reset/disabled-default/concurrency
with a real model; and end-to-end on an Intel Arc GPU (idle unload frees
resources, long generations are not unloaded mid-stream, wake-up reloads and
serves, soak shows no leak, failed wake self-heals). No regressions vs main.

Implements openvinotoolkit#4141

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
exzile force-pushed the feature/idle-model-unload branch from e9940c6 to 65f2bab on June 27, 2026 01:03
Development

Successfully merging this pull request may close these issues.

Unload model from memory/GPU when idle for set period