On-demand automatic model loading/unloading (VRAM management) #14

@eshork

Description

Summary

Add intelligent automatic VRAM management for Ollama models — preloading models into VRAM before they're requested and evicting them when VRAM pressure is high.

Current Behavior

Ollama loads models into VRAM on first inference request (cold start adds several seconds). Models stay resident until Ollama's built-in LRU eviction kicks them out. Users have no control over which models are warm or how VRAM is budgeted.

Desired Behavior

  • Warm model list: User designates models that should always be loaded in VRAM when the system is idle
  • Preloading: On boot or after a model download, warm models are automatically loaded without waiting for a chat request
  • VRAM budget: User sets a VRAM reservation (e.g. "keep 2 GB free") so the system doesn't over-commit
  • Priority-based eviction: When VRAM is needed, evict least-recently-used non-warm models first; warm models are evicted last (see the sketch after this list)
  • TUI integration: Show VRAM budget and warm model configuration in the Models screen or a dedicated VRAM management screen
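
A minimal sketch of the eviction ordering described above, assuming some in-process bookkeeping of loaded models (the `LoadedModel` dataclass and `pick_eviction_candidate` helper are hypothetical names, not existing code):

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class LoadedModel:
    name: str
    last_used: float  # unix timestamp of the model's last inference request
    size_vram: int    # bytes, as reported by GET /api/ps


def pick_eviction_candidate(loaded: list[LoadedModel],
                            warm: set[str]) -> LoadedModel | None:
    """Pick the next model to unload: non-warm models first, least recently
    used first; a warm model is only returned once nothing else is loaded."""
    if not loaded:
        return None
    # False sorts before True, so non-warm models come first; within each
    # group the oldest last_used wins.
    return min(loaded, key=lambda m: (m.name in warm, m.last_used))
```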

Technical Notes

  • Ollama's POST /api/generate with keep_alive controls per-model VRAM residency (see the sketch after these notes)
  • GET /api/ps shows currently loaded models with VRAM usage
  • nvidia-smi provides total/used VRAM for budget enforcement
  • Could be implemented as a lightweight background service or within the TUI's periodic refresh loop
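
A minimal sketch of the calls involved, assuming Ollama's default local endpoint; the helper names (`preload`, `unload`, `loaded_models`, `free_vram_mib`) are hypothetical, and a real implementation would need error handling and multi-GPU awareness:

```python
import subprocess
import requests

OLLAMA = "http://localhost:11434"


def preload(model: str) -> None:
    # A generate request with no prompt and keep_alive=-1 loads the model
    # and keeps it resident until it is explicitly unloaded.
    requests.post(f"{OLLAMA}/api/generate",
                  json={"model": model, "keep_alive": -1})


def unload(model: str) -> None:
    # keep_alive=0 asks Ollama to unload the model immediately.
    requests.post(f"{OLLAMA}/api/generate",
                  json={"model": model, "keep_alive": 0})


def loaded_models() -> list[dict]:
    # GET /api/ps lists currently loaded models, including size_vram.
    return requests.get(f"{OLLAMA}/api/ps").json().get("models", [])


def free_vram_mib() -> int:
    # nvidia-smi reports total/used memory; this reads the first GPU only.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    total, used = (int(x.strip()) for x in out.splitlines()[0].split(","))
    return total - used
```

A background service (or the TUI's periodic refresh loop) could poll loaded_models() and free_vram_mib() on a timer, preload any missing warm models when headroom allows, and unload the candidate chosen by the eviction policy above when the configured reservation is breached.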

Related

  • Manual load/unload is being added to the TUI Models screen (current sprint)
  • VRAM display is being added to the Dashboard (current sprint)

Priority

Nice to have — the manual load/unload covers the immediate need. This is the automated layer on top.
