Summary
Add intelligent automatic VRAM management for Ollama models — preloading models into VRAM before they're requested and evicting them when VRAM pressure is high.
Current Behavior
Ollama loads models into VRAM on first inference request (cold start adds several seconds). Models stay resident until Ollama's built-in LRU eviction kicks them out. Users have no control over which models are warm or how VRAM is budgeted.
Desired Behavior
- Warm model list: User designates models that should always be loaded in VRAM when the system is idle
- Preloading: On boot or after a model download, warm models are automatically loaded without waiting for a chat request
- VRAM budget: User sets a VRAM reservation (e.g. "keep 2 GB free") so the system doesn't over-commit
- Priority-based eviction: When VRAM is needed, evict least-recently-used non-warm models first; warm models are evicted last
- TUI integration: Show VRAM budget and warm model configuration in the Models screen or a dedicated VRAM management screen
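The eviction priority above reduces to a simple sort key. A minimal sketch in Python (the `LoadedModel` type and `eviction_order` name are hypothetical, not part of any existing code):

```python
from dataclasses import dataclass

@dataclass
class LoadedModel:
    name: str
    last_used: float  # unix timestamp of the model's last inference
    warm: bool        # user-designated always-resident model

def eviction_order(models: list[LoadedModel]) -> list[LoadedModel]:
    """Order candidates for eviction: non-warm LRU first, warm models last.

    Sorting on (warm, last_used) groups non-warm models ahead of warm ones,
    and within each group evicts the least recently used first.
    """
    return sorted(models, key=lambda m: (m.warm, m.last_used))
```

Under this ordering, a warm model is only reclaimed once every non-warm model has already been evicted.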
Technical Notes
- Ollama's `POST /api/generate` with `keep_alive` controls per-model VRAM residency
- `GET /api/ps` shows currently loaded models with VRAM usage
- `nvidia-smi` provides total/used VRAM for budget enforcement
- Could be implemented as a lightweight background service or within the TUI's periodic refresh loop
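One pass of that background loop could look like the sketch below. It assumes the default Ollama endpoint on `localhost:11434` and a single NVIDIA GPU; `RESERVE_MB`, `tick`, and the helper names are illustrative, not an existing API. `keep_alive: -1` pins a model resident and `keep_alive: 0` unloads it, per Ollama's generate API:

```python
import json
import subprocess
import urllib.request

OLLAMA = "http://localhost:11434"  # assumed default Ollama endpoint
RESERVE_MB = 2048                  # user-configured "keep 2 GB free" budget

def free_vram_mb() -> int:
    """Query free VRAM on the first GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"], text=True)
    total, used = (int(x) for x in out.splitlines()[0].split(","))
    return total - used

def keep_alive_payload(model: str, resident: bool) -> dict:
    """Build the /api/generate body that pins (-1) or unloads (0) a model."""
    return {"model": model, "keep_alive": -1 if resident else 0}

def set_residency(model: str, resident: bool) -> None:
    """POST to /api/generate with no prompt to load/unload the model."""
    req = urllib.request.Request(
        f"{OLLAMA}/api/generate",
        data=json.dumps(keep_alive_payload(model, resident)).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

def tick(warm_models: list[str]) -> None:
    """One pass of the refresh loop: preload warm models within the budget."""
    for name in warm_models:
        if free_vram_mb() <= RESERVE_MB:
            break  # stop preloading rather than over-commit past the reserve
        set_residency(name, resident=True)
```

The same `tick` body fits either deployment shape: called on a timer from a standalone service, or invoked from the TUI's existing periodic refresh.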
Related
- Manual load/unload is being added to the TUI Models screen (current sprint)
- VRAM display is being added to the Dashboard (current sprint)
Priority
Nice to have — the manual load/unload covers the immediate need. This is the automated layer on top.