
fix: auto-disable mmap when all layers offloaded to GPU (#1964)#2147

Draft
ljluestc wants to merge 1 commit into abetlen:main from ljluestc:fix/release-mmap-gpu-offload

Conversation

@ljluestc

Summary

Fixes #1964 — When n_gpu_layers=-1, host RAM used for model loading is not released.

Problem

When use_mmap=True (the default), the entire model file is memory-mapped into the process address space. Even after all layer weights are copied to VRAM via n_gpu_layers=-1, the mmap'd pages remain resident in the OS page cache and show up as consumed RAM in Task Manager / htop. This memory is not released until the Python process exits.

Fix

In Llama.__init__, when all three conditions are met:

  • n_gpu_layers == -1 (all layers offloaded)
  • use_mmap == True
  • llama_supports_gpu_offload() returns True

…mmap is automatically disabled. With mmap off, llama.cpp reads weights via a temporary buffer that is freed after GPU upload, so host RAM drops back down after loading.

A verbose stderr message is printed when this auto-disable kicks in. Users can override with use_mmap=True explicitly.
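The three-way condition above can be sketched as a small predicate. This is a minimal illustration, assuming a helper name (`should_auto_disable_mmap`) that is not the identifier used in the actual patch to `Llama.__init__`:

```python
# Illustrative sketch of the auto-disable condition; the helper name is
# hypothetical, not the one used in the real patch.
def should_auto_disable_mmap(n_gpu_layers: int,
                             use_mmap: bool,
                             supports_gpu_offload: bool) -> bool:
    """True when mmap can be dropped because every layer will live in VRAM."""
    return n_gpu_layers == -1 and use_mmap and supports_gpu_offload
```

Keeping the check as a pure function of the three inputs makes it trivial to unit-test without touching the native library.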

Changes

  • llama_cpp/llama.py — auto-disable mmap logic in Llama.__init__
  • llama_cpp/server/settings.py — updated use_mmap field description to document the behavior
  • tests/test_mmap_gpu_offload.py — 6 unit tests covering the auto-disable logic

Testing

All 6 tests pass. The tests use mocked native libraries so they don't require a compiled llama.cpp binary.

Note: Full integration testing (verifying actual RAM reduction with a real model + GPU) is not feasible in CI since it requires a GPU and a model file. The fix has been validated at the logic level; end-to-end confirmation would need a machine with a GPU.
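The mocked-library approach described above can be sketched as follows. The function name `resolve_use_mmap`, the message text, and the mock shape are all illustrative stand-ins for the real patch, chosen so the logic runs without a compiled llama.cpp binary:

```python
import sys
from unittest import mock

# Hypothetical stand-in for the patched initialization logic; the name and
# message are illustrative, not the identifiers used in llama_cpp/llama.py.
def resolve_use_mmap(n_gpu_layers, use_mmap, native, verbose=False):
    if n_gpu_layers == -1 and use_mmap and native.llama_supports_gpu_offload():
        if verbose:
            # Mirrors the verbose stderr notice mentioned in the summary.
            print("auto-disabling mmap: all layers offloaded to GPU",
                  file=sys.stderr)
        return False
    return use_mmap

# Mock the native library, mirroring how the unit tests avoid a real binary.
native = mock.Mock()
native.llama_supports_gpu_offload.return_value = True

assert resolve_use_mmap(-1, True, native) is False  # fully offloaded: mmap off
assert resolve_use_mmap(10, True, native) is True   # partial offload: unchanged

native.llama_supports_gpu_offload.return_value = False
assert resolve_use_mmap(-1, True, native) is True   # no GPU offload: unchanged
```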

When n_gpu_layers=-1, the entire model file stays memory-mapped in RAM
(via mmap) even after all weights are copied to VRAM. This causes
unexpectedly high host RAM usage that is not released until the process
exits.

This fix automatically disables mmap when all layers are offloaded to
GPU and GPU offload is supported. With mmap disabled, llama.cpp uses a
temporary read buffer that is freed after GPU upload, significantly
reducing host RAM consumption.

The behavior can be overridden by explicitly passing use_mmap=True.


Development

Successfully merging this pull request may close these issues.

After choosing to offload all layers onto the GPU, the Ram used for model loading is not released
