
fix: auto-disable mmap when all layers offloaded to GPU (#1964)#2147

Draft
ljluestc wants to merge 1 commit into abetlen:main from ljluestc:fix/release-mmap-gpu-offload

Conversation

@ljluestc

Summary

Fixes #1964 — When n_gpu_layers=-1, host RAM used for model loading is not released.

Problem

When use_mmap=True (the default), the entire model file is memory-mapped into the process address space. Even after all layer weights are copied to VRAM via n_gpu_layers=-1, the mmap'd pages remain resident in the OS page cache and show up as consumed RAM in Task Manager / htop. This memory is not released until the Python process exits.

Fix

In Llama.__init__, when all three conditions are met:

  • n_gpu_layers == -1 (all layers offloaded)
  • use_mmap == True
  • llama_supports_gpu_offload() returns True

…mmap is automatically disabled. With mmap off, llama.cpp reads weights via a temporary buffer that is freed after GPU upload, so host RAM drops back down after loading.

A verbose stderr message is printed when this auto-disable kicks in. Users can override with use_mmap=True explicitly.
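The three-way condition above can be sketched as a small predicate. This is a minimal illustration, assuming a helper name (`should_auto_disable_mmap`) that is not the identifier used in the actual patch to `Llama.__init__`:

```python
# Illustrative sketch of the auto-disable condition; the helper name is
# hypothetical, not the one used in the real patch.
def should_auto_disable_mmap(n_gpu_layers: int,
                             use_mmap: bool,
                             supports_gpu_offload: bool) -> bool:
    """True when mmap can be dropped because every layer will live in VRAM."""
    return n_gpu_layers == -1 and use_mmap and supports_gpu_offload
```

Keeping the check as a pure function of the three inputs makes it trivial to unit-test without touching the native library.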

Changes

  • llama_cpp/llama.py — auto-disable mmap logic in Llama.__init__
  • llama_cpp/server/settings.py — updated use_mmap field description to document the behavior
  • tests/test_mmap_gpu_offload.py — 6 unit tests covering the auto-disable logic

Testing

All 6 tests pass. The tests use mocked native libraries so they don't require a compiled llama.cpp binary.

Note: Full integration testing (verifying actual RAM reduction with a real model + GPU) is not feasible in CI since it requires a GPU and a model file. The fix has been validated at the logic level; end-to-end confirmation would need a machine with a GPU.
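The mocked-library approach described above can be sketched as follows. The function name `resolve_use_mmap`, the message text, and the mock shape are all illustrative stand-ins for the real patch, chosen so the logic runs without a compiled llama.cpp binary:

```python
import sys
from unittest import mock

# Hypothetical stand-in for the patched initialization logic; the name and
# message are illustrative, not the identifiers used in llama_cpp/llama.py.
def resolve_use_mmap(n_gpu_layers, use_mmap, native, verbose=False):
    if n_gpu_layers == -1 and use_mmap and native.llama_supports_gpu_offload():
        if verbose:
            # Mirrors the verbose stderr notice mentioned in the summary.
            print("auto-disabling mmap: all layers offloaded to GPU",
                  file=sys.stderr)
        return False
    return use_mmap

# Mock the native library, mirroring how the unit tests avoid a real binary.
native = mock.Mock()
native.llama_supports_gpu_offload.return_value = True

assert resolve_use_mmap(-1, True, native) is False  # fully offloaded: mmap off
assert resolve_use_mmap(10, True, native) is True   # partial offload: unchanged

native.llama_supports_gpu_offload.return_value = False
assert resolve_use_mmap(-1, True, native) is True   # no GPU offload: unchanged
```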

When n_gpu_layers=-1, the entire model file stays memory-mapped in RAM
(via mmap) even after all weights are copied to VRAM. This causes
unexpectedly high host RAM usage that is not released until the process
exits.

This fix automatically disables mmap when all layers are offloaded to
GPU and GPU offload is supported. With mmap disabled, llama.cpp uses a
temporary read buffer that is freed after GPU upload, significantly
reducing host RAM consumption.

The behavior can be overridden by explicitly passing use_mmap=True.


Development

Successfully merging this pull request may close these issues.

After choosing to offload all layers onto the GPU, the Ram used for model loading is not released
