auto-fit vs unloading with threshold #8562

Description

@mudler

As an extra note, I noticed that this auto-fit logic from llama.cpp is bypassed by default, because we pass a large value for gpu_layers (not sure if extra logic is involved, but that's what I observed with my model config).

I was able to enable auto-fit by setting gpu_layers = -1 in my model config, which corresponds to llama.cpp's auto-fit flag value. At that point I hit the tensor_buft_override buffer error that this PR addresses.
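For anyone following along, here's a minimal sketch of such a model config. The field layout is assumed from LocalAI's usual YAML model configs, and the model/file names are placeholders:

```yaml
# Sketch of a LocalAI model config -- layout assumed, names are placeholders.
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf
# -1 corresponds to llama.cpp's auto-fit value; the default (a large
# layer count) bypasses the auto-fit logic entirely, as noted above.
gpu_layers: -1
```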

Curious whether I should update the markdown docs to advertise enabling auto-fit via gpu_layers = -1 for anyone interested in leveraging it, or whether we should change the default logic instead. @mudler thoughts?

I'm still weighing the best approach here due to a subtle trade-off. LocalAI already unloads models based on a VRAM usage threshold, and enabling auto-fitting by default in llama.cpp would conflict with this mechanism. This requires either custom logic or a runtime toggle to fix correctly. In the short term, documenting the limitation is the best path forward.

Originally posted by @mudler in #8560 (comment)
