Description
As an extra note, I noticed that this auto-fit logic from llama-cpp is bypassed by default, because we pass a large value for `gpu_layers` (not sure if extra logic is involved, but that's what I saw for my model config).

I was able to enable auto-fit by setting `gpu_layers = -1` in my model config, which maps to llama-cpp's auto-fit flag value; at that point I hit the `tensor_buft_override` buffer error that this PR addresses. Curious whether I should update the markdowns to advertise enabling auto-fit by setting `gpu_layers = -1` for anyone interested in leveraging it, or whether we should change the default logic. @mudler thoughts?
I'm still weighing the best approach here due to a subtle trade-off. LocalAI already unloads models based on a VRAM usage threshold, and enabling auto-fitting by default in llama.cpp would conflict with this mechanism. Fixing it properly would require either custom logic or a runtime toggle. In the short term, documenting the limitation is the best path forward.
Originally posted by @mudler in #8560 (comment)
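For anyone who wants to try auto-fit in the meantime, here is a minimal sketch of a model config with the flag set. It assumes LocalAI's YAML model config format; the model name, backend identifier, and file path are placeholders, and `gpu_layers: -1` is the only setting relevant to auto-fit.

```yaml
# Hypothetical LocalAI model config; names and paths are placeholders.
name: my-llama-model
backend: llama-cpp        # assumed backend identifier; adjust to your setup
parameters:
  model: my-model.gguf    # GGUF file under the models directory
context_size: 4096
# -1 maps to llama.cpp's auto-fit behaviour instead of offloading a fixed
# (large) number of layers, per the discussion above.
gpu_layers: -1
```

Note that, as discussed above, auto-fit can interact badly with LocalAI's VRAM-threshold model unloading, so treat this as a workaround until the default logic or a runtime toggle is settled.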