Description
As an extra note, I noticed that this auto-fit logic from llama-cpp is bypassed by default, because we pass a large value for `gpu_layers` (not sure if extra logic is involved, but that's what I saw for my model config).

I was able to enable auto-fit by setting `gpu_layers = -1` in my model config, which maps to llama-cpp's auto-fit flag value; at that point I hit the `tensor_buft_override` buffer error that this PR addresses. Curious whether I should update the markdowns to advertise enabling auto-fit by setting `gpu_layers = -1` for anyone interested in leveraging it, or whether we should change the default logic. @mudler thoughts?
I'm still weighing the best approach here due to a subtle trade-off. LocalAI already unloads models based on a VRAM usage threshold, and enabling auto-fitting by default in llama.cpp would conflict with this mechanism. Fixing it properly would require either custom logic or a runtime toggle. In the short term, documenting the limitation is the best path forward.
Originally posted by @mudler in #8560 (comment)
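For anyone who wants to try auto-fit in the meantime, here is a minimal sketch of a model config with the flag set. It assumes LocalAI's YAML model config format; the model name, backend identifier, and file path are placeholders, and `gpu_layers: -1` is the only setting relevant to auto-fit.

```yaml
# Hypothetical LocalAI model config; names and paths are placeholders.
name: my-llama-model
backend: llama-cpp        # assumed backend identifier; adjust to your setup
parameters:
  model: my-model.gguf    # GGUF file under the models directory
context_size: 4096
# -1 maps to llama.cpp's auto-fit behaviour instead of offloading a fixed
# (large) number of layers, per the discussion above.
gpu_layers: -1
```

Note that, as discussed above, auto-fit can interact badly with LocalAI's VRAM-threshold model unloading, so treat this as a workaround until the default logic or a runtime toggle is settled.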