Closed
Description
I'm trying to run models with GPU offload (GPU support is enabled), but I'm getting this error: background model preload failed: preload failed: status=500 body=unable to load runner
I tried with these models:
- ai/qwen3:latest
- huggingface.co/menlo/jan-nano-128k-gguf:Q4_K_M
- huggingface.co/menlo/lucy-128k-gguf:Q4_K_M
- huggingface.co/unsloth/qwen3-30b-a3b-instruct-2507-gguf:Q4_K_M
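All four failures below are the same cudaMalloc out-of-memory on device 0: the CUDA0 weight buffer llama.cpp requests is larger than the free VRAM. The arithmetic is a straight bytes-to-MiB fit check; the buffer sizes below are taken verbatim from the logs, while `free_mib` is a made-up placeholder, not a figure from my machine:

```python
# Rough fit check: will the model's CUDA0 weight buffer fit in free VRAM?
# Buffer byte counts come from the alloc_tensor_range lines in the logs;
# free_mib is a hypothetical placeholder value.

def fits(buffer_bytes: int, free_mib: float) -> bool:
    """Return True if the requested buffer fits in the given free VRAM (MiB)."""
    return buffer_bytes / (1024 * 1024) <= free_mib

free_mib = 4096.0  # placeholder: assume 4 GiB free on device 0
print(fits(4671766528, free_mib))  # ai/qwen3 weight buffer (4455.34 MiB) -> does not fit
print(fits(1101457408, free_mib))  # lucy-128k buffer (1050.43 MiB) -> fits
```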
ai/qwen3:latest:
docker model run ai/qwen3:latest
> hello
background model preload failed: preload failed: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4455.34 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 4671766528
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/bundles/sha256/79fa56c07429f64f41950fe2f524937cf3ae9ea9bd3d7ada72170b036ea3cc85/model/model.gguf'
srv load_model: failed to load model, '/models/bundles/sha256/79fa56c07429f64f41950fe2f524937cf3ae9ea9bd3d7ada72170b036ea3cc85/model/model.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
nd|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)

huggingface.co/menlo/jan-nano-128k-gguf:Q4_K_M:
docker model run huggingface.co/menlo/jan-nano-128k-gguf:Q4_K_M
> hello
background model preload failed: preload failed: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2375.91 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 2491323904
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/bundles/sha256/5bccc713c51a9a716cc68ebfec3206a5387a853b8116aca43b12f2051395227b/model/model.gguf'
srv load_model: failed to load model, '/models/bundles/sha256/5bccc713c51a9a716cc68ebfec3206a5387a853b8116aca43b12f2051395227b/model/model.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
nd|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)

huggingface.co/menlo/lucy-128k-gguf:Q4_K_M:
docker model run huggingface.co/menlo/lucy-128k-gguf:Q4_K_M
> hello
background model preload failed: preload failed: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1050.43 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 1101457408
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/bundles/sha256/5d3943155c1efc5da8fe16f5bca8163d97d9c632c2cc4167e66fd58adef5a6e3/model/model.gguf'
srv load_model: failed to load model, '/models/bundles/sha256/5d3943155c1efc5da8fe16f5bca8163d97d9c632c2cc4167e66fd58adef5a6e3/model/model.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
nd|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)

huggingface.co/unsloth/qwen3-30b-a3b-instruct-2507-gguf:Q4_K_M:
docker model run huggingface.co/unsloth/qwen3-30b-a3b-instruct-2507-gguf:Q4_K_M
> hello
Failed to generate a response: error response: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17524.43 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 18375698432
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/bundles/sha256/d9413b1eba770e2a7fe89c81f2f8b9474b8891b7e24eafcd74603b00114e26d0/model/model.gguf'
srv load_model: failed to load model, '/models/bundles/sha256/d9413b1eba770e2a7fe89c81f2f8b9474b8891b7e24eafcd74603b00114e26d0/model/model.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
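For reference, the byte counts and MiB figures in the four logs are internally consistent (standard bytes-to-MiB conversion, 1 MiB = 1024 * 1024 bytes); each pair below is copied from one alloc_tensor_range / ggml_backend_cuda_buffer_type_alloc_buffer pair above:

```python
# Cross-check: each log's raw byte count matches its logged MiB figure.
allocs = {
    "ai/qwen3:latest": (4671766528, 4455.34),
    "jan-nano-128k-gguf:Q4_K_M": (2491323904, 2375.91),
    "lucy-128k-gguf:Q4_K_M": (1101457408, 1050.43),
    "qwen3-30b-a3b-instruct-2507-gguf:Q4_K_M": (18375698432, 17524.43),
}
for name, (nbytes, mib) in allocs.items():
    assert round(nbytes / (1024 * 1024), 2) == mib, name
print("all four byte counts match the logged MiB figures")
```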