
docker model run fails on Docker Offload (background model preload failed: preload failed: status=500 body=unable to load runner) #709

@k33g

Description


I'm trying to run models on Docker Offload (GPU support is enabled), but every run fails with this error: background model preload failed: preload failed: status=500 body=unable to load runner

I tried with these models:

  • ai/qwen3:latest
  • huggingface.co/menlo/jan-nano-128k-gguf:Q4_K_M
  • huggingface.co/menlo/lucy-128k-gguf:Q4_K_M
  • huggingface.co/unsloth/qwen3-30b-a3b-instruct-2507-gguf:Q4_K_M

ai/qwen3:latest:

docker model run ai/qwen3:latest
> hello
background model preload failed: preload failed: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4455.34 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 4671766528
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/bundles/sha256/79fa56c07429f64f41950fe2f524937cf3ae9ea9bd3d7ada72170b036ea3cc85/model/model.gguf'
srv    load_model: failed to load model, '/models/bundles/sha256/79fa56c07429f64f41950fe2f524937cf3ae9ea9bd3d7ada72170b036ea3cc85/model/model.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)

huggingface.co/menlo/jan-nano-128k-gguf:Q4_K_M:

docker model run huggingface.co/menlo/jan-nano-128k-gguf:Q4_K_M
> hello
background model preload failed: preload failed: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2375.91 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 2491323904
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/bundles/sha256/5bccc713c51a9a716cc68ebfec3206a5387a853b8116aca43b12f2051395227b/model/model.gguf'
srv    load_model: failed to load model, '/models/bundles/sha256/5bccc713c51a9a716cc68ebfec3206a5387a853b8116aca43b12f2051395227b/model/model.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)

huggingface.co/menlo/lucy-128k-gguf:Q4_K_M:

docker model run huggingface.co/menlo/lucy-128k-gguf:Q4_K_M
> hello
background model preload failed: preload failed: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1050.43 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 1101457408
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/bundles/sha256/5d3943155c1efc5da8fe16f5bca8163d97d9c632c2cc4167e66fd58adef5a6e3/model/model.gguf'
srv    load_model: failed to load model, '/models/bundles/sha256/5d3943155c1efc5da8fe16f5bca8163d97d9c632c2cc4167e66fd58adef5a6e3/model/model.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)

huggingface.co/unsloth/qwen3-30b-a3b-instruct-2507-gguf:Q4_K_M:

docker model run huggingface.co/unsloth/qwen3-30b-a3b-instruct-2507-gguf:Q4_K_M
> hello
Failed to generate a response: error response: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17524.43 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 18375698432
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/bundles/sha256/d9413b1eba770e2a7fe89c81f2f8b9474b8891b7e24eafcd74603b00114e26d0/model/model.gguf'
srv    load_model: failed to load model, '/models/bundles/sha256/d9413b1eba770e2a7fe89c81f2f8b9474b8891b7e24eafcd74603b00114e26d0/model/model.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
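For context, the failed CUDA0 buffer sizes reported above can be converted from bytes to MiB. A minimal sketch (the logs don't show the free VRAM on the Offload GPU, so this only illustrates the per-model weight-buffer request; the byte counts are taken directly from the logs):

```python
# Failed cudaMalloc requests from the four logs above, in bytes.
# 1 MiB = 2**20 bytes; the printed values line up with llama.cpp's
# "allocating N MiB on device 0" messages (to within rounding).
failed_allocs = {
    "ai/qwen3:latest": 4671766528,                                  # ~4455.34 MiB
    "menlo/jan-nano-128k-gguf:Q4_K_M": 2491323904,                  # ~2375.91 MiB
    "menlo/lucy-128k-gguf:Q4_K_M": 1101457408,                      # ~1050.4 MiB
    "unsloth/qwen3-30b-a3b-instruct-2507-gguf:Q4_K_M": 18375698432, # ~17524.4 MiB
}

for model, nbytes in failed_allocs.items():
    print(f"{model:50s} {nbytes / 2**20:10.2f} MiB")
```

The failure pattern is identical in all four cases: cudaMalloc fails while allocating the model's weight buffer on device 0, even for the ~1 GiB model. That points at the Offload GPU running out of (or not exposing) memory, rather than a problem with any one model.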
