
feat: add ErrorTransformer for llama.cpp #729

Merged
doringeman merged 1 commit into docker:main from doringeman:llamacpp-error-transformer
Mar 4, 2026
Conversation

@doringeman (Contributor)

Add ErrorTransformer for llama.cpp to surface friendly error messages.

E.g.,

$ MODEL_RUNNER_HOST=http://localhost:8080/ docker model run gpt-oss hi
Failed to generate a response: error response: status=500 body=unable to load runner: error waiting for runner to be ready: llama.cpp terminated unexpectedly: llama.cpp failed: not enough GPU memory to load the model (CUDA)
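
The PR's actual errors.go is not reproduced here, but a regex-driven transformer producing a friendly message like the one above might look roughly like this Go sketch (all names and patterns are illustrative assumptions, not the merged code):

```go
package main

import (
	"fmt"
	"regexp"
)

// errorPattern pairs a regex over llama.cpp's raw output with a
// friendlier replacement message. Hypothetical names for illustration.
type errorPattern struct {
	re      *regexp.Regexp
	message string
}

var patterns = []errorPattern{
	{regexp.MustCompile(`cudaMalloc failed: out of memory`),
		"not enough GPU memory to load the model (CUDA)"},
	{regexp.MustCompile(`failed to allocate .* buffer`),
		"not enough memory to load the model"},
}

// transformError returns a friendly message for a known failure mode,
// or the raw output unchanged when no pattern matches.
func transformError(raw string) string {
	for _, p := range patterns {
		if p.re.MatchString(raw) {
			return "llama.cpp failed: " + p.message
		}
	}
	return raw
}

func main() {
	raw := "ggml_backend_cuda_buffer_type_alloc_buffer: allocating 10723.15 MiB on device 0: cudaMalloc failed: out of memory"
	fmt.Println(transformError(raw))
}
```

Each pattern is checked in order against the backend's output, so the most specific diagnostics (e.g. the CUDA case) should come first in the list.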

Logs:

time=2026-03-03T21:33:19.366Z level=INFO msg="load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)"
time=2026-03-03T21:36:19.179Z level=INFO msg="ggml_backend_cuda_buffer_type_alloc_buffer: allocating 10723.15 MiB on device 0: cudaMalloc failed: out of memory"
time=2026-03-03T21:36:19.181Z level=INFO msg="alloc_tensor_range: failed to allocate CUDA0 buffer of size 11244037120"
time=2026-03-03T21:36:20.133Z level=INFO msg="llama_model_load: error loading model: unable to allocate CUDA0 buffer"
time=2026-03-03T21:36:20.133Z level=INFO msg="llama_model_load_from_file_impl: failed to load model"
time=2026-03-03T21:36:21.070Z level=INFO msg="common_init_from_params: failed to load model '/models/bundles/sha256/9398339cb0d3b150931212377c16d2a105ddce053ec187e4397ba6e10f3ea112/model/model.gguf'"
time=2026-03-03T21:36:21.070Z level=INFO msg="srv    load_model: failed to load model, '/models/bundles/sha256/9398339cb0d3b150931212377c16d2a105ddce053ec187e4397ba6e10f3ea112/model/model.gguf'"
time=2026-03-03T21:36:21.070Z level=INFO msg="srv    operator(): operator(): cleaning up before exit..."
time=2026-03-03T21:36:21.077Z level=INFO msg="main: exiting due to model loading error"
time=2026-03-03T21:36:21.933Z level=WARN msg="Backend running model exited with error" backend=llama.cpp model=gpt-oss error="llama.cpp terminated unexpectedly: llama.cpp failed: not enough GPU memory to load the model (CUDA)"
time=2026-03-03T21:36:22.342Z level=INFO msg="getting model by reference" component=model-manager reference=sha256:9398339cb0d3b150931212377c16d2a105ddce053ec187e4397ba6e10f3ea112
time=2026-03-03T21:36:22.343Z level=INFO msg="Listing available models" component=model-manager
time=2026-03-03T21:36:22.362Z level=INFO msg="successfully listed models" component=model-manager count=1
time=2026-03-03T21:36:22.371Z level=INFO msg="Removed records for model" component=openai-recorder model=sha256:9398339cb0d3b150931212377c16d2a105ddce053ec187e4397ba6e10f3ea112
time=2026-03-03T21:36:22.371Z level=WARN msg="Backend runner initialization failed" backend=llama.cpp model=sha256:9398339cb0d3b150931212377c16d2a105ddce053ec187e4397ba6e10f3ea112 mode=completion error="llama.cpp terminated unexpectedly: llama.cpp failed: not enough GPU memory to load the model (CUDA)"

Improves UX for #709.

Signed-off-by: Dorin Geman <dorin.geman@docker.com>

@gemini-code-assist (bot) left a comment:


Code Review

This pull request introduces an ErrorTransformer for the llama.cpp backend to improve error reporting. A new errors.go file defines regular expressions to catch common llama.cpp errors and replace them with more user-friendly messages. This is accompanied by unit tests in errors_test.go. The new error transformer is then integrated into the main llama.cpp backend logic. The changes are well-structured and effectively address the goal of providing clearer error feedback to users.


@sourcery-ai (bot) left a comment:


Hey - I've left some high level feedback:

  • Consider making the regex patterns slightly more robust (e.g., using case-insensitive flags or matching just the key terms like failed to allocate buffer / out of memory) so they continue to work if upstream log wording changes slightly.
  • Returning the full stderr output when no pattern matches may surface very noisy logs to the user; you might want to truncate or sanitize the fallback message, or prepend a short generic summary and include the raw output only for debugging.
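
The two suggestions above could be sketched together as follows; the regex flag, phrase choices, and truncation limit are assumptions for illustration, not code from this PR:

```go
package main

import (
	"fmt"
	"regexp"
)

// Case-insensitive pattern keyed on stable phrases rather than exact
// upstream log wording, per the review suggestion.
var oomRe = regexp.MustCompile(`(?i)out of memory|failed to allocate .* buffer`)

// maxFallback bounds how much raw output is surfaced when nothing matches.
const maxFallback = 512

func friendlyError(raw string) string {
	if oomRe.MatchString(raw) {
		return "not enough memory to load the model"
	}
	// Fallback: short generic summary plus a bounded tail of the raw
	// output, instead of the full (possibly very noisy) stderr dump.
	if len(raw) > maxFallback {
		return "llama.cpp failed (output truncated): " + raw[len(raw)-maxFallback:]
	}
	return "llama.cpp failed: " + raw
}

func main() {
	fmt.Println(friendlyError("CUDAMALLOC FAILED: OUT OF MEMORY"))
}
```

The `(?i)` flag and the broader key phrases keep the match working if upstream capitalization or surrounding wording shifts, while the truncated fallback keeps the user-facing error readable without discarding the most recent (and usually most relevant) lines of output.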


@doringeman doringeman merged commit 9039311 into docker:main Mar 4, 2026
6 checks passed
