
feat: add GGUF/GGML model support via llama-cpp-python #3826

Closed
crawfordxx wants to merge 1 commit into lm-sys:main from crawfordxx:feat-gguf-model-support

Conversation

@crawfordxx

Summary

Adds a new model worker backend (gguf_worker.py) that uses llama-cpp-python to load and serve quantized GGUF/GGML models through FastChat.

  • New file: fastchat/serve/gguf_worker.py — follows the same worker interface pattern as mlx_worker.py and vllm_worker.py
  • Optional dependency: llama-cpp-python>=0.2.0 added to pyproject.toml under [gguf] extra
  • Documentation: docs/gguf_integration.md with setup and usage instructions
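
The `[gguf]` extra described above would typically be declared as a PEP 621 optional dependency. A sketch of what that entry might look like (the exact table layout in FastChat's pyproject.toml may differ):

```toml
[project.optional-dependencies]
gguf = ["llama-cpp-python>=0.2.0"]
```

With this in place, `pip install "fschat[gguf]"` would pull in the backend alongside FastChat.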

Features

  • Streaming and non-streaming text generation
  • Configurable GPU layer offloading (--n-gpu-layers)
  • Configurable context window (--n-ctx) and batch size (--n-batch)
  • Standard FastChat worker lifecycle (heartbeat, controller registration, semaphore concurrency)
  • Compatible with the OpenAI-compatible API server
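
The streaming path can be sketched as a generator that accumulates decoded tokens and yields partial results as JSON chunks, the pattern other FastChat workers follow. The `fake_token_source` stub below is purely illustrative and stands in for the token iterator llama-cpp-python returns in streaming mode (real FastChat workers delimit chunks with a null byte; a newline is used here for readability):

```python
import json

def fake_token_source():
    # Stand-in for llama-cpp-python's streaming iterator; each item
    # is one decoded text fragment.
    for piece in ["Hello", ",", " world", "!"]:
        yield piece

def generate_stream(token_source):
    """Yield JSON chunks containing the text generated so far,
    followed by a final chunk carrying a finish marker."""
    text = ""
    for piece in token_source:
        text += piece
        yield json.dumps({"text": text, "error_code": 0}) + "\n"
    yield json.dumps(
        {"text": text, "finish_reason": "stop", "error_code": 0}
    ) + "\n"

chunks = list(generate_stream(fake_token_source()))
```

A non-streaming call is then just the last chunk of the same loop, which is how workers typically share one generation path between both modes.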

Usage

pip install "llama-cpp-python>=0.2.0"

python3 -m fastchat.serve.gguf_worker \
    --model-path ./models/llama-2-7b-chat.Q4_K_M.gguf \
    --model-names llama-2-7b-chat \
    --conv-template llama-2 \
    --n-gpu-layers -1
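
The flags in the usage example above imply a CLI surface along these lines. This is a hypothetical argparse sketch, not the actual gguf_worker.py (which would also define host, port, and controller-address flags); note that `--n-gpu-layers -1` conventionally means "offload all layers" in llama.cpp:

```python
import argparse

# Hypothetical reconstruction of the worker's CLI flags.
parser = argparse.ArgumentParser(description="GGUF model worker (sketch)")
parser.add_argument("--model-path", type=str, required=True,
                    help="Path to a .gguf model file")
parser.add_argument("--model-names", type=lambda s: s.split(","),
                    help="Comma-separated names registered with the controller")
parser.add_argument("--conv-template", type=str, default=None,
                    help="Conversation template name, e.g. llama-2")
parser.add_argument("--n-gpu-layers", type=int, default=0,
                    help="Layers to offload to GPU; -1 offloads all layers")
parser.add_argument("--n-ctx", type=int, default=2048,
                    help="Context window size in tokens")
parser.add_argument("--n-batch", type=int, default=512,
                    help="Prompt-processing batch size")

args = parser.parse_args([
    "--model-path", "./models/llama-2-7b-chat.Q4_K_M.gguf",
    "--model-names", "llama-2-7b-chat",
    "--conv-template", "llama-2",
    "--n-gpu-layers", "-1",
])
```

Parsing the usage example's arguments this way yields `args.n_gpu_layers == -1` with `n_ctx` and `n_batch` left at their defaults.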

Closes #2410

Add a new model worker backend that uses llama-cpp-python to load and
serve quantized GGUF/GGML models. This enables running large language
models efficiently on CPU or with partial GPU offloading.

- Add gguf_worker.py following the same pattern as mlx_worker/vllm_worker
- Support streaming and non-streaming text generation
- Support configurable GPU offloading (n_gpu_layers), context size, batch size
- Add llama-cpp-python as an optional dependency in pyproject.toml
- Add integration documentation in docs/gguf_integration.md

Closes lm-sys#2410
@crawfordxx crawfordxx closed this Apr 1, 2026
@crawfordxx crawfordxx deleted the feat-gguf-model-support branch April 1, 2026 15:12


Development

Successfully merging this pull request may close these issues.

[Feature request] Support loading GGUF and GGML model format