
feat: add GGUF/GGML model worker using llama-cpp-python #3827

Open
crawfordxx wants to merge 1 commit into lm-sys:main from crawfordxx:feat-gguf-model-support

Conversation

@crawfordxx

Summary

Add a new model worker (fastchat/serve/gguf_worker.py) that loads quantized GGUF/GGML models via llama-cpp-python, enabling FastChat to serve locally quantized models with reduced memory requirements.

Closes #2410

Changes

  • New GGUFWorker class extending BaseModelWorker, following the same patterns as mlx_worker.py and vllm_worker.py
  • llama-cpp-python is an optional dependency; it is imported only when the GGUF worker is used
  • Supports streaming and non-streaming text generation
  • Supports standard generation parameters: temperature, top_p, top_k, max_new_tokens, repeat_penalty, presence_penalty, frequency_penalty
  • GPU layer offloading via --n-gpu-layers (set to -1 to offload all layers)
  • Configurable context length via --context-len
  • Controller registration and heartbeat
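The generation parameters listed above map onto the keyword arguments that llama-cpp-python's `Llama.create_completion()` accepts. A minimal sketch of that mapping is below; the helper name `build_llama_kwargs` and the default values are illustrative assumptions, not the PR's actual code:

```python
# Illustrative sketch: translate FastChat-style generation parameters into
# the kwargs accepted by llama-cpp-python's Llama.create_completion().
# The helper name and defaults are assumptions, not code from this PR.

def build_llama_kwargs(params: dict) -> dict:
    """Map a FastChat generation request dict to llama-cpp kwargs."""
    return {
        "max_tokens": int(params.get("max_new_tokens", 256)),
        "temperature": float(params.get("temperature", 0.7)),
        "top_p": float(params.get("top_p", 1.0)),
        "top_k": int(params.get("top_k", 40)),
        "repeat_penalty": float(params.get("repeat_penalty", 1.1)),
        "presence_penalty": float(params.get("presence_penalty", 0.0)),
        "frequency_penalty": float(params.get("frequency_penalty", 0.0)),
    }
```

Centralizing the mapping in one place keeps the request-handling code identical for the streaming and non-streaming paths.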

Usage

# Install dependency
pip install llama-cpp-python

# Start the worker
python3 -m fastchat.serve.gguf_worker \
    --model-path /path/to/model.gguf \
    --model-names my-model \
    --controller-address http://localhost:21001 \
    --n-gpu-layers -1

Test plan

  • Install llama-cpp-python and download a small GGUF model
  • Start controller, then start gguf_worker pointing at the model
  • Verify worker registers with controller (/list_models)
  • Test streaming generation via /worker_generate_stream
  • Test non-streaming generation via /worker_generate
  • Verify heartbeat keeps worker registered
  • Test with --n-gpu-layers -1 for GPU offloading
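For the streaming checks above: FastChat workers stream responses as JSON chunks delimited by null bytes (`b"\0"`), so a test client has to buffer and split on that delimiter. A small client-side sketch (the function name and sample payloads are illustrative, not from this PR):

```python
import json

# Sketch of consuming a /worker_generate_stream response body.
# FastChat workers emit JSON objects separated by null bytes (b"\0");
# a chunk boundary may fall mid-object, so we buffer until a delimiter.

def parse_stream(chunks):
    """Yield decoded JSON objects from a null-delimited byte stream."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\0" in buffer:
            part, buffer = buffer.split(b"\0", 1)
            if part:
                yield json.loads(part.decode("utf-8"))

# Example: a message split across two HTTP chunks is reassembled correctly.
raw = [b'{"text": "Hel', b'lo", "error_code": 0}\0']
messages = list(parse_stream(raw))
```

The same parser works for verifying both partial and final chunks, since each streamed object carries the full text generated so far.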
