
feat: add GGUF/GGML model worker using llama-cpp-python #3827

Open
crawfordxx wants to merge 1 commit into lm-sys:main from crawfordxx:feat-gguf-model-support

Conversation

@crawfordxx

Summary

Add a new model worker (fastchat/serve/gguf_worker.py) that loads quantized GGUF/GGML models via llama-cpp-python, enabling FastChat to serve locally quantized models with reduced memory requirements.

Closes #2410

Changes

  • New GGUFWorker class extending BaseModelWorker, following the same patterns as mlx_worker.py and vllm_worker.py
  • llama-cpp-python is an optional dependency; it is imported only when the GGUF worker is used
  • Supports streaming and non-streaming text generation
  • Supports standard generation parameters: temperature, top_p, top_k, max_new_tokens, repeat_penalty, presence_penalty, frequency_penalty
  • GPU layer offloading via --n-gpu-layers (set to -1 to offload all layers)
  • Configurable context length via --context-len
  • Controller registration and heartbeat
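The generation parameters listed above map onto the keyword arguments that llama-cpp-python's `Llama.create_completion()` accepts. A minimal sketch of that mapping is below; the helper name `build_llama_kwargs` and the default values are illustrative assumptions, not the PR's actual code:

```python
# Illustrative sketch: translate FastChat-style generation parameters into
# the kwargs accepted by llama-cpp-python's Llama.create_completion().
# The helper name and defaults are assumptions, not code from this PR.

def build_llama_kwargs(params: dict) -> dict:
    """Map a FastChat generation request dict to llama-cpp kwargs."""
    return {
        "max_tokens": int(params.get("max_new_tokens", 256)),
        "temperature": float(params.get("temperature", 0.7)),
        "top_p": float(params.get("top_p", 1.0)),
        "top_k": int(params.get("top_k", 40)),
        "repeat_penalty": float(params.get("repeat_penalty", 1.1)),
        "presence_penalty": float(params.get("presence_penalty", 0.0)),
        "frequency_penalty": float(params.get("frequency_penalty", 0.0)),
    }
```

Centralizing the mapping in one place keeps the request-handling code identical for the streaming and non-streaming paths.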

Usage

# Install dependency
pip install llama-cpp-python

# Start the worker
python3 -m fastchat.serve.gguf_worker \
    --model-path /path/to/model.gguf \
    --model-names my-model \
    --controller-address http://localhost:21001 \
    --n-gpu-layers -1

Test plan

  • Install llama-cpp-python and download a small GGUF model
  • Start controller, then start gguf_worker pointing at the model
  • Verify worker registers with controller (/list_models)
  • Test streaming generation via /worker_generate_stream
  • Test non-streaming generation via /worker_generate
  • Verify heartbeat keeps worker registered
  • Test with --n-gpu-layers -1 for GPU offloading
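For the streaming checks above: FastChat workers stream responses as JSON chunks delimited by null bytes (`b"\0"`), so a test client has to buffer and split on that delimiter. A small client-side sketch (the function name and sample payloads are illustrative, not from this PR):

```python
import json

# Sketch of consuming a /worker_generate_stream response body.
# FastChat workers emit JSON objects separated by null bytes (b"\0");
# a chunk boundary may fall mid-object, so we buffer until a delimiter.

def parse_stream(chunks):
    """Yield decoded JSON objects from a null-delimited byte stream."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\0" in buffer:
            part, buffer = buffer.split(b"\0", 1)
            if part:
                yield json.loads(part.decode("utf-8"))

# Example: a message split across two HTTP chunks is reassembled correctly.
raw = [b'{"text": "Hel', b'lo", "error_code": 0}\0']
messages = list(parse_stream(raw))
```

The same parser works for verifying both partial and final chunks, since each streamed object carries the full text generated so far.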
