feat: add fastllm worker for high-performance inference #3828

Open
crawfordxx wants to merge 1 commit into lm-sys:main from crawfordxx:feat-fastllm-worker

Conversation

@crawfordxx

Summary

Add a new model worker backend based on fastllm, addressing #2521.

  • New file: fastchat/serve/fastllm_worker.py — follows the BaseModelWorker pattern (consistent with vllm_worker.py, mlx_worker.py, etc.)
  • Streaming generation with support for temperature, top_p, top_k, repetition_penalty, max_new_tokens, and stop strings
  • Quantization: --dtype flag supports float16, float32, int8, int4
  • CPU threading: --threads flag for CPU-bound inference
  • Optional dependency: fastllm is imported only at runtime; clear error message if not installed
  • Documentation: docs/fastllm_integration.md with installation and usage instructions
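The "optional dependency" bullet above can be sketched as a lazy runtime import. This is a minimal illustration of the pattern, not the PR's actual code: the `fastllm_pytools` import path and the `set_cpu_threads`/`model` calls are assumptions based on the fastllm README, and `load_fastllm_model` is a hypothetical helper name.

```python
def load_fastllm_model(model_path: str, dtype: str = "float16", threads: int = 4):
    """Load a model via fastllm, importing it only at call time.

    Keeping the import inside the function means fastchat still starts
    for users who never launch this worker; they only see an error if
    they actually try to use it without fastllm installed.
    """
    try:
        # Assumed binding module; fastllm's Python package per its README.
        from fastllm_pytools import llm
    except ImportError as e:
        raise ImportError(
            "fastllm is not installed. Build and install it from "
            "https://github.com/ztxz16/fastllm to use fastllm_worker."
        ) from e
    llm.set_cpu_threads(threads)               # assumed fastllm API
    return llm.model(model_path, dtype=dtype)  # assumed loader call
```

If fastllm is absent, the worker fails fast with an actionable message instead of a bare `ModuleNotFoundError` deep in the stack.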

Usage

```shell
# Launch controller
python3 -m fastchat.serve.controller

# Launch fastllm worker
python3 -m fastchat.serve.fastllm_worker \
    --model-path chatglm2-6b \
    --dtype int8 \
    --threads 8
```
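The flags used above could be wired up roughly as follows. This is a sketch with `argparse` only; the `--model-path`, `--dtype`, and `--threads` flags come from the PR, while the defaults and help strings are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the worker's CLI surface; defaults here are assumptions.
    parser = argparse.ArgumentParser(description="fastllm model worker")
    parser.add_argument("--model-path", type=str, required=True,
                        help="HuggingFace model directory or .flm file")
    parser.add_argument("--dtype", type=str, default="float16",
                        choices=["float16", "float32", "int8", "int4"],
                        help="weight precision / quantization level")
    parser.add_argument("--threads", type=int, default=4,
                        help="CPU threads for CPU-bound inference")
    return parser

args = build_parser().parse_args(
    ["--model-path", "chatglm2-6b", "--dtype", "int8", "--threads", "8"]
)
# args.dtype == "int8", args.threads == 8
```

Restricting `--dtype` via `choices` makes an unsupported quantization level fail at argument parsing rather than during model load.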

Test Plan

  • Verify worker starts and registers with controller
  • Test streaming generation via /worker_generate_stream
  • Test non-streaming generation via /worker_generate
  • Test with different --dtype options (float16, int8, int4)
  • Test stop string handling
  • Verify heartbeat registration with controller
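For the streaming test item above: existing FastChat workers stream `/worker_generate_stream` responses as JSON chunks delimited by null bytes, so a test could reuse a small parser like this. The delimiter convention is taken from the existing workers (e.g. `vllm_worker.py`), not from this PR's diff, and `parse_worker_stream` is a hypothetical helper.

```python
import json

def parse_worker_stream(raw: bytes) -> list[dict]:
    """Split a /worker_generate_stream response body into JSON chunks.

    Existing FastChat workers delimit streamed JSON objects with b"\\0";
    this assumes fastllm_worker follows the same convention.
    """
    return [json.loads(part) for part in raw.split(b"\0") if part]

# Example: two cumulative chunks as they would arrive over HTTP.
raw = (json.dumps({"text": "Hel", "error_code": 0}).encode() + b"\0"
       + json.dumps({"text": "Hello", "error_code": 0}).encode() + b"\0")
chunks = parse_worker_stream(raw)
# chunks[-1]["text"] == "Hello"
```

Asserting on the last chunk checks the final completed text, since each chunk carries the cumulative output so far.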

Closes #2521

Commit message

Add a new model worker backend based on fastllm (https://github.com/ztxz16/fastllm), a high-performance LLM inference engine with strong CPU acceleration.

- Add fastchat/serve/fastllm_worker.py following BaseModelWorker pattern
- Support streaming generation with temperature, top_p, top_k, repeat_penalty
- Support HuggingFace and .flm model formats
- Support int4/int8/float16/float32 quantization via --dtype flag
- fastllm is an optional dependency
- Add docs/fastllm_integration.md with setup instructions

Closes lm-sys#2521


Development

Successfully merging this pull request may close these issues.

Please support fastllm
