
MLX Continuous Batching

OpenAI-compatible inference server with true continuous batching for Apple Silicon (M1/M2/M3/M4).

Features

  • True Continuous Batching: New requests join the active batch mid-generation
  • Immediate Response: Completed requests are returned instantly without waiting for the batch to finish
  • OpenAI-Compatible API: Drop-in replacement for OpenAI API
  • High Throughput: Up to 150 tok/s aggregate throughput on M3 Ultra with Gemma 27B

Why Continuous Batching?

Traditional static batching waits for ALL requests in a batch to complete before returning any responses. This means short requests wait for long ones.

With continuous batching:

  • Short requests complete in ~3s
  • Long requests complete in ~10s
  • Each request returns as soon as it's done
Static Batching:
  Short req: ████░░░░░░░░░░░░░░░░ waits... ────► returns at 10s
  Long req:  ████████████████████ ────────────► returns at 10s

Continuous Batching:
  Short req: ████ ───► returns at 3s
  Long req:  ████████████████████ ────────────► returns at 10s

Installation

From Source (Recommended)

git clone https://github.com/maxime-dlabai/mlx-continuous-batching.git
cd mlx-continuous-batching
pip install -e .

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • mlx >= 0.21.0
  • mlx-lm >= 0.20.0

Quick Start

Start the Server

# With a local model
mlx-cb-serve --model /path/to/mlx-model --port 8080

# With a HuggingFace model (downloads automatically)
mlx-cb-serve --model mlx-community/gemma-3-27b-it-qat-4bit --port 8080

# With custom batch size (for high-memory systems like M3 Ultra)
mlx-cb-serve --model mlx-community/gemma-3-27b-it-qat-4bit --max-batch-size 128

Use the API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/gemma-3-27b-it-qat-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)

With curl

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

API Endpoints

| Endpoint             | Method | Description                        |
|----------------------|--------|------------------------------------|
| /v1/chat/completions | POST   | OpenAI-compatible chat completions |
| /v1/models           | GET    | List available models              |
| /health              | GET    | Health check with statistics       |

Health Endpoint Response

{
  "status": "ok",
  "queue_size": 5,
  "processing": true,
  "active_requests": 12,
  "stats": {
    "total_requests": 150,
    "total_tokens": 12500,
    "generation_tps": "145.2",
    "peak_memory_gb": "28.50"
  }
}
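
A quick way to watch these statistics under load is to poll the endpoint from Python (a minimal sketch using only the standard library; it assumes the server is running locally on port 8080):

import json
import time
import urllib.request

# Poll /health a few times and print the live statistics reported by the server.
for _ in range(5):
    with urllib.request.urlopen("http://localhost:8080/health") as resp:
        health = json.load(resp)
    stats = health.get("stats", {})
    print(
        f"queue={health.get('queue_size')} "
        f"active={health.get('active_requests')} "
        f"tps={stats.get('generation_tps')} "
        f"peak_mem_gb={stats.get('peak_memory_gb')}"
    )
    time.sleep(2)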

Benchmarks

Tested on M3 Ultra with Gemma-3-27B-4bit:

| Scenario        | Requests | Tokens | Time  | Throughput |
|-----------------|----------|--------|-------|------------|
| 8 x 500 tokens  | 8        | 4,000  | 44.9s | 89 tok/s   |
| 16 x 300 tokens | 16       | 4,800  | 41.6s | 115 tok/s  |
| 32 x 200 tokens | 32       | 6,400  | 48.9s | 131 tok/s  |
| 64 x 150 tokens | 64       | 9,600  | 64.8s | 148 tok/s  |

Continuous Batching Benefit

Mixed workload test (short + long requests):

| Request Type      | Completion Time |
|-------------------|-----------------|
| Short (20 tokens) | 3.6s            |
| Long (200 tokens) | 10.2s           |

Short requests complete 3x faster than with static batching!
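
This is easy to reproduce by sending a short and a long request concurrently and timing each response on its own (a rough sketch using the OpenAI client against a locally running server; the prompt and token counts are illustrative, not the exact benchmark inputs):

import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def timed_request(label, max_tokens):
    # Each request is timed independently, so early completions are visible.
    start = time.time()
    client.chat.completions.create(
        model="mlx-community/gemma-3-27b-it-qat-4bit",
        messages=[{"role": "user", "content": "Write a short story."}],
        max_tokens=max_tokens,
    )
    print(f"{label}: {time.time() - start:.1f}s")

# Submit both at once: the short request should return well before the long one.
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(timed_request, "short (20 tokens)", 20)
    pool.submit(timed_request, "long (200 tokens)", 200)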

CLI Options

usage: mlx-cb-serve [-h] --model MODEL [--host HOST] [--port PORT]
                    [--max-batch-size MAX_BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --model, -m MODEL     Path to MLX model or HuggingFace model ID
  --host HOST           Host to bind to (default: 127.0.0.1)
  --port, -p PORT       Port to listen on (default: 8080)
  --max-batch-size      Maximum batch size (default: 64)

Python API

from mlx_cb import ContinuousBatchingServer
from mlx_cb.server import create_app
from aiohttp import web

# Create server
server = ContinuousBatchingServer(
    model_path="mlx-community/gemma-3-27b-it-qat-4bit",
    max_batch_size=64
)

# Create and run app
app = create_app(server)
web.run_app(app, host="127.0.0.1", port=8080)

How It Works

  1. Request Queue: Incoming requests are added to a thread-safe queue
  2. BatchGenerator: Uses MLX's BatchGenerator with dynamic insertion
  3. Token-by-Token Processing: Each token generation step checks for:
    • Completed requests (released immediately)
    • New requests (added to active batch)
  4. Async Response: Completed results are sent back via asyncio futures
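
In outline, the scheduling loop behaves like the sketch below (hypothetical names, not the project's actual classes or methods):

import queue

def scheduling_loop(request_queue: queue.Queue, batch, max_batch_size: int, eos_token: int):
    """Simplified sketch of the continuous-batching loop; `batch` stands in
    for the underlying batched generator."""
    while True:
        # 1. Drain the request queue and insert new sequences into the active batch.
        while not request_queue.empty() and batch.size() < max_batch_size:
            request = request_queue.get()
            batch.insert(request.prompt_tokens, request)

        # 2. Generate one token for every active sequence.
        for request, token in batch.step():
            request.tokens.append(token)

            # 3. Release finished sequences immediately instead of waiting for the batch.
            if token == eos_token or len(request.tokens) >= request.max_tokens:
                batch.remove(request)
                # 4. Resolve the asyncio future that the HTTP handler is awaiting.
                request.loop.call_soon_threadsafe(request.future.set_result, request.tokens)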

Comparison with Other Solutions

| Feature               | mlx-lm server   | LM Studio       | This project |
|-----------------------|-----------------|-----------------|--------------|
| Batching              | None            | Static          | Continuous   |
| Short request latency | Full batch time | Full batch time | Immediate    |
| Throughput            | ~35 tok/s       | ~35 tok/s       | ~150 tok/s   |
| OpenAI API            | Yes             | Yes             | Yes          |

License

MIT License

Credits

  • Built on MLX
  • Uses mlx-lm for model loading and generation
  • Inspired by vLLM and SGLang continuous batching implementations
