OpenAI-compatible inference server with true continuous batching for Apple Silicon (M1/M2/M3/M4).
- True Continuous Batching: New requests join the active batch mid-generation
- Immediate Response: Completed requests are returned instantly without waiting for the batch to finish
- OpenAI-Compatible API: Drop-in replacement for OpenAI API
- High Throughput: Up to 150 tok/s aggregate throughput on M3 Ultra with Gemma 27B
Traditional static batching waits for ALL requests in a batch to complete before returning any responses. This means short requests wait for long ones.
With continuous batching:
- Short requests complete in ~3s
- Long requests complete in ~10s
- Each request returns as soon as it's done
Static Batching:

```
Short req: ████░░░░░░░░░░░░░░░░ waits... ────► returns at 10s
Long req:  ████████████████████ ────────────► returns at 10s
```

Continuous Batching:

```
Short req: ████ ───► returns at 3s
Long req:  ████████████████████ ────────────► returns at 10s
```
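The difference is easy to observe from a client: send a short and a long request at the same time and note when each returns. Below is a minimal sketch using the OpenAI Python client's `AsyncOpenAI`, assuming a server (see the installation steps below) is already running on `localhost:8080`; the model name, prompts, and token counts are illustrative.

```python
import asyncio
import time

from openai import AsyncOpenAI

# Assumes mlx-cb-serve is already running on localhost:8080.
client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
MODEL = "mlx-community/gemma-3-27b-it-qat-4bit"  # whatever model the server was started with


async def timed_request(label: str, max_tokens: int) -> None:
    start = time.perf_counter()
    await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Tell me a story."}],
        max_tokens=max_tokens,
    )
    print(f"{label}: returned after {time.perf_counter() - start:.1f}s")


async def main() -> None:
    # Both requests run in the same batch; the short one returns as soon as
    # it hits its token limit instead of waiting for the long one.
    await asyncio.gather(
        timed_request("short (20 tokens)", 20),
        timed_request("long (200 tokens)", 200),
    )


asyncio.run(main())
```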
Install from source:

```bash
git clone https://github.com/maxime-dlabai/mlx-continuous-batching.git
cd mlx-continuous-batching
pip install -e .
```

Requirements:

- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- mlx >= 0.21.0
- mlx-lm >= 0.20.0
Start the server:

```bash
# With a local model
mlx-cb-serve --model /path/to/mlx-model --port 8080

# With a HuggingFace model (downloads automatically)
mlx-cb-serve --model mlx-community/gemma-3-27b-it-qat-4bit --port 8080

# With a custom batch size (for high-memory systems like M3 Ultra)
mlx-cb-serve --model mlx-community/gemma-3-27b-it-qat-4bit --max-batch-size 128
```

Query it with the OpenAI Python client:

```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/gemma-3-27b-it-qat-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)
```

Or with `curl`:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/v1/models` | GET | List available models |
| `/health` | GET | Health check with statistics |
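Both GET endpoints can be hit with any HTTP client. A minimal sketch using `requests` against a server on `localhost:8080`; the `data`/`id` shape of the `/v1/models` response is an assumption based on the usual OpenAI convention:

```python
import requests

BASE = "http://localhost:8080"

# List the model(s) the server was started with.
# (The OpenAI-style {"data": [{"id": ...}]} shape is assumed here.)
models = requests.get(f"{BASE}/v1/models", timeout=5).json()
print([m["id"] for m in models.get("data", [])])

# Health check with batching statistics (see the example response below).
health = requests.get(f"{BASE}/health", timeout=5).json()
print(health["status"], health["stats"]["generation_tps"])
```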
Example `/health` response:

```json
{
"status": "ok",
"queue_size": 5,
"processing": true,
"active_requests": 12,
"stats": {
"total_requests": 150,
"total_tokens": 12500,
"generation_tps": "145.2",
"peak_memory_gb": "28.50"
}
}
```

Tested on an M3 Ultra with Gemma-3-27B-4bit:
| Scenario | Requests | Tokens | Time | Throughput |
|---|---|---|---|---|
| 8 x 500 tokens | 8 | 4,000 | 44.9s | 89 tok/s |
| 16 x 300 tokens | 16 | 4,800 | 41.6s | 115 tok/s |
| 32 x 200 tokens | 32 | 6,400 | 48.9s | 131 tok/s |
| 64 x 150 tokens | 64 | 9,600 | 64.8s | 148 tok/s |
Mixed workload test (short + long requests):
| Request Type | Completion Time |
|---|---|
| Short (20 tokens) | 3.6s |
| Long (200 tokens) | 10.2s |
Short requests complete 3x faster than with static batching!
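A rough way to reproduce this kind of number is to fire a batch of concurrent requests and divide the generated tokens by the wall-clock time. The sketch below is illustrative, not the script used for the tables above; it assumes the server is running locally and reports `usage.completion_tokens` in its responses.

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
MODEL = "mlx-community/gemma-3-27b-it-qat-4bit"

N_REQUESTS = 16
MAX_TOKENS = 300


async def one_request() -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write a short essay about the ocean."}],
        max_tokens=MAX_TOKENS,
    )
    # Assumes the server fills in usage; fall back to the requested cap otherwise.
    return resp.usage.completion_tokens if resp.usage else MAX_TOKENS


async def main() -> None:
    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one_request() for _ in range(N_REQUESTS)))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.0f} tok/s aggregate")


asyncio.run(main())
```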
CLI options:

```
usage: mlx-cb-serve [-h] --model MODEL [--host HOST] [--port PORT]
                    [--max-batch-size MAX_BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --model, -m MODEL     Path to MLX model or HuggingFace model ID
  --host HOST           Host to bind to (default: 127.0.0.1)
  --port, -p PORT       Port to listen on (default: 8080)
  --max-batch-size      Maximum batch size (default: 64)
```
The server can also be embedded programmatically:

```python
from mlx_cb import ContinuousBatchingServer
from mlx_cb.server import create_app
from aiohttp import web
# Create server
server = ContinuousBatchingServer(
    model_path="mlx-community/gemma-3-27b-it-qat-4bit",
    max_batch_size=64
)
# Create and run app
app = create_app(server)
web.run_app(app, host="127.0.0.1", port=8080)
```

How it works:

- Request Queue: Incoming requests are added to a thread-safe queue
- BatchGenerator: Uses MLX's `BatchGenerator` with dynamic insertion
- Token-by-Token Processing: Each token generation step checks for:
  - Completed requests (released immediately)
  - New requests (added to the active batch)
- Async Response: Completed results are sent back via asyncio futures
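Conceptually, the loop described above looks roughly like the sketch below. This is illustrative only: `batch_gen`, `insert()`, and `step()` stand in for the real `BatchGenerator` internals, and the queue/future plumbing is simplified.

```python
# Conceptual sketch of the continuous-batching loop -- NOT the actual code.
# `batch_gen` stands in for MLX's BatchGenerator; insert() and step() are
# illustrative method names, and the queue/future plumbing is simplified.
def generation_loop(batch_gen, request_queue, max_batch_size=64):
    active = {}  # uid -> asyncio.Future the HTTP handler is awaiting

    while True:
        # New requests join the batch between token steps, up to the cap.
        while len(active) < max_batch_size and not request_queue.empty():
            req = request_queue.get_nowait()
            active[batch_gen.insert(req.prompt_tokens)] = req.future

        if not active:
            continue  # idle: nothing queued, nothing in flight

        # One decoding step produces one token for every active request.
        for uid, result in batch_gen.step().items():
            if result.finished:
                # Release the finished request immediately; the rest keep going.
                fut = active.pop(uid)
                fut.get_loop().call_soon_threadsafe(fut.set_result, result.text)
```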
| Feature | mlx-lm server | LM Studio | This project |
|---|---|---|---|
| Batching | None | Static | Continuous |
| Short request latency | Full batch time | Full batch time | Immediate |
| Throughput | ~35 tok/s | ~35 tok/s | ~150 tok/s |
| OpenAI API | Yes | Yes | Yes |
MIT License