
MLX Continuous Batching

OpenAI-compatible inference server with true continuous batching for Apple Silicon (M1/M2/M3/M4).

Features

  • True Continuous Batching: New requests join the active batch mid-generation
  • Immediate Response: Completed requests are returned instantly without waiting for the batch to finish
  • OpenAI-Compatible API: Drop-in replacement for OpenAI API
  • High Throughput: Up to 150 tok/s aggregate throughput on M3 Ultra with Gemma 27B

Why Continuous Batching?

Traditional static batching waits for ALL requests in a batch to complete before returning any responses. This means short requests wait for long ones.

With continuous batching:

  • Short requests complete in ~3s
  • Long requests complete in ~10s
  • Each request returns as soon as it's done
Static Batching:
  Short req: ████░░░░░░░░░░░░░░░░ waits... ────► returns at 10s
  Long req:  ████████████████████ ────────────► returns at 10s

Continuous Batching:
  Short req: ████ ───► returns at 3s
  Long req:  ████████████████████ ────────────► returns at 10s

Installation

From Source (Recommended)

git clone https://github.com/maxime-dlabai/mlx-continuous-batching.git
cd mlx-continuous-batching
pip install -e .

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • mlx >= 0.21.0
  • mlx-lm >= 0.20.0

Quick Start

Start the Server

# With a local model
mlx-cb-serve --model /path/to/mlx-model --port 8080

# With a HuggingFace model (downloads automatically)
mlx-cb-serve --model mlx-community/gemma-3-27b-it-qat-4bit --port 8080

# With custom batch size (for high-memory systems like M3 Ultra)
mlx-cb-serve --model mlx-community/gemma-3-27b-it-qat-4bit --max-batch-size 128

Use the API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/gemma-3-27b-it-qat-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response.choices[0].message.content)

With curl

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

API Endpoints

| Endpoint             | Method | Description                        |
|----------------------|--------|------------------------------------|
| /v1/chat/completions | POST   | OpenAI-compatible chat completions |
| /v1/models           | GET    | List available models              |
| /health              | GET    | Health check with statistics       |

Health Endpoint Response

{
  "status": "ok",
  "queue_size": 5,
  "processing": true,
  "active_requests": 12,
  "stats": {
    "total_requests": 150,
    "total_tokens": 12500,
    "generation_tps": "145.2",
    "peak_memory_gb": "28.50"
  }
}
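
A quick way to watch these statistics under load is to poll the endpoint from Python (a minimal sketch using only the standard library; it assumes the server is running locally on port 8080):

import json
import time
import urllib.request

# Poll /health a few times and print the live statistics reported by the server.
for _ in range(5):
    with urllib.request.urlopen("http://localhost:8080/health") as resp:
        health = json.load(resp)
    stats = health.get("stats", {})
    print(
        f"queue={health.get('queue_size')} "
        f"active={health.get('active_requests')} "
        f"tps={stats.get('generation_tps')} "
        f"peak_mem_gb={stats.get('peak_memory_gb')}"
    )
    time.sleep(2)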

Benchmarks

Tested on M3 Ultra with Gemma-3-27B-4bit:

| Scenario        | Requests | Tokens | Time  | Throughput |
|-----------------|----------|--------|-------|------------|
| 8 x 500 tokens  | 8        | 4,000  | 44.9s | 89 tok/s   |
| 16 x 300 tokens | 16       | 4,800  | 41.6s | 115 tok/s  |
| 32 x 200 tokens | 32       | 6,400  | 48.9s | 131 tok/s  |
| 64 x 150 tokens | 64       | 9,600  | 64.8s | 148 tok/s  |

Continuous Batching Benefit

Mixed workload test (short + long requests):

| Request Type      | Completion Time |
|-------------------|-----------------|
| Short (20 tokens) | 3.6s            |
| Long (200 tokens) | 10.2s           |

Short requests complete 3x faster than with static batching!
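
This is easy to reproduce by sending a short and a long request concurrently and timing each response on its own (a rough sketch using the OpenAI client against a locally running server; the prompt and token counts are illustrative, not the exact benchmark inputs):

import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def timed_request(label, max_tokens):
    # Each request is timed independently, so early completions are visible.
    start = time.time()
    client.chat.completions.create(
        model="mlx-community/gemma-3-27b-it-qat-4bit",
        messages=[{"role": "user", "content": "Write a short story."}],
        max_tokens=max_tokens,
    )
    print(f"{label}: {time.time() - start:.1f}s")

# Submit both at once: the short request should return well before the long one.
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(timed_request, "short (20 tokens)", 20)
    pool.submit(timed_request, "long (200 tokens)", 200)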

CLI Options

usage: mlx-cb-serve [-h] --model MODEL [--host HOST] [--port PORT]
                    [--max-batch-size MAX_BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --model, -m MODEL     Path to MLX model or HuggingFace model ID
  --host HOST           Host to bind to (default: 127.0.0.1)
  --port, -p PORT       Port to listen on (default: 8080)
  --max-batch-size      Maximum batch size (default: 64)

Python API

from mlx_cb import ContinuousBatchingServer
from mlx_cb.server import create_app
from aiohttp import web

# Create server
server = ContinuousBatchingServer(
    model_path="mlx-community/gemma-3-27b-it-qat-4bit",
    max_batch_size=64
)

# Create and run app
app = create_app(server)
web.run_app(app, host="127.0.0.1", port=8080)

How It Works

  1. Request Queue: Incoming requests are added to a thread-safe queue
  2. BatchGenerator: Uses MLX's BatchGenerator with dynamic insertion
  3. Token-by-Token Processing: Each token generation step checks for:
    • Completed requests (released immediately)
    • New requests (added to active batch)
  4. Async Response: Completed results are sent back via asyncio futures
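
In outline, the scheduling loop behaves like the sketch below (hypothetical names, not the project's actual classes or methods):

import queue

def scheduling_loop(request_queue: queue.Queue, batch, max_batch_size: int, eos_token: int):
    """Simplified sketch of the continuous-batching loop; `batch` stands in
    for the underlying batched generator."""
    while True:
        # 1. Drain the request queue and insert new sequences into the active batch.
        while not request_queue.empty() and batch.size() < max_batch_size:
            request = request_queue.get()
            batch.insert(request.prompt_tokens, request)

        # 2. Generate one token for every active sequence.
        for request, token in batch.step():
            request.tokens.append(token)

            # 3. Release finished sequences immediately instead of waiting for the batch.
            if token == eos_token or len(request.tokens) >= request.max_tokens:
                batch.remove(request)
                # 4. Resolve the asyncio future that the HTTP handler is awaiting.
                request.loop.call_soon_threadsafe(request.future.set_result, request.tokens)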

Comparison with Other Solutions

| Feature               | mlx-lm server   | LM Studio       | This project |
|-----------------------|-----------------|-----------------|--------------|
| Batching              | None            | Static          | Continuous   |
| Short request latency | Full batch time | Full batch time | Immediate    |
| Throughput            | ~35 tok/s       | ~35 tok/s       | ~150 tok/s   |
| OpenAI API            | Yes             | Yes             | Yes          |

License

MIT License

Credits

  • Built on MLX
  • Uses mlx-lm for model loading and generation
  • Inspired by vLLM and SGLang continuous batching implementations
