LLM-Router-Server-Dashboard is a solution for large language model (LLM) deployment and management, providing an intuitive web interface to manage, monitor, and operate multiple LLM model instances.
This project combines a routing server (LLM-Router-Server) with an easy-to-use management interface, enabling you to:
- Visual Management: Easily manage multiple models through a web interface
- Dynamic Control: Start and stop models in real-time without service restarts
- Real-time Monitoring: Monitor model status, GPU utilization, and system information
- Configuration Management: Flexibly manage model parameters through YAML configuration files
### Multi-Model Management
- Support for managing multiple LLM models simultaneously (based on vLLM)
- Support for Embedding and Reranking models
- Independent model lifecycle management (start/stop)
- Automatically selects the least-loaded instance based on real-time metrics (running requests, waiting requests, KV cache usage)
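The least-loaded selection described above can be sketched as follows. This is a simplified illustration, not the project's actual routing code: the metric names mirror vLLM-style load metrics, and both the function name and the scoring weights are hypothetical.

```python
def pick_instance(instances):
    """Pick the least-loaded instance from a list of metric dicts.

    Each dict is assumed to carry vLLM-style load metrics: running
    requests, waiting requests, and KV cache usage (0.0-1.0).
    The scoring weights here are illustrative only.
    """
    def load_score(inst):
        return (inst["num_requests_running"]
                + inst["num_requests_waiting"]
                + inst["kv_cache_usage"] * 10)  # weight cache pressure

    return min(instances, key=load_score)


instances = [
    {"id": "qwen3",   "num_requests_running": 4, "num_requests_waiting": 2, "kv_cache_usage": 0.8},
    {"id": "qwen3-2", "num_requests_running": 1, "num_requests_waiting": 0, "kv_cache_usage": 0.2},
]
print(pick_instance(instances)["id"])  # qwen3-2, the instance with the lower load
```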
### Visual Control Panel
- Real-time display of model running status
- GPU resource monitoring
- System resource usage statistics
- Model configuration viewing and editing
### Resource Management
- GPU device allocation and management
- Memory usage monitoring
- Multi-GPU parallel support (Tensor Parallel)
### System Requirements
- GPU: NVIDIA GPU (CUDA 12.1+ recommended)
- Memory: 16GB+ RAM (depending on model size)
- Disk: 50GB+ available space
Build and start the frontend with Docker:

```bash
cd frontend/docker
docker build -t llm-router-server-dashboard .
docker-compose -f docker-compose.yaml up -d
```

Or run it locally in development mode:

```bash
cd frontend
npm install
npm run dev
```

For a production build:

```bash
cd frontend
npm install
npm run build
```

Edit `frontend/.env.local`:

```bash
VITE_API_BASE_URL=http://localhost:5000
VITE_MODEL_CONTROL_PASSWORD=123
```

Edit `frontend/vite.config.js`:

```js
import { defineConfig } from 'vite'

export default defineConfig({
  server: {
    host: '0.0.0.0', // Allow external access
    port: 5111       // Custom port
  }
})
```

Important Note: The backend needs to monitor LLM model status (process management), so it must run in the same container as LLM-Router-Server.
```bash
cd LLM-Router-Server/docker
docker build -t cuda121-cudnn8-python311 .
docker-compose -f docker-compose.yaml up -d
```

Ensure `docker-compose.yaml` exposes the necessary ports:

- 8887: LLM-Router-Server API
- 5000: Dashboard Backend API
- Other model ports (e.g., 8002, 8003, etc.)
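A `ports` section matching the list above might look like this. This is a sketch only: the service name and image tag are assumptions, and the model ports should match your own configuration.

```yaml
services:
  llm-router-server:
    image: cuda121-cudnn8-python311
    ports:
      - "8887:8887"   # LLM-Router-Server API
      - "5000:5000"   # Dashboard backend API
      - "8002:8002"   # model instance ports
      - "8003:8003"
```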
```bash
# Enter the container
docker exec -it <container_id> bash

# Start backend
cd /app/backend
pip install -r requirements.txt
uvicorn main:app --reload --host 0.0.0.0 --port 5000
```

For installation and startup details, refer to the LLM-Router-Server Startup Guide.
```bash
cd /app/LLM-Router-Server
pip install -r requirements.txt
sh scripts/start_all.sh /app/backend/config.yaml ./configs/gunicorn.conf.py
```

Note: Use `/app/backend/config.yaml` as the unified configuration file to ensure consistency between frontend and backend.
```bash
# Check router server
curl http://localhost:8887/health

# Check backend API
curl http://localhost:5000/api/status
```

The configuration file is located at `backend/config.yaml` and controls all model startup parameters.
```yaml
# Router server configuration
server:
  host: "0.0.0.0"
  port: 8887
  uvicorn_log_level: "info"

# LLM model configuration
LLM_engines:
  Qwen3-0.6B:
    instances:
      - id: "qwen3"
        host: "localhost"
        port: 8002
        cuda_device: 0
      - id: "qwen3-2"
        host: "localhost"
        port: 8004
        cuda_device: 0
    model_config:
      model_tag: "Qwen/Qwen3-0.6B"
      dtype: "float16"
      max_model_len: 500
      gpu_memory_utilization: 0.35
      tensor_parallel_size: 1

# Embedding server configuration (optional)
embedding_server:
  host: "localhost"
  port: 8005
  cuda_device: 1
  embedding_models:
    m3e-base:
      model_name: "moka-ai/m3e-base"
      model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
      tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
      max_length: 512
      use_gpu: true
      use_float16: true
  reranking_models:
    bge-reranker-large:
      model_name: "BAAI/bge-reranker-large"
      model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
      tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
      max_length: 512
      use_gpu: true
      use_float16: true
```

| Parameter | Description | Recommended Value |
|---|---|---|
| `gpu_memory_utilization` | GPU memory usage ratio | 0.6-0.9 |
| `max_model_len` | Maximum context length | Based on model capability |
| `tensor_parallel_size` | Multi-GPU parallelism count | Number of GPUs |
| `dtype` | Inference precision | `float16` (faster) / `bfloat16` (more stable) |
| `cuda_device` | GPU device number | 0, 1, 2... |
Design Limitation: The current version requires starting models one at a time, to ensure:

- Proper GPU resource allocation
- No memory overflow during startup
- Stable process management

Future versions will add optimized support for parallel startup.
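The one-at-a-time constraint can be illustrated with a lock that serializes startups. This is a hypothetical sketch, not the project's process-management code: the real backend spawns vLLM processes and waits for their health endpoints, whereas this stub only simulates that work.

```python
import threading
import time

# Serialize model startups so only one model initializes
# (and claims GPU memory) at a time.
_startup_lock = threading.Lock()
started = []

def start_model(name):
    with _startup_lock:  # only one startup in flight at once
        # Real code would spawn the model process here and
        # poll its /health endpoint until it is ready.
        time.sleep(0.01)
        started.append(name)

threads = [threading.Thread(target=start_model, args=(n,))
           for n in ("Qwen3-0.6B", "m3e-base")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(started)  # both models started, one at a time
```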

