A Go project for serving Large Language Models locally using LLaMA.cpp with a gRPC interface.
- gRPC Interface: Clean API for model loading and text generation
- Streaming Support: Real-time text generation with streaming responses
- Model Management: Automatic loading and caching of GGUF models
- Cross-platform: Windows, Linux, and macOS support
- GPU Acceleration: CUDA (Windows/Linux) and Metal (macOS) support
- Docker Support: Ready-to-use Docker images for containerized deployment
- CI/CD: GitHub Actions workflow with automated testing
# Full build: download binaries + build all Go executables
make all
# Run the client test: it connects to a gRPC server (starting one if needed) and sends a test request with the specified model
make run-grpcclienttest MODEL_PATH=/path/to/your/model.gguf

# Run integration test with your model
make docker-integration-test MODEL_PATH=/path/to/your/model.gguf
# Or run CI-style test (downloads a small test model automatically)
make docker-integration-test-ci

Prerequisites:

- Go 1.22 or later
- Make (GNU Make)
- GCC/MinGW - C compiler for CGO
  - Windows: MinGW-w64 via MSYS2 (includes `gendef`/`dlltool` for import libraries)
  - Linux: `build-essential` package
  - macOS: Xcode command line tools
- Docker with Docker Compose v2
- No other dependencies required
Install MSYS2 and required tools:
# Install MSYS2 from https://www.msys2.org/
# Then in MSYS2 terminal:
pacman -S mingw-w64-x86_64-toolchain mingw-w64-x86_64-tools-git
# Add to PATH: C:\msys64\mingw64\bin

Install Homebrew and required dependencies:
# Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install libomp (required for OpenMP support in llama.cpp)
brew install libomp
# Install Go (if not already installed)
brew install go

| Target | Description |
|---|---|
| `make all` | Full build: download binaries + build all Go executables |
| `make prepare` | Download llama.cpp binaries + generate import libraries (Windows) |
| `make build` | Build all Go executables (assumes `prepare` was run) |
| `make clean` | Remove all build artifacts |
| `make help` | Show all available targets |
make build-grpcserver # Build gRPC server
make build-grpcclienttest # Build gRPC client test
make build-inferencetest1 # Build inference test 1
make build-inferencetest2   # Build inference test 2

# Start gRPC server (default port 50052)
make run-grpcserver
# Start server on custom port
make run-grpcserver GRPC_PORT=50053
# Run tests (MODEL_PATH required)
make run-inferencetest1 MODEL_PATH=/path/to/model.gguf
make run-inferencetest2 MODEL_PATH=/path/to/model.gguf
make run-grpcclienttest SERVER_PATH='' ATTACH_GRPC_PORT=50053 MODEL_PATH=/path/to/model.gguf # send request to an already running gRPC server
make run-grpcclienttest MODEL_PATH=/path/to/model.gguf # run the gRPC server on a dynamic port, then send a request to it

| Target | Description |
|---|---|
| `make docker-build` | Build all Docker images (server + client) |
| `make docker-build-server` | Build gRPC server Docker image |
| `make docker-build-client` | Build client test Docker image |
| `make docker-integration-test MODEL_PATH=<path>` | Run integration test with a local model |
| `make docker-integration-test-ci` | Run integration test (downloads a test model) |
| `make docker-clean` | Remove Docker images and volumes |
| Variable | Default | Description |
|---|---|---|
| `LLAMA_VERSION` | `b6770` | llama.cpp release version to download |
| `GRPC_PORT` | `50052` | gRPC server port |
| `MODEL_PATH` | (none) | Path to GGUF model file (required for tests) |
| `ATTACH_GRPC_PORT` | (none) | Port of an already running gRPC server, for grpcclienttest |
| `GRPC_SERVER_PATH` | (auto) | Path of the gRPC server executable to run, for grpcclienttest |
| `IMAGE_TAG` | `latest` | Docker image tag |
# Download specific version
make LLAMA_VERSION=b6800 prepare
# Full build with specific version
make LLAMA_VERSION=b6800 all
# Docker build with specific version
make docker-build LLAMA_VERSION=b6800

.
├── Makefile                      # Unified build system
├── build/
│   └── llama-binaries/           # Downloaded binaries (auto-created)
│       ├── bin/                  # llama.cpp executables
│       ├── lib/                  # DLLs/shared libraries + import libs
│       └── include/              # Header files
├── api/
│   └── proto/                    # Protocol buffer definitions
├── cmd/
│   ├── grpcserver/               # gRPC server application
│   ├── grpcclienttest/           # gRPC client test
│   ├── inferencetest1/           # Direct inference test 1
│   └── inferencetest2/           # Direct inference test 2
├── docker/
│   ├── Dockerfile.server         # gRPC server Docker image
│   ├── Dockerfile.client         # Client test Docker image
│   ├── docker-compose.yml        # Local integration testing
│   └── docker-compose.ci.yml     # CI integration testing
├── scripts/
│   ├── integration-test.sh       # Integration test runner (Linux/macOS)
│   └── integration-test.ps1      # Integration test runner (Windows)
├── internal/
│   ├── bindings/                 # CGO bindings to llama.cpp
│   ├── grpcserver/               # gRPC server implementation
│   ├── logging/                  # Logging utilities
│   └── modelmanagement/          # Model loading and caching
└── .github/
    └── workflows/
        └── ci.yml                # GitHub Actions CI workflow
# Start server (binds to 127.0.0.1 by default)
./cmd/grpcserver/grpcserver --port 50052
# Start server binding to all interfaces (for Docker/remote access)
./cmd/grpcserver/grpcserver --host 0.0.0.0 --port 50052
# Or via make
make run-grpcserver GRPC_PORT=50052

| Option | Default | Description |
|---|---|---|
| `--host` | `127.0.0.1` | Host address to bind (use `0.0.0.0` for Docker) |
| `--port` | `50051` | Port to listen on for gRPC connections |
| `--ngpu` | `99` | Number of GPU layers to offload |
| `--mmap` | `false` | Use memory-mapped I/O for model loading |
# Run client test against running server
./cmd/grpcclienttest/grpcclienttest --host 127.0.0.1 --port 50052 --model /path/to/model.gguf
# Or via make
make run-grpcclienttest MODEL_PATH=/path/to/model.gguf

| Option | Default | Description |
|---|---|---|
| `--host` | `127.0.0.1` | Server host address to connect to |
| `--port` | (none) | Server port to connect to |
| `--server` | (none) | Path to server executable (starts server automatically) |
| `--model` | (none) | Path to GGUF model file (required) |
| `--temperature` | `0.7` | Sampling temperature |
| `--top-p` | `1.0` | Top-p (nucleus) sampling |
| `--top-k` | `0` | Top-k sampling (0 = disabled) |
| `--max-tokens` | `100` | Maximum tokens to generate |
| `--test-mode` | `baseline` | Test mode: baseline, greedy, seeded, stress |
| `--seed` | `-1` | Random seed (-1 = random) |
- Format: GGUF models (e.g., `model.gguf`)
- Quantization: Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 supported
- Sources: Hugging Face
# Build both server and client images
make docker-build
# Or build individually
docker build -f docker/Dockerfile.server -t llamacpp-server .
docker build -f docker/Dockerfile.client -t llamacpp-client .

# With a local model file
make docker-integration-test MODEL_PATH=/path/to/model.gguf
# CI mode (downloads SmolLM2-135M test model automatically)
make docker-integration-test-ci
# Using scripts directly
./scripts/integration-test.sh --model /path/to/model.gguf
./scripts/integration-test.sh --ci
./scripts/integration-test.sh --model-url https://example.com/model.gguf
# Windows PowerShell
.\scripts\integration-test.ps1 -Model C:\path\to\model.gguf
.\scripts\integration-test.ps1 -CI

| Option | Description |
|---|---|
| `--model PATH` | Path to local GGUF model file |
| `--model-url URL` | URL to download the model from |
| `--ci` | Use CI defaults (downloads SmolLM2-135M) |
| `--test-mode MODE` | Test mode: baseline, greedy, seeded, stress |
| `--no-cleanup` | Don't remove containers after test |
| `--build` | Force rebuild Docker images |
| `--verbose` | Show verbose output |
# Run server container with mounted model
docker run -p 50051:50051 \
-v /path/to/model.gguf:/models/model.gguf:ro \
llamacpp-server
# Or using docker-compose
MODEL_PATH=/path/to/model.gguf docker compose -f docker/docker-compose.yml up server-only

The gRPC server Docker image can be used as a dependency in other projects:
# In your project's docker-compose.yml
services:
  llm-server:
    image: llamacpp-server:latest
    ports:
      - "50051:50051"
    volumes:
      - ${MODEL_PATH}:/models/model.gguf:ro

  your-service:
    build: .
    depends_on:
      llm-server:
        condition: service_healthy
    environment:
      - LLM_SERVER_HOST=llm-server
      - LLM_SERVER_PORT=50051
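If `your-service` is also written in Go, it can pick up the injected variables and create a connection along these lines; a minimal sketch, assuming a plaintext (insecure) gRPC connection:

```go
package main

import (
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Values injected by the compose file above.
	host := os.Getenv("LLM_SERVER_HOST") // "llm-server" on the compose network
	port := os.Getenv("LLM_SERVER_PORT") // "50051"
	if host == "" || port == "" {
		log.Fatal("LLM_SERVER_HOST and LLM_SERVER_PORT must be set")
	}

	conn, err := grpc.NewClient(net.JoinHostPort(host, port),
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("create LLM server client: %v", err)
	}
	defer conn.Close()

	// The connection is established lazily; wrap conn with the client
	// generated from api/proto/llmserver.proto to make calls.
	log.Printf("LLM server target: %s", net.JoinHostPort(host, port))
}
```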
The gRPC server implements the interface defined in `api/proto/llmserver.proto`:

Loads a model from the specified path with progress updates.
rpc LoadModel(LoadModelRequest) returns (stream LoadModelResponse);

Generates text based on an input prompt, with streaming support.
rpc Predict(PredictRequest) returns (stream PredictResponse);

Health check endpoint.
rpc Ping(PingRequest) returns (PingResponse);
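For orientation, here is a sketch of a Go client driving all three RPCs. The generated package import path, the `NewLLMServerClient` constructor name, and the request field names (`ModelPath`, `Prompt`) are assumptions; check the code generated from `api/proto/llmserver.proto` for the exact identifiers.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/yourmodule/api/proto" // hypothetical import path of the generated stubs
)

func main() {
	conn, err := grpc.NewClient("127.0.0.1:50052",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	client := pb.NewLLMServerClient(conn) // constructor name assumed from the service definition
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Health check.
	if _, err := client.Ping(ctx, &pb.PingRequest{}); err != nil {
		log.Fatalf("ping: %v", err)
	}

	// LoadModel streams progress updates until the model is ready.
	loadStream, err := client.LoadModel(ctx, &pb.LoadModelRequest{ModelPath: "/models/model.gguf"}) // field name assumed
	if err != nil {
		log.Fatalf("load model: %v", err)
	}
	for {
		progress, err := loadStream.Recv()
		if err == io.EOF {
			break // model fully loaded
		}
		if err != nil {
			log.Fatalf("load progress: %v", err)
		}
		_ = progress // inspect progress fields here if desired
	}

	// Predict streams generated text as it is produced.
	predStream, err := client.Predict(ctx, &pb.PredictRequest{Prompt: "Hello"}) // field name assumed; add sampling options per the proto
	if err != nil {
		log.Fatalf("predict: %v", err)
	}
	for {
		chunk, err := predStream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatalf("recv: %v", err)
		}
		fmt.Printf("%v", chunk) // the exact text field depends on the generated PredictResponse
	}
}
```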
The `make prepare` target:

- Downloads binaries: Official pre-built llama.cpp release for your platform
- Downloads headers: Matching source code to extract header files
- Organizes files: Creates the `bin/`, `lib/`, `include/` structure
- Generates import libraries (Windows only): Creates `.dll.a` files from DLLs using `gendef`/`dlltool`
The Dockerfiles use the same Makefile for consistency:
# Uses Makefile to prepare llama.cpp binaries
RUN make prepare LLAMA_VERSION=${LLAMA_VERSION}
# Uses Makefile to build the server
RUN make build-grpcserver

This ensures Docker builds are identical to local builds.
- CUDA Support: Automatically detected if `nvcc` is in PATH
- Import Libraries: Required for linking; generated automatically by `make prepare`
- Runtime DLLs: Automatically copied to executable directories by run targets
- Apple Silicon (M1/M2/M3/M4): Metal backend enabled automatically
- Intel Macs: CPU-only by default
- libomp: Required for OpenMP support; install with `brew install libomp`
- CUDA: Detected automatically if `nvcc` is available
- Vulkan: Binary download uses the Vulkan build for cross-vendor GPU support
- Base Image: `debian:bookworm-slim` for minimal size
- Architecture: Linux x64 only (for now)
- Dependencies: Only `libgomp1` and `ca-certificates` at runtime
The project includes GitHub Actions CI that runs on every push:
- Build & Test: Builds on Ubuntu, Windows, and macOS
- Docker Integration Test: Runs containerized integration tests
- Artifacts: Uploads built binaries for each platform
See .github/workflows/ci.yml for details.
This project is licensed under the MIT License.
- LLaMA.cpp: MIT License
- gRPC-Go: Apache 2.0 License
- Protocol Buffers: BSD 3-Clause License