A self-contained, reproducible local LLM inference setup for the Intel Arc A770 (16GB VRAM) using llama.cpp with SYCL backend. Serves models via an OpenAI-compatible API with full tool/function calling support.
After evaluating five approaches (see docs/research.md), llama.cpp with SYCL was chosen because it is the only solution that:
- Actually works on consumer Intel Arc GPUs (verified on A770, A750, B580)
- Provides a full OpenAI-compatible API (`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`)
- Has native tool/function calling with parsers for Llama 3.x, Qwen 2.5, Mistral, Hermes, DeepSeek R1
- Supports all GGUF quantization formats (Q4_0, Q8_0, K-quants)
- Is actively maintained and not dependent on archived Intel projects (IPEX-LLM was archived Jan 2026)
- Can be installed natively without Docker
- GPU: Intel Arc A770 16GB (also works on A750, B580, and other Arc GPUs)
- OS: Ubuntu 22.04/24.04 or Debian 12
- Kernel: 6.2+ (for Intel Arc support; Ubuntu 22.04 HWE or later)
- Disk: ~30GB free (oneAPI toolkit + llama.cpp + model)
- RAM: 16GB+ system RAM recommended
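The requirements above can be sanity-checked with a small script (a Linux-oriented sketch using only the standard library; the thresholds mirror the list, and the function name is illustrative):

```python
import os
import platform
import shutil

def preflight(home="~"):
    """Rough check of the requirements above (Linux-oriented sketch)."""
    parts = platform.release().split("-")[0].split(".")
    kernel = (int(parts[0]), int(parts[1]) if len(parts) > 1 else 0)
    return {
        "kernel_ok": kernel >= (6, 2),  # 6.2+ needed for Intel Arc support
        "disk_free_gb": shutil.disk_usage(os.path.expanduser(home)).free / 1e9,
        # Render nodes appear here when the GPU driver is loaded (e.g. renderD128)
        "gpu_nodes": sorted(os.listdir("/dev/dri")) if os.path.isdir("/dev/dri") else [],
    }

print(preflight())
```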
# 1. Clone/download this directory
cd ~/intel-gpu-inference
# 2. Run the installer (installs drivers, oneAPI, builds llama.cpp, downloads model)
./scripts/install.sh
# 3. Log out and back in if prompted (for group membership changes)
# 4. Start the server
./scripts/run.sh
# 5. Test the API (in another terminal)
./scripts/test.sh

The API will be available at http://0.0.0.0:8080/v1 (all interfaces by default).
To restrict to localhost only:
LLAMA_HOST=127.0.0.1 ./scripts/run.sh

To have the server start automatically on boot and restart on failure:
# Install the service
sudo cp configs/llama-server.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server

Common commands:
sudo systemctl status llama-server # check status
sudo systemctl restart llama-server # restart
sudo systemctl stop llama-server # stop
journalctl -u llama-server -f               # follow logs

To change the model or settings, edit configs/env.sh and restart the service.
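Once the service is running, scripts can probe whether the API is up before sending work. A minimal readiness check using only the standard library (a sketch; it assumes the default bind address and uses the OpenAI-compatible `/v1/models` listing endpoint):

```python
import urllib.error
import urllib.request

def server_ready(base_url="http://127.0.0.1:8080", timeout=2.0):
    """Return True if the OpenAI-compatible endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(server_ready())
```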
intel-gpu-inference/
├── README.md # This file
├── scripts/
│ ├── install.sh # Full installation script
│ ├── run.sh # Server launcher with auto-tuning
│ └── test.sh # API test suite (completion, streaming, tool calling)
├── configs/
│ ├── env.sh # Environment variables (generated by install.sh)
│ └── llama-server.service # systemd service unit
├── docs/
│ ├── research.md # Detailed evaluation of all Intel GPU inference options
│ └── models.md # Model recommendations for 16GB VRAM
├── models/ # Downloaded GGUF model files (created by install.sh)
└── llama.cpp/ # llama.cpp source and build (created by install.sh)
pip install huggingface-hub
# Example: Download Llama 3.1 8B Instruct Q8
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
--local-dir ~/intel-gpu-inference/models
# Example: Download Qwen2.5 14B Q4 (larger model, smaller quant)
huggingface-cli download Qwen/Qwen2.5-14B-Instruct-GGUF \
qwen2.5-14b-instruct-q4_0.gguf \
--local-dir ~/intel-gpu-inference/models

Run a downloaded model by passing its path to run.sh:

./scripts/run.sh ~/intel-gpu-inference/models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf

./scripts/run.sh --ctx 4096 ~/intel-gpu-inference/models/qwen2.5-14b-instruct-q4_0.gguf

| Use Case | Model | Quant | VRAM | Notes |
|---|---|---|---|---|
| Tool/Function Calling | Qwen2.5-7B-Instruct | Q8_0 | ~7.5 GB | Best tool calling at this size |
| General Chat | Llama-3.1-8B-Instruct | Q8_0 | ~8 GB | Well-rounded, good instruction following |
| Coding | Qwen2.5-14B-Instruct | Q4_0 | ~8 GB | Quality jump for code tasks |
| Reasoning | DeepSeek-R1-Distill-Qwen-14B | Q4_0 | ~8 GB | Chain-of-thought reasoning |
| Long Context | Phi-3.5-mini-instruct | Q8_0 | ~4 GB | Fits 16K-32K context easily |
| Speed | Mistral-7B-Instruct-v0.3 | Q4_0 | ~4 GB | Fast with low VRAM footprint |
See docs/models.md for comprehensive model recommendations.
Important: Legacy quants (Q4_0, Q5_0, Q8_0) are significantly faster than K-quants (Q4_K_M, Q5_K_M) on Intel GPUs. The SYCL backend has optimized MUL_MAT kernels for legacy formats but not yet for K-quants.
| Format | Speed (7B) | Quality | Recommendation |
|---|---|---|---|
| Q4_0 | ~55 t/s | Acceptable | Best speed |
| Q8_0 | ~25-30 t/s | Good | Best balance for Intel |
| Q4_K_M | ~16-20 t/s | Good | Slower, use Q8_0 instead |
| Q5_K_M | ~14-18 t/s | Better | Slower, use Q8_0 instead |
Recommendation: Prefer Q8_0 when the model fits, fall back to Q4_0 for larger models.
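The VRAM figures in the tables above follow a rough rule of thumb: quantized weight size is parameter count times bits-per-weight, plus a KV cache that grows with context. A sketch (approximations only, not llama.cpp's exact accounting; the KV constant assumes a Llama-3.1-8B-style layout):

```python
# Approximate bits per weight for GGUF quants, including scale overhead:
# Q4_0 stores 32 weights in 18 bytes (4.5 bpw), Q8_0 in 34 bytes (8.5 bpw).
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q5_0": 5.5, "Q8_0": 8.5}

def est_vram_gb(n_params_b, quant, ctx=4096, n_layers=32, kv_bytes_per_tok_layer=4096):
    """Rough VRAM estimate: quantized weights plus fp16 KV cache."""
    weights_gb = n_params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9
    # KV cache per token per layer: 2 (K and V) * n_kv_heads * head_dim * 2 bytes.
    # 4096 B/token/layer matches 8 KV heads of dim 128 at fp16.
    kv_gb = ctx * n_layers * kv_bytes_per_tok_layer / 1e9
    return round(weights_gb + kv_gb, 1)

print(est_vram_gb(8, "Q8_0"))  # Llama-3.1-8B at Q8_0, 4K context → 9.0
```

On a 16GB A770 this leaves comfortable headroom for an 8B model at Q8_0, which is why the tables recommend that pairing.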
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")
# Basic completion
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
# Tool calling
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}
}]
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "What's the weather in Paris?"}],
tools=tools,
tool_choice="auto",
)
print(response.choices[0].message.tool_calls)

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://127.0.0.1:8080/v1",
api_key="not-needed",
model="default",
)
response = llm.invoke("Explain quantum computing in one sentence.")

- K-quant performance gap: Q4_K_M/Q5_K_M run ~3x slower than Q4_0/Q8_0 due to unoptimized SYCL kernels
- No flash attention on some models: Flash attention may not work with all model architectures on SYCL
- iGPU conflicts: Systems with both integrated and discrete Intel GPUs may need `ONEAPI_DEVICE_SELECTOR` configuration
- oneAPI size: The toolkit requires ~20GB of disk space
- Volunteer-maintained: SYCL backend is maintained by community contributors, not a full Intel team
- Performance vs NVIDIA: Expect roughly 50-70% of equivalent NVIDIA GPU performance for most operations
- Batch inference: Concurrent request handling works but is less optimized than on CUDA
Error: [ERROR] Failed to initialize Level Zero
# Check if Level Zero is installed
dpkg -l | grep -E "level-zero|libze"
# If missing, install (package names changed in Ubuntu 24.04 Noble):
# Ubuntu 22.04 (Jammy):
sudo apt install intel-level-zero-gpu level-zero level-zero-dev
# Ubuntu 24.04 (Noble) and later:
sudo apt install libze-intel-gpu1 libze1 libze-dev
# Verify device is accessible:
ls -la /dev/dri/
# You should see renderD128 (or similar)
# Check permissions:
groups # Should include 'render' and 'video'
# If not:
sudo usermod -aG render,video $USER
# Then log out and back in

Error: No SYCL devices found or sycl-ls shows no GPU
# Source oneAPI environment
source /opt/intel/oneapi/setvars.sh
# List SYCL devices
sycl-ls
# If only iGPU shows up, force discrete GPU:
export ONEAPI_DEVICE_SELECTOR="level_zero:1"
# Then try sycl-ls again
# If no devices show at all, check kernel driver:
dmesg | grep -i "i915\|xe"
# Ensure the driver is loaded:
lsmod | grep -E "i915|xe"

Error: failed to allocate memory or UR_RESULT_ERROR_OUT_OF_RESOURCES
# This is the #1 issue. Set the relaxed allocation limit:
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
# This is already set in configs/env.sh, but verify:
echo $UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS
# Should print: 1
# If the model still doesn't fit, use a smaller quant or reduce context:
./scripts/run.sh --ctx 2048 /path/to/model.gguf

Intel Arc GPUs can use either the i915 or xe kernel driver:
# Check which driver is in use:
lspci -k | grep -A 3 "VGA.*Intel"
# i915: Legacy driver, works on Ubuntu 22.04+
# xe: Newer driver, default on newer kernels (6.8+)
# Both work with llama.cpp SYCL. If you have issues with one, try the other.
# Force i915 (if xe causes issues):
# Add to kernel command line: i915.force_probe=56a1 xe.force_probe=!56a1
# (56a1 is the PCI ID for Arc A770; check yours with lspci)

# 1. Verify GPU is being used (not CPU fallback)
# Look for "SYCL" and "Intel" in server startup output
# 2. Use legacy quants (Q4_0, Q8_0) instead of K-quants
# Q4_K_M is ~3x slower than Q4_0 on Intel GPUs
# 3. Check thermal throttling
sudo intel_gpu_top # (from intel-gpu-tools package)
# 4. Ensure flat device hierarchy
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
# 5. Disable simultaneous multithreading if prompt processing is slow
# (Requires BIOS change - test first to verify improvement)

# Ensure oneAPI compilers are in PATH:
source /opt/intel/oneapi/setvars.sh
which icx icpx # Should show paths under /opt/intel/
# Clean rebuild:
cd ~/intel-gpu-inference/llama.cpp
rm -rf build
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j $(nproc)
# If cmake can't find SYCL, ensure you sourced setvars.sh in the SAME shell

# Try with minimal settings:
./llama.cpp/build/bin/llama-server \
--model models/your-model.gguf \
--ctx-size 2048 \
--n-gpu-layers 1
# If it works with 1 GPU layer, gradually increase:
# --n-gpu-layers 10, 20, 30, -1
# This helps identify VRAM limits
# Check dmesg for GPU errors:
dmesg | tail -20

- Use Q8_0 or Q4_0 quantization (legacy quants with optimized SYCL kernels)
- Set `UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1` (already in env.sh)
- Set `ZE_FLAT_DEVICE_HIERARCHY=FLAT` for consistent single-GPU behavior
- Right-size context: larger context means more VRAM. Start with 4096 and increase if needed
- Use `--flash-attn` when supported by the model architecture
- Monitor with `intel_gpu_top`: install via `sudo apt install intel-gpu-tools`
- Keep oneAPI updated: Newer versions often include SYCL performance improvements
| Variable | Purpose | Default |
|---|---|---|
| `UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS` | Allow >4GB VRAM allocations | 1 (set in env.sh) |
| `ZE_FLAT_DEVICE_HIERARCHY` | Device hierarchy mode | FLAT |
| `ONEAPI_DEVICE_SELECTOR` | Select specific GPU | Auto (set if iGPU conflict) |
| `LLAMA_HOST` | Server bind address | 0.0.0.0 (use 127.0.0.1 to restrict to localhost) |
| `LLAMA_PORT` | Server port | 8080 |
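Client code can read the same variables to derive the URL to connect to (a sketch; note that clients must target a concrete address such as loopback even when the server binds the 0.0.0.0 wildcard):

```python
import os

def client_base_url(env=os.environ):
    """Derive the OpenAI-compatible base URL from the server's env vars."""
    host = env.get("LLAMA_HOST", "0.0.0.0")
    port = env.get("LLAMA_PORT", "8080")
    if host == "0.0.0.0":  # wildcard bind address: connect via loopback instead
        host = "127.0.0.1"
    return f"http://{host}:{port}/v1"

print(client_base_url({}))  # → http://127.0.0.1:8080/v1
```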