tunmaker/intel-gpu-inference

Intel Arc A770 LLM Inference Stack

A self-contained, reproducible local LLM inference setup for the Intel Arc A770 (16GB VRAM) using llama.cpp with SYCL backend. Serves models via an OpenAI-compatible API with full tool/function calling support.

Why llama.cpp with SYCL?

After evaluating five approaches (see docs/research.md), llama.cpp with SYCL was chosen because it is the only solution that:

  • Actually works on consumer Intel Arc GPUs (verified on A770, A750, B580)
  • Provides a full OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/embeddings)
  • Has native tool/function calling with parsers for Llama 3.x, Qwen 2.5, Mistral, Hermes, DeepSeek R1
  • Supports all GGUF quantization formats (Q4_0, Q8_0, K-quants)
  • Is actively maintained and not dependent on archived Intel projects (IPEX-LLM was archived Jan 2026)
  • Can be installed natively without Docker

Prerequisites

  • GPU: Intel Arc A770 16GB (also works on A750, B580, and other Arc GPUs)
  • OS: Ubuntu 22.04/24.04 or Debian 12
  • Kernel: 6.2+ (for Intel Arc support; Ubuntu 22.04 HWE or later)
  • Disk: ~30GB free (oneAPI toolkit + llama.cpp + model)
  • RAM: 16GB+ system RAM recommended

Quick Start

# 1. Clone/download this directory
cd ~/intel-gpu-inference

# 2. Run the installer (installs drivers, oneAPI, builds llama.cpp, downloads model)
./scripts/install.sh

# 3. Log out and back in if prompted (for group membership changes)

# 4. Start the server
./scripts/run.sh

# 5. Test the API (in another terminal)
./scripts/test.sh

The server binds to 0.0.0.0 (all interfaces) by default, so the API is reachable at http://<your-host>:8080/v1.

To restrict to localhost only:

LLAMA_HOST=127.0.0.1 ./scripts/run.sh
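Once the server is up, you can verify it from Python before wiring up a client. This sketch polls llama-server's /health endpoint; it assumes the default host/port from run.sh, so adjust the base URL if you changed LLAMA_HOST or LLAMA_PORT:

```python
import json
import urllib.request

def server_ready(base="http://127.0.0.1:8080"):
    """Return True once llama-server reports the model is loaded."""
    try:
        # /health returns JSON such as {"status": "ok"} when ready.
        with urllib.request.urlopen(f"{base}/health", timeout=5) as resp:
            return json.load(resp).get("status") == "ok"
    except OSError:
        # Connection refused / timeout: server not up (or not yet listening).
        return False

print(server_ready())  # True once the model has finished loading
```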

Running as a systemd Service

To have the server start automatically on boot and restart on failure:

# Install the service
sudo cp configs/llama-server.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server

Common commands:

sudo systemctl status llama-server     # check status
sudo systemctl restart llama-server    # restart
sudo systemctl stop llama-server       # stop
journalctl -u llama-server -f          # follow logs

To change the model or settings, edit configs/env.sh and restart the service.

Directory Structure

intel-gpu-inference/
├── README.md              # This file
├── scripts/
│   ├── install.sh         # Full installation script
│   ├── run.sh             # Server launcher with auto-tuning
│   └── test.sh            # API test suite (completion, streaming, tool calling)
├── configs/
│   ├── env.sh             # Environment variables (generated by install.sh)
│   └── llama-server.service  # systemd service unit
├── docs/
│   ├── research.md        # Detailed evaluation of all Intel GPU inference options
│   └── models.md          # Model recommendations for 16GB VRAM
├── models/                # Downloaded GGUF model files (created by install.sh)
└── llama.cpp/             # llama.cpp source and build (created by install.sh)

How to Swap Models

Download a new model

pip install huggingface-hub

# Example: Download Llama 3.1 8B Instruct Q8
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
    --local-dir ~/intel-gpu-inference/models

# Example: Download Qwen2.5 14B Q4 (larger model, smaller quant)
huggingface-cli download Qwen/Qwen2.5-14B-Instruct-GGUF \
    qwen2.5-14b-instruct-q4_0.gguf \
    --local-dir ~/intel-gpu-inference/models

Run with a different model

./scripts/run.sh ~/intel-gpu-inference/models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf

Override context size

./scripts/run.sh --ctx 4096 ~/intel-gpu-inference/models/qwen2.5-14b-instruct-q4_0.gguf

Recommended Models by Use Case

| Use Case | Model | Quant | VRAM | Notes |
|---|---|---|---|---|
| Tool/Function Calling | Qwen2.5-7B-Instruct | Q8_0 | ~7.5 GB | Best tool calling at this size |
| General Chat | Llama-3.1-8B-Instruct | Q8_0 | ~8 GB | Well-rounded, good instruction following |
| Coding | Qwen2.5-14B-Instruct | Q4_0 | ~8 GB | Quality jump for code tasks |
| Reasoning | DeepSeek-R1-Distill-Qwen-14B | Q4_0 | ~8 GB | Chain-of-thought reasoning |
| Long Context | Phi-3.5-mini-instruct | Q8_0 | ~4 GB | Fits 16K-32K context easily |
| Speed | Mistral-7B-Instruct-v0.3 | Q4_0 | ~4 GB | Fast with low VRAM footprint |

See docs/models.md for comprehensive model recommendations.
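For planning which model/quant fits, total VRAM is roughly the GGUF file size plus the KV cache, where the KV cache is 2 tensors (K and V) per layer of shape context × kv_heads × head_dim. The sketch below applies that formula with Llama-3.1-8B's published shape (32 layers, 8 KV heads via GQA, head dimension 128); substitute your own model's metadata:

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V caches: 2 tensors per layer, each n_ctx x n_kv_heads x head_dim,
    # stored as fp16 (2 bytes per element) by default.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128.
kv = kv_cache_bytes(n_layers=32, n_ctx=8192, n_kv_heads=8, head_dim=128)
print(f"KV cache at 8K context: {kv / 2**30:.2f} GiB")  # 1.00 GiB
# Add this to the GGUF file size (~8 GB for Q8_0) for a rough VRAM total.
```

Halving the context halves the KV cache, which is why `--ctx` is the first knob to turn when a model almost fits.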

Quantization on Intel GPUs

Important: Legacy quants (Q4_0, Q5_0, Q8_0) are significantly faster than K-quants (Q4_K_M, Q5_K_M) on Intel GPUs. The SYCL backend has optimized MUL_MAT kernels for legacy formats but not yet for K-quants.

| Format | Speed (7B) | Quality | Recommendation |
|---|---|---|---|
| Q4_0 | ~55 t/s | Acceptable | Best speed |
| Q8_0 | ~25-30 t/s | Good | Best balance for Intel |
| Q4_K_M | ~16-20 t/s | Good | Slower, use Q8_0 instead |
| Q5_K_M | ~14-18 t/s | Better | Slower, use Q8_0 instead |

Recommendation: Prefer Q8_0 when the model fits, fall back to Q4_0 for larger models.
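Those throughput figures translate directly into response latency. A quick back-of-the-envelope, using midpoints of the ranges in the table above:

```python
def gen_seconds(n_tokens, tokens_per_sec):
    """Approximate decode time for a reply of n_tokens at a given speed."""
    return n_tokens / tokens_per_sec

# Midpoints of the 7B speeds listed above (tokens/second).
for fmt, tps in [("Q4_0", 55), ("Q8_0", 27), ("Q4_K_M", 18)]:
    print(f"{fmt}: 500-token reply in ~{gen_seconds(500, tps):.1f}s")
```

So a Q4_K_M reply takes roughly three times as long as the same reply from Q4_0, which is the practical cost of the K-quant kernel gap.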

Using the API

With curl

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

With Python (OpenAI client)

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

# Basic completion
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)

# Tool calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
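When the model emits a tool call, the client is responsible for executing the function and sending the result back as a `tool` message so the model can produce its final answer. This sketch builds that follow-up payload offline, with a hand-written tool call and a stand-in `get_weather` result, so it runs without a live server:

```python
import json

# A tool call as it would appear in response.choices[0].message.tool_calls,
# hand-written here so the example runs without a live server.
tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Paris"}'},
}

# 1. Execute the requested function locally (stand-in implementation).
args = json.loads(tool_call["function"]["arguments"])
result = {"location": args["location"], "temp_c": 18, "condition": "cloudy"}

# 2. Append the assistant's tool call plus your result as a "tool" message,
#    then send this list back to /v1/chat/completions for the final answer.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [tool_call]},
    {"role": "tool", "tool_call_id": tool_call["id"], "content": json.dumps(result)},
]
print(messages[-1]["role"])  # tool
```

With a live server, the second `client.chat.completions.create(model="default", messages=messages)` call returns a natural-language reply grounded in your tool result.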

With LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="not-needed",
    model="default",
)
response = llm.invoke("Explain quantum computing in one sentence.")

Known Limitations with Intel Arc

  1. K-quant performance gap: Q4_K_M/Q5_K_M run ~3x slower than Q4_0/Q8_0 due to unoptimized SYCL kernels
  2. No flash attention on some models: Flash attention may not work with all model architectures on SYCL
  3. iGPU conflicts: Systems with integrated + discrete Intel GPUs may need ONEAPI_DEVICE_SELECTOR configuration
  4. oneAPI size: The toolkit requires ~20GB of disk space
  5. Volunteer-maintained: SYCL backend is maintained by community contributors, not a full Intel team
  6. Performance vs NVIDIA: Expect roughly 50-70% of equivalent NVIDIA GPU performance for most operations
  7. Batch inference: Concurrent request handling works but is less optimized than on CUDA

Troubleshooting

Level Zero Runtime Errors

Error: [ERROR] Failed to initialize Level Zero

# Check if Level Zero is installed
dpkg -l | grep -E "level-zero|libze"

# If missing, install (package names changed in Ubuntu 24.04 Noble):
# Ubuntu 22.04 (Jammy):
sudo apt install intel-level-zero-gpu level-zero level-zero-dev
# Ubuntu 24.04 (Noble) and later:
sudo apt install libze-intel-gpu1 libze1 libze-dev

# Verify device is accessible:
ls -la /dev/dri/
# You should see renderD128 (or similar)

# Check permissions:
groups  # Should include 'render' and 'video'
# If not:
sudo usermod -aG render,video $USER
# Then log out and back in

SYCL Device Not Detected

Error: No SYCL devices found or sycl-ls shows no GPU

# Source oneAPI environment
source /opt/intel/oneapi/setvars.sh

# List SYCL devices
sycl-ls

# If only iGPU shows up, force discrete GPU:
export ONEAPI_DEVICE_SELECTOR="level_zero:1"
# Then try sycl-ls again

# If no devices show at all, check kernel driver:
dmesg | grep -i "i915\|xe"
# Ensure the driver is loaded:
lsmod | grep -E "i915|xe"

VRAM Allocation Errors

Error: failed to allocate memory or UR_RESULT_ERROR_OUT_OF_RESOURCES

# This is the #1 issue. Set the relaxed allocation limit:
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1

# This is already set in configs/env.sh, but verify:
echo $UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS
# Should print: 1

# If the model still doesn't fit, use a smaller quant or reduce context:
./scripts/run.sh --ctx 2048 /path/to/model.gguf

i915 vs xe Driver

Intel Arc GPUs can use either the i915 or xe kernel driver:

# Check which driver is in use:
lspci -k | grep -A 3 "VGA.*Intel"

# i915: Legacy driver, works on Ubuntu 22.04+
# xe: Newer driver, default on newer kernels (6.8+)
# Both work with llama.cpp SYCL. If you have issues with one, try the other.

# Force i915 (if xe causes issues):
# Add to kernel command line: i915.force_probe=56a1 xe.force_probe=!56a1
# (56a1 is the PCI ID for Arc A770; check yours with lspci)

Slow Performance

# 1. Verify GPU is being used (not CPU fallback)
# Look for "SYCL" and "Intel" in server startup output

# 2. Use legacy quants (Q4_0, Q8_0) instead of K-quants
#    Q4_K_M is ~3x slower than Q4_0 on Intel GPUs

# 3. Check thermal throttling
sudo intel_gpu_top  # (from intel-gpu-tools package)

# 4. Ensure flat device hierarchy
export ZE_FLAT_DEVICE_HIERARCHY=FLAT

# 5. Disable simultaneous multithreading if prompt processing is slow
# (Requires BIOS change - test first to verify improvement)

Build Errors

# Ensure oneAPI compilers are in PATH:
source /opt/intel/oneapi/setvars.sh
which icx icpx  # Should show paths under /opt/intel/

# Clean rebuild:
cd ~/intel-gpu-inference/llama.cpp
rm -rf build
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j $(nproc)

# If cmake can't find SYCL, ensure you sourced setvars.sh in the SAME shell

Server Crashes on Start

# Try with minimal settings:
./llama.cpp/build/bin/llama-server \
    --model models/your-model.gguf \
    --ctx-size 2048 \
    --n-gpu-layers 1

# If it works with 1 GPU layer, gradually increase:
# --n-gpu-layers 10, 20, 30, -1
# This helps identify VRAM limits

# Check dmesg for GPU errors:
dmesg | tail -20

Performance Tuning Tips

  1. Use Q8_0 or Q4_0 quantization (legacy quants with optimized SYCL kernels)
  2. Set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1 (already in env.sh)
  3. Set ZE_FLAT_DEVICE_HIERARCHY=FLAT for consistent single-GPU behavior
  4. Right-size context: Larger context = more VRAM. Start with 4096 and increase if needed
  5. Use --flash-attn when supported by the model architecture
  6. Monitor with intel_gpu_top: Install via sudo apt install intel-gpu-tools
  7. Avoid K-quants for speed-critical applications on Intel GPUs
  8. Keep oneAPI updated: Newer versions often include SYCL performance improvements

Environment Variables Reference

| Variable | Purpose | Default |
|---|---|---|
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | Allow >4GB VRAM allocations | 1 (set in env.sh) |
| ZE_FLAT_DEVICE_HIERARCHY | Device hierarchy mode | FLAT |
| ONEAPI_DEVICE_SELECTOR | Select specific GPU | Auto (set if iGPU conflict) |
| LLAMA_HOST | Server bind address | 0.0.0.0 (use 127.0.0.1 to restrict to localhost) |
| LLAMA_PORT | Server port | 8080 |
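The same settings can be applied programmatically if you launch the server from a wrapper script instead of run.sh. A minimal sketch; the binary and model paths below are placeholders taken from this repo's directory layout:

```python
import os
import subprocess

# Environment from the reference table above; inherit the rest of the shell env.
env = {
    **os.environ,
    "UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS": "1",
    "ZE_FLAT_DEVICE_HIERARCHY": "FLAT",
    "LLAMA_HOST": "127.0.0.1",  # restrict to localhost
    "LLAMA_PORT": "8080",
}

cmd = [
    "./llama.cpp/build/bin/llama-server",  # placeholder path (repo layout)
    "--model", "models/your-model.gguf",   # placeholder model path
    "--host", env["LLAMA_HOST"],
    "--port", env["LLAMA_PORT"],
]
print(" ".join(cmd))
# subprocess.run(cmd, env=env)  # uncomment to actually launch
```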
