cyllama: a thin cython wrapper around llama.cpp #10650

shakfu · 2024-12-04T09:18:17Z

shakfu
Dec 4, 2024

H folks,

Ok, this is my show and tell 😄

In case anyone's interested, I've been working for some time on the open-source cyllama project, a thin cython wrapper for llama.cpp. It was spun-off from an earlier, now frozen, llama.cpp wrapper project, llamalib which provided early stage, but functional, wrappers using cython, pybind11, and nanobind.

In cyllama, libllama.a, libggml.a, and other related static libs are statically linked to this python extension for simplicity and performance: as a wheel it's around 1.2 MB. It can perform basic inference via a high-level and lower-level interface wrapping llama.h and parts of common.h and others as necessary. It generally tries to keep up with the latest changes in llama.cpp while maintaining some kind of stability in terms of all tests passing and error-free compilation in between updates.

Development goals are to:

Stay up-to-date with bleeding-edge llama.cpp.
Produce a minimal, performant, compiled, thin python wrapper around the core llama-cli feature-set of llama.cpp.
Integrate and wrap llava-cli features.
Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp
Learn about the internals of this popular C++/C LLM inference engine along the way. This is definitely the most efficient way, for me at least, to learn about the underlying technologies.

If you try it, please provide feedback, ask questions, post-bugs, etc., -- any contributions are welcome!

shakfu · 2025-11-22T14:23:56Z

shakfu
Nov 22, 2025
Author

cyllama Update - November 2025

Update on the cyllama Project

It's been nearly a year since my last announcement, and I wanted to share what's new with cyllama - the thin cython wrapper for llama.cpp.

A quick reminder: cyllama is a minimal, performant, compiled Python extension wrapping llama.cpp's core functionality. It statically links libllama.a and libggml.a for simplicity and performance (~1.2 MB wheel).

What's Changed Since December 2024

Thanks to the targeted use of AI agents, the project has managed to keep up with the fast pace of changes at llama.cpp and is currently tracking release b7126. Here is a summary of some changes since the last post.

1. High-Level Python API

We now have a complete, Pythonic API layer that makes cyllama more pleasant to use:

from cyllama import complete, chat, LLM

# Simple one-liner
response = complete("What is Python?", model_path="model.gguf")

# Reusable LLM instance (model stays loaded)
llm = LLM("model.gguf")
response = llm("Your question here")

# Multi-turn chat with proper message formatting
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing"}
]
response = chat(messages, model_path="model.gguf")

Why this matters: Previously, you had to manually manage models, contexts, samplers, and batches. Now it's automatic with sensible defaults, but full control is still available when needed.

2. Chat Templates & Conversation Support

Full support for chat templates and multi-turn conversations through the high-level API:

from cyllama import chat

# Multi-turn conversation with automatic template formatting
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is Python?"}
]
response = chat(messages, model_path="model.gguf")

# Or use the Chat class for interactive CLI
from cyllama.llama.chat import Chat

chat_session = Chat(model_path="model.gguf")
chat_session.chat_loop()  # Interactive chat with template auto-detection

Features:

Automatic chat template detection from model metadata
Supports built-in templates (ChatML, Llama-3, Mistral, etc.)
Custom template support via LlamaChatMessage and chat_apply_template()
Conversation history management

3. Text-to-Speech (TTS) Support

Full TTS integration for voice generation:

from cyllama.llama import TTSGenerator

tts = TTSGenerator("models/outetts-0.2-500M-Q8_0.gguf")

# Generate speech from text
tts.generate(
    text="Hello, this is a test of the text to speech system.",
    output_file="output.wav"
)

Features:

Supports OuteTTS and similar TTS models
WAV file output with configurable sample rate
Speaker voice cloning support
Handles text preprocessing (numbers to words, etc.)
Streaming audio generation

4. Multimodal (LLAVA/Vision) Support

Vision-language models for image understanding:

from cyllama.llama.mtmd import MultimodalProcessor, VisionLanguageChat
from cyllama import LlamaModel, LlamaContext

# Load model and create processor
model = LlamaModel("models/llava-v1.6-mistral-7b.Q4_K_M.gguf")
ctx = LlamaContext(model)

# Initialize vision processor
processor = MultimodalProcessor("models/mmproj-model-f16.gguf", model)

# Or use high-level chat interface
vision_chat = VisionLanguageChat("models/mmproj-model-f16.gguf", model, ctx)
response = vision_chat.ask_about_image("What's in this image?", "image.jpg")

Capabilities:

Image understanding and description
Visual question answering
Support for multiple images in conversation
Works with LLAVA, BakLLaVA, and similar vision-language models
Automatic vision capability detection

5. Embedded HTTP Server

Embedded HTTP servers with OpenAI-compatible API:

from cyllama.llama.server import PythonServer

# Create server with configuration
server = PythonServer(
    model_path="model.gguf",
    host="127.0.0.1",
    port=8080
)

# Start server (runs in background thread)
server.start()

# Server provides OpenAI-compatible endpoints:
# POST /v1/chat/completions
# POST /v1/completions
# GET /v1/models
# GET /health

Server Features:

OpenAI API compatibility (drop-in replacement)
Streaming support (SSE)
CORS support
Graceful shutdown
Thread-safe request handling
Includes multiple server implementations:
- PythonServer: Python-based with threading
- EmbeddedServer: High-performance C-based server using Mongoose
- LlamaServer: Python wrapper around the llama.cpp server binary (if it can be found)

Example with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

6. Framework Integrations

OpenAI-Compatible API:

from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)

LangChain:

from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")

Both work seamlessly with existing code expecting OpenAI or LangChain interfaces.

7. Performance Features

Speculative Decoding: 2-3x speedup using draft models
N-gram Cache: 2-10x speedup for repetitive patterns (great for code completion)
Memory Optimization: Automatic GPU layer estimation based on available VRAM

8. Utility Features

GGUF Manipulation: Read/write model files, inspect/modify metadata
JSON Schema → Grammar: Generate structured output with type safety
Model Downloads: Ollama-style downloads from HuggingFace (download_model("user/repo:quantization"))

9. Quality of Life

Logging: Debug output disabled by default (add verbose=True to enable)
Documentation: Comprehensive user guide, API reference, and cookbook (1,200+ lines total)
Tests: 276 passing tests for reliability
API Clarity: Renamed generate() → complete() and Generator → LLM for better semantics

Current Status

Version: 0.1.9 (November 21, 2025)
llama.cpp Version: b7126 (tracking bleeding-edge)
whisper.cpp: Integrated and tested
Tests: 276 passing
Platform: macOS (primary), Linux (tested)

API Coverage - All Major Goals Met:

Core llama.cpp wrapper (complete)
High-level Python API (complete)
llava-cli features (multimodal complete)
whisper.cpp integration (complete)
Chat templates and conversation support (complete)
TTS support (complete)
HTTP server with OpenAI API (complete)
stable-diffusion.cpp (future)

Why This Update Matters

Before: You needed 50+ lines of boilerplate to do basic inference, manually managing model lifecycle.

Now: One line for simple cases, with full power available when needed:

# Text generation - one line!
response = complete("Your prompt", model_path="model.gguf")

# Chat conversations - easy!
response = chat(messages, model_path="model.gguf")

# TTS - simple!
tts.generate("Hello world", "output.wav")

# Vision - straightforward!
response = vision_chat.ask_about_image("What's in this?", "image.jpg")

# HTTP server
server = PythonServer(model_path="model.gguf")
server.start()

The library is now genuinely ready for:

Quick prototyping and experiments
Chat applications with proper conversation handling
Voice applications (TTS)
Vision/multimodal applications (LLAVA)
API servers (OpenAI-compatible)
Integration into existing Python stacks (FastAPI, Flask, LangChain)
Performance-critical applications (speculative decoding, n-gram caching)

Use Cases Now Supported

Text Generation: Simple completions, structured output
Chat Applications: Multi-turn conversations with template support
Voice Applications: Text-to-speech with WAV output
Vision Applications: Image understanding and visual Q&A
API Services: HTTP servers with OpenAI compatibility
Framework Integration: Works with LangChain, OpenAI clients
Performance: Speculative decoding, n-gram caching

Resources

Repo: https://github.com/shakfu/cyllama
Docs: See docs/ directory (user guide, API reference, cookbook)
Examples: See tests/examples/ directory
- Chat applications
- TTS examples
- Multimodal demos
- Server implementations

What's Next?

Potential future work:

Async API support (async def complete_async())
Response caching
RAG utilities
stable-diffusion.cpp integration

Feedback Welcome

As always, if you try it out:

Questions? Ask away!
Bugs? Please report them!
Features? Suggestions welcome!
Contributions? Pull requests accepted!

The goal remains: stay lean, stay fast, stay current with llama.cpp, and make it easy to use from Python.

0 replies

shakfu · 2025-11-27T05:45:15Z

shakfu
Nov 27, 2025
Author

cyllama Update - November 2025 (v0.1.12)

What's New in cyllama

This release brings two major new capabilities: a zero-dependency Agent Framework and Stable Diffusion image generation support.

A quick reminder: cyllama is a performant, compiled Cython wrapper for llama.cpp that provides both low-level access and a high-level Pythonic API. It statically links the core libraries for simplicity and performance.

Major New Features

1. Agent Framework (Zero Dependencies)

cyllama now includes a complete agent framework with three agent architectures, all with zero external dependencies:

ReActAgent - Reasoning + Acting agent with tool calling:

from cyllama import LLM
from cyllama.agents import ReActAgent, tool

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    return str(eval(expression))

@tool
def search(query: str) -> str:
    """Search for information."""
    return f"Results for: {query}"

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate, search])
result = agent.run("What is 25 * 4 + 10?")
print(result.answer)  # "The result is 110"

ConstrainedAgent - Grammar-enforced tool calling for 100% reliability:

from cyllama.agents import ConstrainedAgent

# Uses GBNF grammars to guarantee valid JSON tool calls
agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Always produces valid tool calls

ContractAgent - Contract-based agent with pre/post conditions (C++26-inspired):

from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,  # AUDIT, ENFORCE, or DISABLED
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")

Key Features:

Zero external dependencies - uses only Python stdlib
Three agent architectures for different use cases
@tool decorator for easy function registration
Automatic JSON schema generation from type hints
Grammar-constrained generation for reliable tool calls
Contract-based validation with configurable policies
Comprehensive audit logging

See contract_agent.md for detailed ContractAgent documentation.

2. Stable Diffusion Integration

Full integration of stable-diffusion.cpp for image and video generation:

Simple Text-to-Image:

from cyllama.stablediffusion import text_to_image

images = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat sitting on a windowsill",
    width=512,
    height=512,
    sample_steps=4,  # Turbo models need fewer steps
    cfg_scale=1.0
)
images[0].save("output.png")

Advanced Generation with SDContext:

from cyllama.stablediffusion import (
    SDContext, SDContextParams,
    SampleMethod, Scheduler,
    set_progress_callback
)

# Progress tracking
def progress_cb(step, steps, time_ms):
    pct = (step / steps) * 100
    print(f'Step {step}/{steps} ({pct:.1f}%)')

set_progress_callback(progress_cb)

# Create context with full control
params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4
params.vae_path = "models/vae.safetensors"  # Optional

ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape at sunset",
    negative_prompt="blurry, ugly, distorted",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE,
    seed=42
)

Image-to-Image:

from cyllama.stablediffusion import image_to_image, SDImage

init_img = SDImage.load("input.png")
images = image_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    init_image=init_img,
    prompt="make it a watercolor painting",
    strength=0.75
)

ESRGAN Upscaling:

from cyllama.stablediffusion import Upscaler, SDImage

upscaler = Upscaler("models/esrgan-x4.bin")
img = SDImage.load("small.png")
upscaled = upscaler.upscale(img)  # 4x resolution
upscaled.save("large.png")

ControlNet with Canny Preprocessing:

from cyllama.stablediffusion import SDImage, canny_preprocess

img = SDImage.load("photo.png")
canny_preprocess(img, high_threshold=0.8, low_threshold=0.1)
# Use img as control image for ControlNet generation

CLI Tool:

# Generate image
python -m cyllama.stablediffusion generate \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset over mountains" \
    --output sunset.png \
    --steps 4 --cfg 1.0 --progress

# Upscale image
python -m cyllama.stablediffusion upscale \
    --model models/esrgan-x4.bin \
    --input image.png \
    --output image_4x.png

# Convert model format
python -m cyllama.stablediffusion convert \
    --input sd-v1-5.safetensors \
    --output sd-v1-5-q4_0.gguf \
    --type q4_0

# Show system info
python -m cyllama.stablediffusion info

Supported Models:

SD 1.x/2.x - Standard Stable Diffusion
SDXL/SDXL Turbo - High-quality generation (use cfg_scale=1.0, steps=1-4 for Turbo)
SD3/SD3.5 - Latest Stable Diffusion 3.x
FLUX - FLUX.1 models (dev, schnell)
Wan/CogVideoX - Video generation (use generate_video())
LoRA - Low-rank adaptation files
ControlNet - Conditional generation
ESRGAN - Image upscaling

Key Features:

Full numpy/PIL integration (SDImage.to_numpy(), SDImage.to_pil())
Progress, log, and preview callbacks
All samplers (Euler, Euler_A, DPM2, DPMPP2M, LCM, etc.)
All schedulers (Discrete, Karras, Exponential, AYS, etc.)
Model conversion with quantization support
Video generation for compatible models

3. Agent Client Protocol (ACP) Support

New ACP implementation for editor/IDE integration:

from cyllama.agents import ACPAgent

# ACP agent for editor integration (Zed, Neovim, etc.)
agent = ACPAgent(model_path="model.gguf")
agent.run()  # Starts JSON-RPC server over stdio

Features:

JSON-RPC 2.0 transport over stdio
Session management (new, load, prompt, cancel)
Tool permission flow for user approval
File operations delegated to editor
Terminal operations support

Current Status

Version: 0.1.12 (November 2025)
llama.cpp Version: b7126 (tracking bleeding-edge)
stable-diffusion.cpp: Integrated and tested
whisper.cpp: Integrated and tested
Tests: 600+ passing
Platform: macOS (primary), Linux (tested)

API Coverage - All Major Goals Met:

Component	Status
Core llama.cpp wrapper	Complete
High-level Python API	Complete
Agent Framework	Complete
Stable Diffusion	Complete
Multimodal (LLAVA)	Complete
Whisper.cpp	Complete
TTS	Complete
HTTP Servers	Complete
Framework Integrations	Complete

Why This Update Matters

Agents without dependencies: Build tool-using AI agents with just cyllama - no LangChain, no AutoGen, no external frameworks required. Three architectures cover different reliability/flexibility tradeoffs.

Image generation in Python: Generate images with the same library you use for LLM inference. Full control over samplers, schedulers, and all generation parameters. Support for the latest models including SDXL Turbo, SD3, and FLUX.

Production-ready: 600+ tests, comprehensive documentation, proper error handling. Ready for both quick prototyping and production use.

Quick Start Examples

# Text generation
from cyllama import complete
response = complete("What is Python?", model_path="model.gguf")

# Agent with tools
from cyllama import LLM
from cyllama.agents import ReActAgent, tool

@tool
def get_weather(city: str) -> str:
    return f"Weather in {city}: Sunny, 72F"

agent = ReActAgent(llm=LLM("model.gguf"), tools=[get_weather])
result = agent.run("What's the weather in Paris?")

# Image generation
from cyllama.stablediffusion import text_to_image
images = text_to_image(
    model_path="sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a cyberpunk cityscape",
    sample_steps=4
)
images[0].save("cityscape.png")

# Speech transcription
from cyllama.whisper import WhisperContext
ctx = WhisperContext("whisper-base.bin")
result = ctx.transcribe("audio.wav")
print(result.text)

Resources

Repo: https://github.com/shakfu/cyllama
Docs: See docs/ directory
Examples: See tests/examples/ directory
- Agent examples (agent_*.py)
- Stable Diffusion examples (stablediffusion_*.py)
- Server implementations
- Multimodal demos

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cyllama: a thin cython wrapper around llama.cpp #10650

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

cyllama: a thin cython wrapper around llama.cpp #10650

Uh oh!

Uh oh!

shakfu Dec 4, 2024

Replies: 2 comments

Uh oh!

shakfu Nov 22, 2025 Author

cyllama Update - November 2025

Update on the cyllama Project

What's Changed Since December 2024

1. High-Level Python API

2. Chat Templates & Conversation Support

3. Text-to-Speech (TTS) Support

4. Multimodal (LLAVA/Vision) Support

5. Embedded HTTP Server

6. Framework Integrations

7. Performance Features

8. Utility Features

9. Quality of Life

Current Status

Why This Update Matters

Use Cases Now Supported

Resources

What's Next?

Feedback Welcome

Uh oh!

shakfu Nov 27, 2025 Author

cyllama Update - November 2025 (v0.1.12)

What's New in cyllama

Major New Features

1. Agent Framework (Zero Dependencies)

2. Stable Diffusion Integration

3. Agent Client Protocol (ACP) Support

Current Status

Why This Update Matters

Quick Start Examples

Resources

shakfu
Dec 4, 2024

shakfu
Nov 22, 2025
Author

shakfu
Nov 27, 2025
Author