cyllama: a thin cython wrapper around llama.cpp #10650
## cyllama Update - November 2025

### Update on the cyllama Project

It's been nearly a year since my last announcement, and I wanted to share what's new with cyllama, the thin Cython wrapper for llama.cpp. A quick reminder: cyllama is a minimal, performant, compiled Python extension wrapping llama.cpp's core functionality. It statically links `libllama.a` and `libggml.a` for simplicity and performance (~1.2 MB wheel).

### What's Changed Since December 2024

Thanks to the targeted use of AI agents, the project has managed to keep up with the fast pace of changes at llama.cpp and is currently tracking the latest release.

#### 1. High-Level Python API

We now have a complete, Pythonic API layer that makes cyllama more pleasant to use:

```python
from cyllama import complete, chat, LLM
# Simple one-liner
response = complete("What is Python?", model_path="model.gguf")
# Reusable LLM instance (model stays loaded)
llm = LLM("model.gguf")
response = llm("Your question here")
# Multi-turn chat with proper message formatting
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing"}
]
response = chat(messages, model_path="model.gguf")
```

**Why this matters:** Previously, you had to manually manage models, contexts, samplers, and batches. Now it's automatic with sensible defaults, but full control is still available when needed.
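For contrast, here is a minimal sketch of what that "full control" path can look like. It only uses the `LlamaModel` and `LlamaContext` constructors that also appear in the multimodal example below; the tokenize/batch/decode/sample steps are left as comments because their exact cyllama API is not shown in this post and may differ.

```python
from cyllama import LlamaModel, LlamaContext

# Explicit lifecycle management: load the weights and create a context yourself.
model = LlamaModel("model.gguf")
ctx = LlamaContext(model)

# From here you would tokenize the prompt, build a batch, decode it through the
# context, and drive a sampler in a loop -- the steps that complete() and LLM()
# now handle for you with sensible defaults.
```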
#### 2. Chat Templates & Conversation Support

Full support for chat templates and multi-turn conversations through the high-level API:

```python
from cyllama import chat

# Multi-turn conversation with automatic template formatting
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is Python?"}
]
response = chat(messages, model_path="model.gguf")

# Or use the Chat class for an interactive CLI
from cyllama.llama.chat import Chat

chat_session = Chat(model_path="model.gguf")
chat_session.chat_loop()  # Interactive chat with template auto-detection
```

**Features:**
#### 3. Text-to-Speech (TTS) Support

Full TTS integration for voice generation:

```python
from cyllama.llama import TTSGenerator
tts = TTSGenerator("models/outetts-0.2-500M-Q8_0.gguf")
# Generate speech from text
tts.generate(
    text="Hello, this is a test of the text to speech system.",
    output_file="output.wav"
)
```

**Features:**
#### 4. Multimodal (LLaVA/Vision) Support

Vision-language models for image understanding:

```python
from cyllama.llama.mtmd import MultimodalProcessor, VisionLanguageChat
from cyllama import LlamaModel, LlamaContext
# Load model and create processor
model = LlamaModel("models/llava-v1.6-mistral-7b.Q4_K_M.gguf")
ctx = LlamaContext(model)
# Initialize vision processor
processor = MultimodalProcessor("models/mmproj-model-f16.gguf", model)
# Or use high-level chat interface
vision_chat = VisionLanguageChat("models/mmproj-model-f16.gguf", model, ctx)
response = vision_chat.ask_about_image("What's in this image?", "image.jpg")
```

**Capabilities:**
#### 5. Embedded HTTP Server

An embedded HTTP server with an OpenAI-compatible API:

```python
from cyllama.llama.server import PythonServer
# Create server with configuration
server = PythonServer(
    model_path="model.gguf",
    host="127.0.0.1",
    port=8080
)
# Start server (runs in background thread)
server.start()
# Server provides OpenAI-compatible endpoints:
# POST /v1/chat/completions
# POST /v1/completions
# GET /v1/models
# GET /health
```

**Server Features:**

Example with curl:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
```
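Because the endpoints are OpenAI-compatible, the running server can also be called from Python with the official `openai` client package. This is an illustration rather than part of cyllama: it assumes the `openai` package is installed and is separate from cyllama's own `OpenAIClient` shown in the next section.

```python
from openai import OpenAI

# Point the standard OpenAI client at the embedded cyllama server.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # local llama.cpp-style servers typically ignore this name
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```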
#### 6. Framework Integrations

**OpenAI-Compatible API:**

```python
from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
```

**LangChain:**

```python
from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# LLMChain needs a prompt template; define one so the example runs end to end.
prompt_template = PromptTemplate.from_template("Tell me about {topic}")

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")
```

Both work seamlessly with existing code expecting OpenAI or LangChain interfaces.

#### 7. Performance Features
#### 8. Utility Features

#### 9. Quality of Life

### Current Status

Version: 0.1.9 (November 21, 2025)

API Coverage - All Major Goals Met:
### Why This Update Matters

**Before:** You needed 50+ lines of boilerplate to do basic inference, manually managing the model lifecycle. **Now:** One line for simple cases, with full power available when needed:

```python
# Text generation - one line!
response = complete("Your prompt", model_path="model.gguf")
# Chat conversations - easy!
response = chat(messages, model_path="model.gguf")
# TTS - simple!
tts.generate("Hello world", "output.wav")
# Vision - straightforward!
response = vision_chat.ask_about_image("What's in this?", "image.jpg")
# HTTP server
server = PythonServer(model_path="model.gguf")
server.start()
```

The library is now genuinely ready for:

### Use Cases Now Supported

### Resources

### What's Next?

Potential future work:

### Feedback Welcome

As always, if you try it out:

The goal remains: stay lean, stay fast, stay current with llama.cpp, and make it easy to use from Python.
## cyllama Update - November 2025 (v0.1.12)

### What's New in cyllama

This release brings two major new capabilities: a zero-dependency Agent Framework and Stable Diffusion image generation support. A quick reminder: cyllama is a performant, compiled Cython wrapper for llama.cpp that provides both low-level access and a high-level Pythonic API. It statically links the core libraries for simplicity and performance.

### Major New Features

#### 1. Agent Framework (Zero Dependencies)

cyllama now includes a complete agent framework with three agent architectures, all with zero external dependencies.

**ReActAgent** - Reasoning + Acting agent with tool calling:

```python
from cyllama import LLM
from cyllama.agents import ReActAgent, tool
@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    return str(eval(expression))

@tool
def search(query: str) -> str:
    """Search for information."""
    return f"Results for: {query}"

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate, search])
result = agent.run("What is 25 * 4 + 10?")
print(result.answer)  # "The result is 110"
```

**ConstrainedAgent** - Grammar-enforced tool calling for 100% reliability:

```python
from cyllama.agents import ConstrainedAgent
# Uses GBNF grammars to guarantee valid JSON tool calls
agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4") # Always produces valid tool callsContractAgent - Contract-based agent with pre/post conditions (C++26-inspired): from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy
**ContractAgent** - Contract-based agent with pre/post conditions (C++26-inspired):

```python
from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,  # AUDIT, ENFORCE, or DISABLED
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
```

**Key Features:**
See contract_agent.md for detailed ContractAgent documentation.

#### 2. Stable Diffusion Integration

Full integration of stable-diffusion.cpp for image and video generation.

**Simple Text-to-Image:**

```python
from cyllama.stablediffusion import text_to_image
images = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat sitting on a windowsill",
    width=512,
    height=512,
    sample_steps=4,  # Turbo models need fewer steps
    cfg_scale=1.0
)
images[0].save("output.png")
```

**Advanced Generation with SDContext:**

```python
from cyllama.stablediffusion import (
    SDContext, SDContextParams,
    SampleMethod, Scheduler,
    set_progress_callback
)

# Progress tracking
def progress_cb(step, steps, time_ms):
    pct = (step / steps) * 100
    print(f'Step {step}/{steps} ({pct:.1f}%)')

set_progress_callback(progress_cb)

# Create context with full control
params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4
params.vae_path = "models/vae.safetensors"  # Optional
ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape at sunset",
    negative_prompt="blurry, ugly, distorted",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE,
    seed=42
)
```

**Image-to-Image:**

```python
from cyllama.stablediffusion import image_to_image, SDImage
init_img = SDImage.load("input.png")
images = image_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    init_image=init_img,
    prompt="make it a watercolor painting",
    strength=0.75
)
```

**ESRGAN Upscaling:**

```python
from cyllama.stablediffusion import Upscaler, SDImage
upscaler = Upscaler("models/esrgan-x4.bin")
img = SDImage.load("small.png")
upscaled = upscaler.upscale(img) # 4x resolution
upscaled.save("large.png")
```

**ControlNet with Canny Preprocessing:**

```python
from cyllama.stablediffusion import SDImage, canny_preprocess
img = SDImage.load("photo.png")
canny_preprocess(img, high_threshold=0.8, low_threshold=0.1)
# Use img as the control image for ControlNet generation
```
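As a follow-on sketch, the edge map could then be supplied as the ControlNet conditioning image during generation, reusing the `SDContext` (`ctx`) from the advanced example above. The `control_image` keyword below is hypothetical (the actual parameter name is not shown in this post), so treat it as an illustration of the workflow rather than the exact API.

```python
# Hypothetical continuation: pass the Canny edge map as ControlNet conditioning.
# The control_image keyword argument is an assumption for illustration purposes.
images = ctx.generate(
    prompt="a watercolor rendering of the same scene",
    control_image=img,  # the Canny-preprocessed SDImage from above
    width=512,
    height=512,
)
images[0].save("controlnet_output.png")
```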
**CLI Tool:**

```bash
# Generate image
python -m cyllama.stablediffusion generate \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset over mountains" \
    --output sunset.png \
    --steps 4 --cfg 1.0 --progress

# Upscale image
python -m cyllama.stablediffusion upscale \
    --model models/esrgan-x4.bin \
    --input image.png \
    --output image_4x.png

# Convert model format
python -m cyllama.stablediffusion convert \
    --input sd-v1-5.safetensors \
    --output sd-v1-5-q4_0.gguf \
    --type q4_0

# Show system info
python -m cyllama.stablediffusion info
```

**Supported Models:**
**Key Features:**

#### 3. Agent Client Protocol (ACP) Support

New ACP implementation for editor/IDE integration:

```python
from cyllama.agents import ACPAgent
# ACP agent for editor integration (Zed, Neovim, etc.)
agent = ACPAgent(model_path="model.gguf")
agent.run()  # Starts a JSON-RPC server over stdio
```

**Features:**
### Current Status

Version: 0.1.12 (November 2025)

API Coverage - All Major Goals Met:
### Why This Update Matters

**Agents without dependencies:** Build tool-using AI agents with just cyllama - no LangChain, no AutoGen, no external frameworks required. Three architectures cover different reliability/flexibility tradeoffs.

**Image generation in Python:** Generate images with the same library you use for LLM inference. Full control over samplers, schedulers, and all generation parameters. Support for the latest models including SDXL Turbo, SD3, and FLUX.

**Production-ready:** 600+ tests, comprehensive documentation, proper error handling. Ready for both quick prototyping and production use.

### Quick Start Examples

```python
# Text generation
from cyllama import complete
response = complete("What is Python?", model_path="model.gguf")
# Agent with tools
from cyllama import LLM
from cyllama.agents import ReActAgent, tool
@tool
def get_weather(city: str) -> str:
return f"Weather in {city}: Sunny, 72F"
agent = ReActAgent(llm=LLM("model.gguf"), tools=[get_weather])
result = agent.run("What's the weather in Paris?")
# Image generation
from cyllama.stablediffusion import text_to_image
images = text_to_image(
model_path="sd_xl_turbo_1.0.q8_0.gguf",
prompt="a cyberpunk cityscape",
sample_steps=4
)
images[0].save("cityscape.png")
# Speech transcription
from cyllama.whisper import WhisperContext
ctx = WhisperContext("whisper-base.bin")
result = ctx.transcribe("audio.wav")
print(result.text)
```

### Resources
Hi folks,
Ok, this is my show and tell 😄
In case anyone's interested, I've been working for some time on the open-source cyllama project, a thin Cython wrapper for llama.cpp. It was spun off from an earlier, now frozen, llama.cpp wrapper project, llamalib, which provided early-stage but functional wrappers using Cython, pybind11, and nanobind.
In cyllama, `libllama.a`, `libggml.a`, and other related static libs are statically linked into the Python extension for simplicity and performance: as a wheel it's around 1.2 MB. It can perform basic inference via a high-level and a lower-level interface wrapping `llama.h` and parts of `common.h` and others as necessary (see the short usage sketch after the goals below). It generally tries to keep up with the latest changes in llama.cpp while maintaining some stability, in the sense that all tests pass and compilation is error-free between updates.

Development goals are to:
- Stay up-to-date with bleeding-edge llama.cpp.
- Produce a minimal, performant, compiled, thin Python wrapper around the core llama-cli feature set of llama.cpp.
- Integrate and wrap llava-cli features.
- Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp.
- Learn about the internals of this popular C/C++ LLM inference engine along the way. This is definitely the most efficient way, for me at least, to learn about the underlying technologies.
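Here is the short usage sketch mentioned above: a minimal example of the high-level interface, using the `complete()` helper documented in the update posts in this thread (the model path is a placeholder).

```python
from cyllama import complete

# One-shot completion via the high-level API; model loading, context creation,
# and sampling are handled internally with default settings.
response = complete("Explain what a GGUF file is.", model_path="model.gguf")
print(response)
```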
If you try it, please provide feedback, ask questions, post bugs, and so on; any contributions are welcome!