Summary
Image input to Ollama vision models (e.g. llava) is silently dropped. OllamaLanguageModel base64-encodes the image correctly but attaches it at the top level of the request body (images) while POSTing to /api/chat, which only reads images from inside each message (messages[].images). Top-level images is an /api/generate field; /api/chat ignores it, so the model never receives the image and responds as if only text had been sent.
Environment
- AnyLanguageModel 0.8.0
- Backend: Ollama (/api/chat), model llava
Steps to reproduce
let model = OllamaLanguageModel(baseURL: URL(string: "http://localhost:11434")!, model: "llava")
let session = LanguageModelSession(model: model)
let image = Transcript.ImageSegment(/* JPEG data */)
let reply = try await session.respond(to: "What's in this image?", images: [image])
// reply describes nothing / asks for an image — the image was never delivered
Root cause
In Sources/AnyLanguageModel/Models/OllamaLanguageModel.swift:
The user message is built without images, and images are passed separately into createChatParams, which places them at the top level:
let (ollamaText, ollamaImages) = convertSegmentsToOllama(userSegments)
let messages = [ OllamaMessage(role: .user, content: ollamaText) ] // no images
...
let params = try createChatParams(
..., images: ollamaImages.isEmpty ? nil : ollamaImages, ... // top-level
)
let url = baseURL.appendingPathComponent("api/chat") // chat endpoint
createChatParams writes them to the request root:
if let images, !images.isEmpty {
params["images"] = .array(images.map { .string($0) }) // wrong level for /api/chat
}
And OllamaMessage has no images field at all:
private struct OllamaMessage: Hashable, Codable, Sendable {
enum Role: String, ... { case system, user, assistant, tool }
let role: Role
let content: String
// no `images`
}
Ollama's /api/chat expects:
{ "model": "llava", "messages": [ { "role": "user", "content": "...", "images": ["<base64>"] } ] }
This affects both the non-streaming and streaming chat paths (both build messages this way and POST to /api/chat).
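To make the difference between the two request shapes concrete, here is a minimal, self-contained sketch. It does not use the library; the dictionaries and the helper imagesVisibleToChat are hypothetical and only mirror the JSON shapes described above.

```swift
import Foundation

// Placeholder payload; a real request would carry base64-encoded image bytes.
let base64Image = "<base64>"

// Shape currently sent: images at the request root, which /api/chat ignores.
let plainUserMessage: [String: Any] = [
    "role": "user",
    "content": "What's in this image?",
]
let currentBody: [String: Any] = [
    "model": "llava",
    "messages": [plainUserMessage],
    "images": [base64Image],
]

// Shape /api/chat reads: images nested inside the user message.
let userMessageWithImages: [String: Any] = [
    "role": "user",
    "content": "What's in this image?",
    "images": [base64Image],
]
let expectedBody: [String: Any] = [
    "model": "llava",
    "messages": [userMessageWithImages],
]

/// Hypothetical helper: collects the images found inside messages[],
/// the only place the chat endpoint looks for them.
func imagesVisibleToChat(in body: [String: Any]) -> [String] {
    let messages = (body["messages"] as? [[String: Any]]) ?? []
    return messages.flatMap { ($0["images"] as? [String]) ?? [] }
}

print(imagesVisibleToChat(in: currentBody))   // []
print(imagesVisibleToChat(in: expectedBody))  // ["<base64>"]
```

A check along these lines on the serialized request body could serve as a regression test for both the streaming and non-streaming paths.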
Suggested fix
- Add an optional images field to OllamaMessage:
private struct OllamaMessage: Hashable, Codable, Sendable {
let role: Role
let content: String
let images: [String]?
}
- Attach the images to the user message instead of the request root:
let messages = [
OllamaMessage(role: .user, content: ollamaText,
images: ollamaImages.isEmpty ? nil : ollamaImages)
]
- Remove the top-level images handling from createChatParams (and drop the images: parameter), since /api/chat doesn't use it.
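As a rough check of the proposed message shape, the sketch below encodes a stand-in OllamaMessage with JSONEncoder. It is an illustration under assumptions: ChatRequest is a hypothetical wrapper invented for this example, and the real code builds its request parameters differently. It relies on the synthesized Codable conformance skipping optional properties that are nil, so messages without images keep their current JSON shape.

```swift
import Foundation

// Stand-in for the private OllamaMessage type with the proposed field added.
struct OllamaMessage: Hashable, Codable, Sendable {
    enum Role: String, Hashable, Codable, Sendable {
        case system, user, assistant, tool
    }
    let role: Role
    let content: String
    let images: [String]?  // nil is omitted from the encoded JSON
}

// Hypothetical request wrapper, used only in this sketch.
struct ChatRequest: Codable {
    let model: String
    let messages: [OllamaMessage]
}

let request = ChatRequest(
    model: "llava",
    messages: [
        OllamaMessage(role: .system, content: "Be brief.", images: nil),
        OllamaMessage(role: .user, content: "What's in this image?", images: ["<base64>"]),
    ]
)

let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]  // stable key order for inspection
let json = String(decoding: try encoder.encode(request), as: UTF8.self)
print(json)
```

In the printed JSON only the user message carries an images key; the system message has none, and nothing is written at the request root.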