docs: scope the OpenXLA multimodal / VLM track (#503) #567
Merged
Add spike/openxla/MULTIMODAL_VLM_DESIGN.md, the Window E scoping design for epic #493. It grounds the multimodal track in the current OpenXLA/IREE backend and covers the four deliverables: encoder execution (host-encode-first by reusing the MLX vision stack, then an on-XLA ViT/SigLIP graph on the shared attention core), projector handling, embedding injection into the LM token stream, and the serve-path change to accept multimodal requests. The load-bearing change is embedding injection: the text graph accepts only token ids and gathers params['embed'] internally, so a VLM needs a new prefill-from-embeddings graph entry, an engine and C-ABI change to seed a slot from a merged embedding sequence, and a serve-path admit that decodes, encodes, projects, and merges instead of rejecting multimodal input. Records an explicit decision to defer the implementation to the follow-up epic #566, per the issue acceptance criteria, with a staged plan (foundation, serve-path admit, encoder execution, reference VLM, breadth, audio modality).
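The split between a token-id entry and a prefill-from-embeddings entry can be illustrated with a toy model. This is a minimal sketch, not the emitter's code: `gather_embed`, `toy_body`, `prefill_from_embeddings`, and `prefill_from_tokens` are hypothetical names, and a position-weighted sum stands in for the real transformer stack. It only shows the shape of the idea, namely that the embeddings entry skips the gather and that a text-only prompt should produce the same result through either entry.

```python
# Toy illustration only: all function names here are hypothetical and the
# "model body" is a position-weighted sum, not a transformer.

def gather_embed(embed_table, token_ids):
    """Look up one embedding row per token id (the gather the text graph does internally)."""
    return [embed_table[t] for t in token_ids]


def toy_body(embeddings):
    """Stand-in for the LM stack: weight each row by its 1-based position and sum per column."""
    hidden = len(embeddings[0])
    return [
        sum((i + 1) * row[j] for i, row in enumerate(embeddings))
        for j in range(hidden)
    ]


def prefill_from_embeddings(embeddings):
    """Hypothetical new entry: takes a [seq, hidden] sequence and skips the gather."""
    return toy_body(embeddings)


def prefill_from_tokens(embed_table, token_ids):
    """Existing-style entry: token ids in, gather performed inside the graph."""
    return toy_body(gather_embed(embed_table, token_ids))


def check_text_only_equivalence(embed_table, token_ids):
    """Text-only anchor: both entries must agree when the embeddings come from the same table."""
    via_tokens = prefill_from_tokens(embed_table, token_ids)
    via_embeddings = prefill_from_embeddings(gather_embed(embed_table, token_ids))
    return via_tokens == via_embeddings
```

In this sketch the anchor holds by construction; in a real graph the same comparison would presumably be made on logits or KV state within a numerical tolerance.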
14 tasks
Summary
Scoping design for Window E of epic #493: the multimodal / VLM (and audio-language) architecture track on the OpenXLA/IREE backend (Qwen2-VL / 2.5-VL / 3-VL, Gemma3n, Phi4MM, Molmo / Molmo2, Youtu-VL). Design-doc only; no code changes.
What changed
Adds `spike/openxla/MULTIMODAL_VLM_DESIGN.md`, matching the `STAGE2_DESIGN.md` convention. It grounds the track in the current codebase and covers the four deliverables the issue asks for:
- Encoder execution: host-encode-first (reusing the MLX `VisionModule` / `prepare_and_compute_vlm_embeddings` as a preprocessor) to unblock, then an on-XLA ViT / SigLIP encoder graph built on the shared attention core as the parity goal.
- Projector handling.
- Embedding injection: the text graph accepts only `token: i32` and gathers `params['embed']` internally, so a new prefill-from-embeddings graph entry (taking a `[seq, hidden]` tensor and skipping the gather) is the prerequisite for every VLM, with a text-only equivalence anchor as its first test.
- Serve-path change: replace the multimodal rejection in `src/server/batch/xla_worker.rs` with a decode + encode + merge + embeddings-seeded prefill admit.
Design decision
Deferred to a dedicated follow-up epic, #566, per the issue acceptance criteria (design doc merged OR recorded deferral with a follow-up epic). Accepting a single multimodal request needs the prefill-from-embeddings graph entry, an engine + C-ABI change, a serve-path multimodal admit, and an encoder execution path, which is a foundation-sized effort across the emitter, the runtime shim, the engine, and the serve worker.
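The merge step inside such an admit path can be sketched as follows. This is an illustration under stated assumptions, not the design doc's implementation: `IMAGE_PLACEHOLDER`, `merge_embeddings`, and `admit_multimodal` are hypothetical names, the sentinel id `-1` is invented for the example, and `encode` and `project` are caller-supplied stand-ins for the vision encoder and projector. Images are assumed to be decoded already.

```python
# Hypothetical sketch: names and the sentinel value are assumptions for illustration.
IMAGE_PLACEHOLDER = -1  # invented sentinel id marking one image-embedding slot in the prompt


def merge_embeddings(token_ids, text_rows, image_rows):
    """Build the merged sequence: text rows stay, each placeholder takes the next image row."""
    merged = []
    remaining = iter(image_rows)
    for tok, text_row in zip(token_ids, text_rows):
        if tok == IMAGE_PLACEHOLDER:
            try:
                merged.append(next(remaining))
            except StopIteration:
                raise ValueError("more placeholders than image embedding rows") from None
        else:
            merged.append(text_row)
    # Any leftover image row means the prompt and the encoder output disagree.
    if next(remaining, None) is not None:
        raise ValueError("more image embedding rows than placeholders")
    return merged


def admit_multimodal(token_ids, embed_table, images, encode, project):
    """Encode, project, and merge; the result would seed an embeddings-based prefill."""
    # Placeholder positions get None here because merge_embeddings overwrites them.
    text_rows = [
        None if tok == IMAGE_PLACEHOLDER else embed_table[tok] for tok in token_ids
    ]
    # Each image yields several rows; project maps them to the LM hidden size.
    image_rows = [row for image in images for row in project(encode(image))]
    return merge_embeddings(token_ids, text_rows, image_rows)
```

Failing loudly on a placeholder/row count mismatch is a choice made for this sketch; the returned `[seq, hidden]` sequence is what a prefill-from-embeddings entry would consume in place of token ids.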
Test plan
- Code references (`emit_prefill`, `scale_embedding`, `Args`, `XlaBatchEngine`, `IreeRaggedLlama`, `xla_worker.rs:139`, `ModelRequest`, `VisionModule`, `get_input_embeddings`, `MergeStrategy`, `prepare_and_compute_vlm_embeddings`) verified against the current tree.

Closes #503