docs: scope the OpenXLA multimodal / VLM track (#503) #567

Merged
inureyes merged 1 commit into main from feature/issue-503-multimodal-vlm-design
Jul 1, 2026
Conversation

@inureyes inureyes commented Jul 1, 2026

Member

Summary

Scoping design for Window E of epic #493: the multimodal / VLM (and audio-language) architecture track on the OpenXLA/IREE backend (Qwen2-VL / 2.5-VL / 3-VL, Gemma3n, Phi4MM, Molmo / Molmo2, Youtu-VL). Design-doc only; no code changes.

What changed

  • Add spike/openxla/MULTIMODAL_VLM_DESIGN.md, matching the STAGE2_DESIGN.md convention. It grounds the track in the current codebase and covers the four deliverables the issue asks for:
    • Encoder execution: host-encode-first (reuse the MLX vision stack VisionModule / prepare_and_compute_vlm_embeddings as a preprocessor) to unblock, then an on-XLA ViT / SigLIP encoder graph built on the shared attention core as the parity goal.
    • Projector handling: the linear / MLP / pool connector, host-side or as a small emitted graph.
    • Embedding injection into the LM token stream: the load-bearing emitter change. Today the graph accepts only token: i32 and gathers params['embed'] internally, so a new prefill-from-embeddings graph entry (taking a [seq, hidden] tensor and skipping the gather) is the prerequisite for every VLM, with a text-only equivalence anchor as its first test.
    • Serve-path changes: replace the multimodal rejection in src/server/batch/xla_worker.rs with a decode + encode + merge + embeddings-seeded prefill admit.
  • Also covers the M-RoPE (Qwen2-VL) and Gemma3 bidirectional image-mask subtleties, reusing the hooks from #494 (refactor: share the full per-layer attention core across all emitter graph kinds) and #495 (feat: implement Gemma2 sliding-window attention (local/global alternation)), and the audio modality (Phi4MM, Gemma3n).
  • Records an explicit decision to DEFER the implementation, with a staged plan.
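The text-only equivalence anchor described under embedding injection can be sketched as follows. This is a toy illustration in plain Python, not the emitter's code: `gather`, `forward_from_embeddings`, `prefill_tokens`, `prefill_embeddings`, and the `proj` weight are hypothetical names, and the single matrix multiply stands in for the whole LM body. The point is only that feeding the gathered rows through an embeddings entry should reproduce the token entry exactly.

```python
from typing import Dict, List

Matrix = List[List[float]]


def gather(embed: Matrix, tokens: List[int]) -> Matrix:
    """Look up one embedding row per token id (done inside the text graph today)."""
    return [list(embed[t]) for t in tokens]


def forward_from_embeddings(params: Dict[str, Matrix], x: Matrix) -> Matrix:
    """Stand-in for the LM body: consumes a [seq, hidden] tensor, no gather."""
    w = params["proj"]  # hypothetical [hidden, hidden] weight
    return [
        [sum(row[k] * w[k][j] for k in range(len(row))) for j in range(len(w[0]))]
        for row in x
    ]


def prefill_tokens(params: Dict[str, Matrix], tokens: List[int]) -> Matrix:
    """Token-id entry: the gather happens inside the graph."""
    return forward_from_embeddings(params, gather(params["embed"], tokens))


def prefill_embeddings(params: Dict[str, Matrix], x: Matrix) -> Matrix:
    """Embeddings entry (hypothetical): caller supplies [seq, hidden], gather skipped."""
    return forward_from_embeddings(params, x)


params = {
    "embed": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],  # [vocab=3, hidden=2]
    "proj": [[1.0, 2.0], [3.0, 4.0]],
}
tokens = [2, 0, 1]

via_tokens = prefill_tokens(params, tokens)
via_embeds = prefill_embeddings(params, gather(params["embed"], tokens))

# Text-only equivalence: both entries must agree when the embeddings are
# exactly the rows the token path would have gathered.
assert via_tokens == via_embeds
```

Once this holds for text-only input, any difference seen with multimodal input can be attributed to the merged embeddings rather than to the new graph entry.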

Design decision

Deferred to a dedicated follow-up epic, #566, per the issue acceptance criteria (design doc merged OR recorded deferral with a follow-up epic). Accepting a single multimodal request needs the prefill-from-embeddings graph entry, an engine + C-ABI change, a serve-path multimodal admit, and an encoder execution path, which is a foundation-sized effort across the emitter, the runtime shim, the engine, and the serve worker.

Test plan

  • Documentation-only change; no source, build, or test surface touched.
  • File references in the doc (emit_prefill, scale_embedding, Args, XlaBatchEngine, IreeRaggedLlama, xla_worker.rs:139, ModelRequest, VisionModule, get_input_embeddings, MergeStrategy, prepare_and_compute_vlm_embeddings) verified against the current tree.
  • No AI attribution and no em dashes.

Closes #503

Add spike/openxla/MULTIMODAL_VLM_DESIGN.md, the Window E scoping design for epic #493. It grounds the multimodal track in the current OpenXLA/IREE backend and covers the four deliverables: encoder execution (host-encode-first by reusing the MLX vision stack, then an on-XLA ViT/SigLIP graph on the shared attention core), projector handling, embedding injection into the LM token stream, and the serve-path change to accept multimodal requests.

The load-bearing change is embedding injection: the text graph accepts only token ids and gathers params['embed'] internally, so a VLM needs a new prefill-from-embeddings graph entry, an engine and C-ABI change to seed a slot from a merged embedding sequence, and a serve-path admit that decodes, encodes, projects, and merges instead of rejecting multimodal input.
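A minimal sketch of the merge step, assuming a placeholder-replacement scheme in which each image placeholder token in the prompt is overwritten by one projected image embedding row. This is one common approach and is not confirmed to match `MergeStrategy` in the tree; `merge_image_embeddings` and `IMAGE_TOKEN_ID` are hypothetical names chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]

IMAGE_TOKEN_ID = 99  # hypothetical placeholder id, not taken from any real tokenizer


def merge_image_embeddings(
    tokens: List[int],
    text_embeds: Matrix,
    image_embeds: Matrix,
    image_token_id: int = IMAGE_TOKEN_ID,
) -> Matrix:
    """Return a [seq, hidden] sequence with placeholder rows replaced by image rows."""
    if len(tokens) != len(text_embeds):
        raise ValueError("tokens and text_embeds must have the same sequence length")
    slots = [i for i, t in enumerate(tokens) if t == image_token_id]
    if len(slots) != len(image_embeds):
        raise ValueError(
            f"{len(slots)} placeholder tokens but {len(image_embeds)} image rows"
        )
    merged = [list(row) for row in text_embeds]  # copy so the input is untouched
    for slot, row in zip(slots, image_embeds):
        if len(row) != len(merged[slot]):
            raise ValueError("image row width does not match the LM hidden size")
        merged[slot] = list(row)
    return merged


tokens = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2]
text_embeds = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [2.0, 2.0]]
image_embeds = [[7.0, 8.0], [9.0, 10.0]]  # already projected to hidden=2

merged = merge_image_embeddings(tokens, text_embeds, image_embeds)
```

The merged sequence is what an embeddings-seeded prefill would consume. The count and width checks fail fast on a mismatch between the prompt's placeholders and the encoder output instead of seeding a slot with a misaligned sequence.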

Records an explicit decision to defer the implementation to the follow-up epic #566, per the issue acceptance criteria, with a staged plan (foundation, serve-path admit, encoder execution, reference VLM, breadth, audio modality).
@inureyes inureyes added the type:docs, priority:low, area:architecture, area:docs, and status:review labels Jul 1, 2026
@inureyes inureyes merged commit 12a1689 into main Jul 1, 2026
5 checks passed
@inureyes inureyes deleted the feature/issue-503-multimodal-vlm-design branch July 1, 2026 05:17

Labels

area:architecture (Architecture and code structure changes), area:docs (User and developer documentation), priority:low (Low priority), status:review (Under review), type:docs (Documentation improvements or additions)

Development

Successfully merging this pull request may close these issues.

design: scope the multimodal / VLM track (Qwen2-VL/2.5-VL/3-VL, Gemma3n, Phi4MM, Molmo, and more)
