docs: scope the OpenXLA multimodal / VLM track (#503) by inureyes · Pull Request #567 · lablup/mlxcel

inureyes · 2026-07-01T05:16:28Z

Summary

Scoping design for Window E of epic #493: the multimodal / VLM (and audio-language) architecture track on the OpenXLA/IREE backend (Qwen2-VL / 2.5-VL / 3-VL, Gemma3n, Phi4MM, Molmo / Molmo2, Youtu-VL). Design-doc only; no code changes.

What changed

Add spike/openxla/MULTIMODAL_VLM_DESIGN.md, matching the STAGE2_DESIGN.md convention. It grounds the track in the current codebase and covers the four deliverables the issue asks for:
- Encoder execution: host-encode-first (reuse the MLX vision stack VisionModule / prepare_and_compute_vlm_embeddings as a preprocessor) to unblock, then an on-XLA ViT / SigLIP encoder graph built on the shared attention core as the parity goal.
- Projector handling: the linear / MLP / pool connector, host-side or as a small emitted graph.
- Embedding injection into the LM token stream: the load-bearing emitter change. Today the graph accepts only token: i32 and gathers params['embed'] internally, so a new prefill-from-embeddings graph entry (taking a [seq, hidden] tensor and skipping the gather) is the prerequisite for every VLM, with a text-only equivalence anchor as its first test.
- Serve-path changes: replace the multimodal rejection in src/server/batch/xla_worker.rs with a decode + encode + merge + embeddings-seeded prefill admit.
Also covers the M-RoPE (Qwen2-VL) and Gemma3 bidirectional image-mask subtleties, reusing the hooks from #494 (refactor: share the full per-layer attention core across all emitter graph kinds) and #495 (feat: implement Gemma2 sliding-window attention (local/global alternation)), and the audio modality (Phi4MM, Gemma3n).
Records an explicit decision to DEFER the implementation, with a staged plan.
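
The text-only equivalence anchor named under embedding injection can be sketched as follows. This is a toy illustration under stated assumptions, not the emitter's code: `gather_embeddings`, `lm_body`, `prefill_from_tokens`, and `prefill_from_embeddings` are hypothetical names, and `lm_body` is a stand-in reduction rather than a transformer. The point is only that a token-id entry and an embeddings entry sharing the same body must agree when the embeddings are the gathered rows for those token ids.

```rust
/// A [seq, hidden] embedding matrix, stored row-major as nested vectors.
type Embeddings = Vec<Vec<f32>>;

/// Gather one embedding-table row per token id. This mirrors what the
/// token-only graph is described as doing internally with params['embed'].
fn gather_embeddings(embed: &[Vec<f32>], tokens: &[i32]) -> Embeddings {
    tokens.iter().map(|&t| embed[t as usize].clone()).collect()
}

/// Hypothetical stand-in for the shared LM body: one value per position.
/// A real graph would run the transformer layers here.
fn lm_body(x: &[Vec<f32>]) -> Vec<f32> {
    x.iter().map(|row| row.iter().sum::<f32>()).collect()
}

/// Existing-style entry: token ids in, gather happens inside.
fn prefill_from_tokens(embed: &[Vec<f32>], tokens: &[i32]) -> Vec<f32> {
    lm_body(&gather_embeddings(embed, tokens))
}

/// Proposed-style entry: a [seq, hidden] tensor in, gather skipped.
fn prefill_from_embeddings(x: &[Vec<f32>]) -> Vec<f32> {
    lm_body(x)
}
```

For a text-only prompt, feeding `gather_embeddings(embed, tokens)` to `prefill_from_embeddings` should reproduce `prefill_from_tokens(embed, tokens)` exactly, which is what makes it a useful first test before any vision input is involved.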

Design decision

Deferred to a dedicated follow-up epic, #566, per the issue acceptance criteria (design doc merged OR recorded deferral with a follow-up epic). Accepting even a single multimodal request needs four pieces: the prefill-from-embeddings graph entry, an engine + C-ABI change, a serve-path multimodal admit, and an encoder execution path. Together these are a foundation-sized effort across the emitter, the runtime shim, the engine, and the serve worker.

Test plan

Documentation-only change; no source, build, or test surface touched.
File references in the doc (emit_prefill, scale_embedding, Args, XlaBatchEngine, IreeRaggedLlama, xla_worker.rs:139, ModelRequest, VisionModule, get_input_embeddings, MergeStrategy, prepare_and_compute_vlm_embeddings) verified against the current tree.
No AI attribution and no em dashes.

Closes #503

Add spike/openxla/MULTIMODAL_VLM_DESIGN.md, the Window E scoping design for epic #493. It grounds the multimodal track in the current OpenXLA/IREE backend and covers the four deliverables: encoder execution (host-encode-first by reusing the MLX vision stack, then an on-XLA ViT/SigLIP graph on the shared attention core), projector handling, embedding injection into the LM token stream, and the serve-path change to accept multimodal requests. The load-bearing change is embedding injection: the text graph accepts only token ids and gathers params['embed'] internally, so a VLM needs a new prefill-from-embeddings graph entry, an engine and C-ABI change to seed a slot from a merged embedding sequence, and a serve-path admit that decodes, encodes, projects, and merges instead of rejecting multimodal input. Records an explicit decision to defer the implementation to the follow-up epic #566, per the issue acceptance criteria, with a staged plan (foundation, serve-path admit, encoder execution, reference VLM, breadth, audio modality).
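
The merge step in the serve-path admit can be illustrated with a minimal sketch, assuming a placeholder-replacement strategy: each image placeholder token in the prompt is replaced, in order, by one projected image embedding row. The function name `merge_embeddings`, the `image_token_id` parameter, and the error strings are hypothetical; the real `MergeStrategy` in the tree may differ per model.

```rust
/// Replace the text embedding at every placeholder position with the next
/// image embedding row, in order. Returns an error if the shapes disagree.
/// Hypothetical helper; assumes one placeholder token per image row.
fn merge_embeddings(
    tokens: &[i32],
    text_embeds: &[Vec<f32>],
    image_embeds: &[Vec<f32>],
    image_token_id: i32,
) -> Result<Vec<Vec<f32>>, String> {
    if tokens.len() != text_embeds.len() {
        return Err(format!(
            "token count {} does not match text embedding rows {}",
            tokens.len(),
            text_embeds.len()
        ));
    }
    let placeholders = tokens.iter().filter(|&&t| t == image_token_id).count();
    if placeholders != image_embeds.len() {
        return Err(format!(
            "{} placeholders but {} image embedding rows",
            placeholders,
            image_embeds.len()
        ));
    }
    // Walk the prompt once, consuming image rows at placeholder positions.
    let mut next_image = image_embeds.iter();
    let merged = tokens
        .iter()
        .zip(text_embeds)
        .map(|(&t, text_row)| {
            if t == image_token_id {
                next_image.next().expect("count checked above").clone()
            } else {
                text_row.clone()
            }
        })
        .collect();
    Ok(merged)
}
```

The merged `[seq, hidden]` result is the kind of tensor an embeddings-seeded prefill would consume; rejecting a count mismatch up front keeps a malformed request from reaching the graph.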

inureyes added the type:docs, priority:low, area:architecture, area:docs, and status:review labels Jul 1, 2026

inureyes merged commit 12a1689 into main Jul 1, 2026
5 checks passed

inureyes deleted the feature/issue-503-multimodal-vlm-design branch July 1, 2026 05:17

inureyes mentioned this pull request Jul 1, 2026

epic: OpenXLA backend architecture-coverage parity with the MLX engine #493

Closed

14 tasks
