Feat:(model) qwen image vae checkpoint #9108
Open
Pfannkuchensack wants to merge 6 commits into invoke-ai:main
Conversation
…pport Add standalone model types so Qwen Image can be run without downloading the full ~40 GB Diffusers pipeline. The VAE and Qwen2.5-VL encoder can now each come from their own model, with the Component Source (Diffusers) acting as a fallback for any submodel not provided separately.
Add a checkpoint loader for ComfyUI-style consolidated Qwen2.5-VL encoder files (e.g. qwen_2.5_vl_7b_fp8_scaled.safetensors), which bundle the language model and visual tower into one safetensors file with FP8 + per-tensor weight_scale quantization. This drops the standalone encoder footprint from ~16 GB (Diffusers folder, FP16) to ~7 GB.
Add three new starter models so users can install a complete GGUF Qwen Image setup in one click without ever touching the full ~40 GB Diffusers pipeline:

- "Qwen Image VAE" — single-file VAE checkpoint pulled from the Qwen-Image repo (~250 MB).
- "Qwen2.5-VL Encoder (fp8 scaled)" — ComfyUI single-file FP8 encoder (~7 GB).
- "Qwen2.5-VL Encoder (Diffusers)" — full-precision encoder via multi-folder HF download (text_encoder + tokenizer + processor, ~16 GB).

The 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and txt2img) now declare the VAE + fp8 encoder as dependencies, so installing any of them automatically pulls in everything needed to generate. The fp8 encoder is preferred as the default dependency since it's smaller and the on-the-fly dequantization is essentially free at runtime. The Qwen Image starter bundle gets the VAE and fp8 encoder prepended so the bundled Lightning LoRA variants also benefit.
Summary
Adds standalone model support for Qwen Image so users no longer need the full ~40 GB Diffusers pipeline. A GGUF transformer can now be combined with a standalone VAE checkpoint, a standalone Qwen2.5-VL encoder (Diffusers folder or ComfyUI single-file fp8), and the Component Source (Diffusers) field becomes a fallback rather than a hard requirement. All standalone components are also exposed as installable starter models, so a fully working GGUF setup can be installed in one click.
Why: The Qwen Image PR (#9000) only allowed loading the VAE and text encoder from the full Diffusers pipeline. That meant ~40 GB on disk just to use a tiny VAE (~250 MB) plus the encoder (~16 GB), and re-downloading both for every model variant. The smallest fully-standalone setup with this PR drops to ~12 GB (GGUF transformer + ~250 MB VAE + ~7 GB ComfyUI fp8 encoder).
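The fallback behaviour for the Component Source field can be sketched as follows. This is a minimal illustration of the resolution order (standalone override → main model, if Diffusers → Component Source); all names here are hypothetical stand-ins, not the PR's actual code, which resolves InvokeAI model config objects rather than strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QwenImageMainModel:
    # Hypothetical stand-in: the real invocation inspects the main model's
    # format (full Diffusers pipeline vs. GGUF single file).
    is_diffusers: bool


def resolve_component(
    override: Optional[str],
    main_model: QwenImageMainModel,
    main_model_id: str,
    component_source: Optional[str],
) -> str:
    """Resolution priority per component (VAE, Qwen2.5-VL encoder):
    standalone override -> main model (if Diffusers) -> Component Source."""
    if override is not None:
        return override
    if main_model.is_diffusers:
        return main_model_id
    if component_source is not None:
        return component_source
    raise ValueError("No source available for this submodel")
```

With a GGUF main model, only an explicit override or the Component Source can supply the submodel, which is why the source becomes a fallback rather than a hard requirement.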
How:
Backend
- `VAE_Checkpoint_QwenImage_Config` detects single-file Qwen Image VAEs via 5D conv weights + `z_dim=16` and loads them via `AutoencoderKLQwenImage` (`init_empty_weights` + `load_state_dict`). The generic VAE checkpoint matcher now explicitly excludes Qwen Image VAEs so they aren't misclassified as FLUX.
- `ModelType.QwenVLEncoder` + `ModelFormat.QwenVLEncoder` with `QwenVLEncoder_Diffusers_Config` recognising directories that contain `text_encoder/` (with `Qwen2_5_VLForConditionalGeneration`/`Qwen2VLForConditionalGeneration`) + `tokenizer/`. The new `QwenVLEncoderLoader` handles `Tokenizer` and `TextEncoder` submodel loading from the folder layout.
- `QwenVLEncoder_Checkpoint_Config` matches consolidated single-file checkpoints (e.g. `qwen_2.5_vl_7b_fp8_scaled.safetensors`) by detecting both LM keys (`model.embed_tokens`/`model.layers.*`) and visual tower keys (`visual.patch_embed.*`/`visual.blocks.*`). The new `QwenVLEncoderCheckpointLoader` loads the safetensors, dequantises ComfyUI fp8 weights via `weight * weight_scale` (with block-wise expansion, mirroring the Z-Image Qwen3 loader), strips `comfy_quant`/`weight_scale`/`scaled_fp8` metadata, fetches the architecture config from `Qwen/Qwen2.5-VL-7B-Instruct` (offline-cache fallback), and instantiates `Qwen2_5_VLForConditionalGeneration` via `init_empty_weights` + assign load. The tokenizer comes from the same HF repo with offline fallback.
- `qwen_image_text_encoder.py` now branches on whether `model_root` is a file. Single-file checkpoints get tokenizer + image processor from HuggingFace (`Qwen/Qwen2.5-VL-7B-Instruct`, ~10 MB, cached); the existing folder layout path is unchanged. BnB-quantised loading falls back to the cached encoder for single-file checkpoints since BnB can't load from a bare safetensors and the file is already FP8.
- `QwenImageModelLoaderInvocation` gains optional `vae_model` and `qwen_vl_encoder_model` fields. Resolution priority for each component: standalone override → main model (if Diffusers) → Component Source.
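The fp8 dequantisation step described above (`weight * weight_scale` with block-wise expansion) can be sketched roughly like this. This is an illustrative reimplementation under the assumption that a block-wise `weight_scale` has one entry per block along each dimension; the actual loader's tensor handling may differ.

```python
import torch


def dequantize_fp8_weight(weight: torch.Tensor, weight_scale: torch.Tensor) -> torch.Tensor:
    """Dequantise a ComfyUI-style quantised tensor: weight * weight_scale.

    weight_scale is either a per-tensor scalar or a block-wise tensor whose
    size along each dimension divides the weight's size; in the block-wise
    case each scale entry is expanded over its block before multiplying.
    (Real checkpoints store the weight in an fp8 dtype; here we just upcast.)
    """
    w = weight.to(torch.float16)
    if weight_scale.numel() == 1:
        return w * weight_scale.to(torch.float16)
    scale = weight_scale.to(torch.float16)
    # Block-wise expansion: repeat each scale entry over its block so the
    # scale tensor matches the weight shape element-wise.
    for dim in range(w.ndim):
        repeats = w.shape[dim] // scale.shape[dim]
        scale = scale.repeat_interleave(repeats, dim=dim)
    return w * scale
```

After dequantisation, the `comfy_quant`/`weight_scale`/`scaled_fp8` bookkeeping entries are no longer needed, which is why the loader strips them before building the model's state dict.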
Bumped to v1.2.0. New starter models: `Qwen Image VAE` (single-file checkpoint, ~250 MB), `Qwen2.5-VL Encoder (fp8 scaled)` (ComfyUI single-file, ~7 GB), and `Qwen2.5-VL Encoder (Diffusers)` (multi-folder HF download, `text_encoder` + `tokenizer` + `processor`, ~16 GB). All 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and txt2img) declare the VAE + fp8 encoder as `dependencies`, so installing any of them auto-installs a complete generation-ready setup. The Qwen Image starter bundle gets the VAE and fp8 encoder prepended too.

Frontend
- `qwenImageVaeModel` and `qwenImageQwenVLEncoderModel`, plus a migration entry.
- `useQwenImageVAEModels`/`useQwenVLEncoderModels` hooks, `isQwenImageVAEModelConfig`/`isQwenVLEncoderModelConfig` type guards, and Model Manager category + format badge entries.
- `schema.ts` patched manually for the new `ModelType`/`ModelFormat` values, the `QwenVLEncoder_Diffusers_Config` and `QwenVLEncoder_Checkpoint_Config` schemas, the new loader fields, and the `AnyModelConfig` union.

Related Issues / Discussions
Follow-up to #9000 (Qwen Image full pipeline support). Closes the standalone-component gap that was called out for users with limited disk space.
QA Instructions
Quickest verification (recommended):
Install one of the GGUF starter models (e.g. `Qwen Image Edit 2511 (Q4_K_M)`) from the starter list. The VAE and fp8 encoder should be auto-installed as dependencies, and the model should generate without any further configuration.

Setup options for manual testing:
- `Qwen Image VAE` from the starter list (or download `vae/diffusion_pytorch_model.safetensors` from a Qwen Image HF repo manually, ~250 MB). Verify it's identified as a Qwen Image VAE checkpoint.
- `Qwen2.5-VL Encoder (Diffusers)` from the starter list (or download `text_encoder/` + `tokenizer/` (+ optionally `processor/`) from `Qwen/Qwen-Image-Edit-2511` manually). Verify it's identified as `qwen_vl_encoder`/`qwen_vl_encoder`.
- `Qwen2.5-VL Encoder (fp8 scaled)` from the starter list (or `qwen_2.5_vl_7b_fp8_scaled.safetensors` directly, ~7 GB). Verify it's identified as `qwen_vl_encoder`/`checkpoint`. First generation will fetch the tokenizer + processor configs from `Qwen/Qwen2.5-VL-7B-Instruct` (~10 MB) and cache them.

Cases to verify on the Qwen Image generation tab:
- BnB-quantised loading (`int8`/`nf4`) still works against a standalone encoder folder. Single-file encoder + `int8`/`nf4` falls back to the cached non-BnB path (still works, no error).

Starter model checks:
- `Qwen Image VAE`, `Qwen2.5-VL Encoder (fp8 scaled)`, `Qwen2.5-VL Encoder (Diffusers)`.

Automated checks:
- `pytest tests/app/invocations/test_qwen_image_model_loader.py tests/backend/model_manager/configs/` — 16 passed.
- `pytest -k "qwen_image"` (excluding unrelated PIL `get_flattened_data` test) — 53 passed.
- `pnpm lint:tsc` / `pnpm lint:eslint` / `pnpm lint:prettier` / `pnpm lint:knip` all green.

Merge Plan
Standard merge.
Checklist
What's New copy (if doing a release after this PR)