feat: ernie image/turbo #9115

Draft
Pfannkuchensack wants to merge 9 commits into invoke-ai:main from Pfannkuchensack:feature/ernie-image

Conversation

@Pfannkuchensack
Collaborator

Summary

Adds Baidu ERNIE-Image and ERNIE-Image-Turbo (HF, HF Turbo) as a new BaseModelType, mirroring the FLUX.2 Klein and Z-Image integration patterns.

The two checkpoints share the same ErnieImageTransformer2DModel architecture (3072 hidden, 24 layers, 24 heads, Mistral3 text encoder, AutoencoderKLFlux2 VAE) and only differ in inference defaults (50 steps + CFG 4.0 vs. 8 steps + CFG 1.0 for Turbo), so they live under one BaseModelType.ErnieImage without a variant enum. The optional 3B Mistral3-based prompt enhancer that ships with the pipeline is wired through with a UI toggle.
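To make the shared-architecture / different-defaults split concrete, the distinction boils down to something like the following (an illustrative sketch only; the dict and function names are hypothetical, not Invoke's actual config fields):

```python
# Illustrative only: both checkpoints share ErnieImageTransformer2DModel and
# differ solely in their default inference settings. Key names are hypothetical.
ERNIE_IMAGE_DEFAULTS = {
    "baidu/ERNIE-Image": {"steps": 50, "cfg_scale": 4.0},
    "baidu/ERNIE-Image-Turbo": {"steps": 8, "cfg_scale": 1.0},
}


def default_settings(model_name: str) -> dict:
    """Look up per-checkpoint defaults; fall back to the non-Turbo values."""
    return ERNIE_IMAGE_DEFAULTS.get(model_name, ERNIE_IMAGE_DEFAULTS["baidu/ERNIE-Image"])
```

Since the architecture is identical, this is why a single BaseModelType.ErnieImage without a variant enum is sufficient.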

This PR builds on top of #8859 (transformers 5.1+) and additionally bumps diffusers 0.36.0 → 0.38.0, which is the first release containing ErnieImagePipeline and ErnieImageTransformer2DModel.

Backend changes

  • BaseModelType.ErnieImage, ModelType.PromptEnhancer, plus two new SubModelTypes (pe, pe_tokenizer) for the bundled prompt enhancer.
  • Main_Diffusers_ErnieImage_Config and Main_Checkpoint_ErnieImage_Config with state-dict-based detection (x_embedder + text_proj + adaLN_modulation).
  • A diffusers loader that loads transformer / vae / text_encoder / tokenizer plus optional pe / pe_tokenizer from a single pipeline directory.
  • invokeai/backend/ernie_image/ with sampling utilities (2×2 patchify, BN normalize/denormalize, sigma schedule, padded text packing) and a rectified-flow denoise loop supporting Euler / Heun / LCM (reusing the FlowMatch* schedulers from FLUX).
  • Five new invocations: ernie_image_model_loader, ernie_image_text_encoder (with prompt-enhancer toggle), ernie_image_denoise, ernie_image_vae_encode, ernie_image_vae_decode.
  • ErnieImageConditioningInfo + conditioning field/output, including pickle allowlist for the disk serializer.
  • Generation modes ernie_image_{txt2img,img2img,inpaint,outpaint}.
  • Starter models for baidu/ERNIE-Image and baidu/ERNIE-Image-Turbo, plus a STARTER_BUNDLES entry.
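As a rough sketch of the 2×2 patchify/unpatchify layout transform in the sampling utilities (shown with NumPy for brevity; the real helpers operate on torch tensors, and these function names are assumed):

```python
import numpy as np


def patchify_2x2(latents: np.ndarray) -> np.ndarray:
    """Fold each 2x2 spatial block into the channel dim: (B, C, H, W) -> (B, 4C, H/2, W/2)."""
    b, c, h, w = latents.shape
    x = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    return x.transpose(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)


def unpatchify_2x2(patched: np.ndarray) -> np.ndarray:
    """Inverse of patchify_2x2: (B, 4C, H/2, W/2) -> (B, C, H, W)."""
    b, c4, h2, w2 = patched.shape
    c = c4 // 4
    x = patched.reshape(b, c, 2, 2, h2, w2)
    return x.transpose(0, 1, 4, 2, 5, 3).reshape(b, c, h2 * 2, w2 * 2)
```

The round trip is lossless, which is what the manually sanity-checked "patchify roundtrip" in the checklist refers to.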

Frontend changes

  • services/api/schema.ts regenerated to expose the new node types.
  • Type unions extended (ImageOutput, LatentToImage, ImageToLatents, DenoiseLatents, MainModelLoaderNodes).
  • ParamsState gains ernieImageScheduler and ernieImageUsePromptEnhancer, with reducers, selectors, and selectIsErnieImage.
  • buildErnieImageGraph (txt2img / img2img / inpaint / outpaint) wired into useEnqueueCanvas.
  • ERNIE entries added to MODEL_BASE_TO_{COLOR,LONG_NAME,SHORT_NAME}; prompt_enhancer added to MODEL_TYPE_TO_LONG_NAME.
  • ParamErnieImageScheduler and ParamErnieImagePromptEnhancer rendered conditionally in GenerationSettingsAccordion.
  • All add{TextTo,ImageTo,Inpaint,Outpaint}Image rectified-flow type guards extended to accept ernie_image_denoise; isMainModelWithoutUnet likewise.

Diffusers 0.38 fixes

  • hotfixes.py: import LoRACompatibleConv directly. The lazy module loader in 0.38 no longer exposes diffusers.models.lora as an attribute, so the legacy patch path crashes on import.
  • pyproject.toml declares prerelease = "allow" under [tool.uv], with a comment explaining that diffusers 0.38.0 itself hard-pins safetensors>=0.8.0-rc.0. We can drop this again once a diffusers patch release ships with a stable safetensors floor.
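The defensive import pattern is roughly this (a sketch, not the exact hotfixes.py code; the fallback behaviour is an assumption):

```python
import importlib


def load_lora_compatible_conv():
    """Import LoRACompatibleConv via its module path directly.

    diffusers 0.38's lazy module __getattr__ no longer exposes `models.lora`
    as an attribute of `diffusers.models`, so the submodule must be imported
    explicitly rather than looked up via attribute access.
    """
    try:
        lora_module = importlib.import_module("diffusers.models.lora")
        return getattr(lora_module, "LoRACompatibleConv", None)
    except ImportError:
        # diffusers not installed, or the module was removed entirely.
        return None
```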

Related Issues / Discussions

QA Instructions

Automated

  • pytest tests/ -m "not slow" — passes (619 / 619, with the same tests/model_identification LFS skip as on main).
  • pnpm lint:tsc, pnpm lint:eslint, pnpm lint:prettier — all clean.
  • pnpm test:no-watch — 563 / 563 frontend tests pass.

Manual smoketest of pre-existing models (regression for the diffusers / transformers bump)

Pick at least two of these and run a single txt2img each. Confirm no crashes and visually-plausible output:

  • SDXL
  • FLUX.1 Dev
  • FLUX.2 Klein
  • SD3
  • Z-Image-Turbo

Plus the relevant items from #8859's test plan: SD 1.5 prompt-weighted generation (compel path), FLUX text-to-image (T5 tokenizer path), HF model install via repo ID, NSFW checker first-time download.

Manual smoketest of ERNIE-Image (requires a GPU — 8B parameters)

  1. Install baidu/ERNIE-Image-Turbo from the new starter bundle. The Model Manager should classify it as BaseModelType.ErnieImage.
  2. Pick the model on the Generate tab. Confirm the scheduler dropdown and Prompt Enhancer toggle appear in the Generation accordion.
  3. txt2img at 1024×1024, 8 steps, CFG 1.0 — should produce an image.
  4. Toggle the prompt enhancer on with a short prompt (e.g. "a fox"); the enhancer log line in the backend should show a rewritten longer prompt before encoding.
  5. Repeat with baidu/ERNIE-Image (50 steps, CFG 4.0).
  6. img2img / inpaint / outpaint — one run each on the canvas tab.

Merge Plan

  • This PR depends on Update to transformers 5.1.0 #8859 being merged first (or the two being merged together, since they share the transformers>=5.1.0 override). I have no preference; happy to rebase whenever #8859 lands.
  • After this PR merges, a follow-up release should call out the diffusers and transformers major bumps in the changelog. The ERNIE-Image starter bundle is gated behind those bumps.
  • The prerelease = "allow" line in pyproject.toml is a temporary measure tied to diffusers 0.38.0's safetensors>=0.8.0-rc.0 upstream pin. Worth revisiting (and removing) once a diffusers patch release relaxes that requirement.
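For reference, the temporary override in pyproject.toml looks roughly like this (a sketch; the actual comment wording in the PR may differ):

```toml
[tool.uv]
# Temporary: diffusers 0.38.0 hard-pins safetensors>=0.8.0-rc.0, which is a
# prerelease. Drop this once a diffusers patch release ships a stable floor.
prerelease = "allow"
```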

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable) — backend regression suite covers the new code paths via existing model-config and loader-registry tests; no new dedicated unit tests for the ERNIE sampling helpers (the patchify roundtrip was sanity-checked manually).
  • ❗Changes to a redux slice have a corresponding migration — N/A: only additive fields with defaults in paramsSlice.
  • Documentation added / updated (if applicable) — none yet; the docs-old/contributing/NEW_MODEL_INTEGRATION.md checklist describes the integration shape this PR follows.
  • Updated What's New copy (if doing a release after this PR)

Out of scope (planned follow-ups)

  • ControlNet, IP-Adapter, and LoRA support for ERNIE-Image.
  • Single-file checkpoint loading (the Main_Checkpoint_ErnieImage_Config is in place as defensive scaffolding, but the loader currently raises NotImplementedError for that format).
  • Metadata recall handlers in the gallery side panel for ERNIE-specific parameters.

Your Name and others added 4 commits February 6, 2026 19:58
Adds Baidu ERNIE-Image and ERNIE-Image-Turbo as a new BaseModelType,
mirroring the FLUX.2 / Z-Image integration pattern. Both models share
the ErnieImageTransformer2DModel architecture (3072 hidden, 24 layers,
24 heads) and an AutoencoderKLFlux2 VAE; they differ only in default
inference settings (50 steps + CFG 4.0 vs 8 steps + CFG 1.0 for Turbo).

Built on top of PR invoke-ai#8859 (transformers 5.1+) and additionally bumps
diffusers 0.36.0 -> 0.38.0, which is the first release containing the
ErnieImagePipeline and ErnieImageTransformer2DModel.

Backend
- BaseModelType.ErnieImage, ModelType.PromptEnhancer, two new
  SubModelTypes (pe, pe_tokenizer) for the bundled prompt enhancer
- Main_Diffusers_ErnieImage_Config + Main_Checkpoint_ErnieImage_Config
  with state-dict-based detection (x_embedder + text_proj + adaLN_modulation)
- Diffusers loader registered for ERNIE-Image; uses upstream subdir
  conventions, loads transformer / vae / text_encoder / tokenizer plus
  optional pe / pe_tokenizer
- New invokeai/backend/ernie_image/ with sampling utilities (2x2 patchify,
  BN normalize/denormalize, sigma schedule, padded text packing) and a
  rectified-flow denoise loop supporting Euler/Heun/LCM
- Five invocations: model_loader, text_encoder (with prompt-enhancer
  toggle), denoise, vae_encode, vae_decode
- ErnieImageConditioningInfo + ConditioningField/Output + pickle allowlist
- ERNIE_IMAGE_SCHEDULER_MAP reusing the FlowMatch* scheduler classes
- New generation modes ernie_image_{txt2img,img2img,inpaint,outpaint}
- Starter models for baidu/ERNIE-Image and baidu/ERNIE-Image-Turbo
  + STARTER_BUNDLES entry

Frontend
- Regenerated services/api/schema.ts to expose the new node types
- Type unions extended (ImageOutput / LatentToImage / ImageToLatents /
  DenoiseLatents / MainModelLoaderNodes)
- ParamsState gains ernieImageScheduler + ernieImageUsePromptEnhancer,
  with reducers, selectors, and selectIsErnieImage
- buildErnieImageGraph (txt2img/img2img/inpaint/outpaint) wired into
  useEnqueueCanvas
- ERNIE entries added to MODEL_BASE_TO_{COLOR,LONG_NAME,SHORT_NAME}
  and prompt_enhancer to MODEL_TYPE_TO_LONG_NAME
- ParamErnieImageScheduler and ParamErnieImagePromptEnhancer rendered
  conditionally in GenerationSettingsAccordion
- All add{TextTo,ImageTo,Inpaint,Outpaint}Image type guards extended
  to accept ernie_image_denoise; isMainModelWithoutUnet ditto

Diffusers 0.38 fixes
- hotfixes.py: import LoRACompatibleConv directly; the lazy-module
  __getattr__ no longer exposes diffusers.models.lora as an attribute

Verification
- pytest tests/ -m "not slow": 619 passed, 0 failed
- pnpm lint:tsc / lint:eslint / lint:prettier: clean
- pnpm test:no-watch: 563 passed, 0 failed
- Manual smoketest pending: requires baidu/ERNIE-Image weights and a
  GPU (8B parameters; CPU not practical)

Out of scope (follow-up phases)
- ControlNet, IP-Adapter, LoRA support for ERNIE-Image
- Single-file checkpoint loading (defensive scaffolding only)
- Metadata recall handlers in the gallery side panel
@github-actions github-actions Bot added api python PRs that change python files Root invocations PRs that change invocations backend PRs that change backend files services PRs that change app services frontend PRs that change frontend files python-deps PRs that change python dependencies labels May 3, 2026
…nd UI cleanup

- Pass timesteps in [0, num_train_timesteps] to the transformer instead of
  [0, 1]; the diffusers Timesteps embedding expects the unnormalised range,
  which produced mosaic-pattern garbage instead of an image.
- Unpatchify predicted-x0 before the denoise step callback and route through
  sd_step_callback so the canvas shows a live preview during sampling
  (uses FLUX.2's RGB factors -- same AutoencoderKLFlux2 / 32 latent channels).
- Add the ernie-image case to useEnqueueGenerate (Generate tab); was only
  wired up in useEnqueueCanvas, so plain text-to-image failed with
  "No graph builders for base ernie-image".
- Move the Prompt Enhancer toggle from the Generation accordion into a
  dedicated ERNIE-Image block in the Advanced accordion; hide the rest of
  the SD-style advanced controls (CLIP skip, CFG rescale, seamless, color
  comp., separate VAE) since none apply to ERNIE-Image.
- Detect ERNIE-Image-Turbo by name in MainModelDefaultSettings.from_base
  so installs (starter or manual) get steps=8, cfg_scale=1.0 instead of
  the standard 50/4.0.
- Pin compel to Cstannahill/compel5@chore/transformers5-diffusers-smoke
  for transformers>=5 compatibility (PR damian0815/compel#129).
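The timestep-range fix in the first bullet is essentially this rescaling (a sketch; 1000 is the common diffusers num_train_timesteps default, assumed here rather than confirmed for ERNIE-Image):

```python
def to_transformer_timesteps(sigmas: list[float], num_train_timesteps: int = 1000) -> list[float]:
    """Rescale rectified-flow sigmas from [0, 1] to the unnormalised range
    the diffusers Timesteps embedding expects. Feeding raw [0, 1] values
    collapses the sinusoidal embedding's frequency range, which is what
    produced the mosaic-pattern garbage instead of an image."""
    return [s * num_train_timesteps for s in sigmas]
```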