Replies: 4 comments 4 replies
-
You can use … Also, loading is technically already sequential, which makes it slower than it could be when the models and text encoders are on different drives.
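A minimal sketch of the parallel-loading idea mentioned above. This is not sd-cli code (sd-cli is C++ and, as noted, loads sequentially); the file paths, and the assumption that they sit on separate drives, are purely illustrative.

```python
# Sketch: overlapping two file loads so reads from different drives can
# stream concurrently instead of one after the other.
from concurrent.futures import ThreadPoolExecutor


def load_file(path):
    # Plain blocking read; with threads, two of these can overlap because
    # the GIL is released during file I/O.
    with open(path, "rb") as f:
        return f.read()


def load_parallel(paths):
    # map() preserves input order, so results line up with the paths given.
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return list(pool.map(load_file, paths))
```

The win only materializes when the files live on independent devices; on a single drive, concurrent reads mostly just interleave seeks.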
-
Some info from a 9B run on a 16 GB ThinkPad TP495 (Ryzen 3500U, Vega 8 iGPU) (!). In the early stage, sd-cli consumes a bit over 10 GB when it really only needs 6 GB to do the job. That does seem like the wrong approach for Flux 2 Klein. I have to say, it's friggin awesome to get flux-2-klein 4GB gens done in under 5 minutes, and 9B in around 20 minutes. The CPU is idle most of the time, meaning that if I can fix the unneeded memory consumption, I can leave sd-cli running in batch mode all day. That's fantastic! Thanks to leejet and all contributors!
-
This could be improved, but...
... not in this way, because we already do that: sd-cli unloads the conditioner weights before generation. The problem is the other way around: the text encoder runs with the diffusion weights already loaded into VRAM, and that is what typically causes the peak VRAM usage.
-
I super-appreciate all your contributions, wbruna, ty. Maybe we'll be able to defer loading of the diffusion weights until after the text encoder is finished. Would doing it in that order break other models? Right now, with just sd-cli, Xorg, and a browser running, I see: (output not shown). That's during this stage: (output not shown). Then once it gets to: (output not shown), the laptop drops to around 6.5 GB 'used'. Why zram is using 'swap' there, I don't know; things seem to get sticky when allocated into 'swap'. If we could avoid having the Qwen TE and Flux loaded at the same time, I could have a fully usable laptop while it does image generation in the background.
-
For Flux 2 generation on memory-constrained hardware:
Why load the text encoder (~4 GB) and the diffusion model (~4-8 GB) at the same time?
We don't need the TE loaded at all during generation. Just cache the embedding (preferably to disk, for fast re-gens), unload the text encoder, then load the Flux 2 model of your choice. It runs fine in 6 GB of VRAM.
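A minimal sketch of the disk-caching idea, assuming a generic `encode_fn` stands in for whatever loads and runs the text encoder; `cached_embedding` and the cache layout are hypothetical names, not sd-cli features.

```python
# Sketch: cache prompt embeddings on disk, keyed by a hash of the prompt,
# so re-gens of the same prompt never need the text encoder loaded.
import hashlib
import os
import pickle

CACHE_DIR = "embed_cache"  # assumed location, adjust as needed


def cached_embedding(prompt, encode_fn):
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):
        # Cache hit: the text encoder never has to be loaded at all.
        with open(path, "rb") as f:
            return pickle.load(f)
    # Cache miss: run the encoder once, persist the result, and from here
    # on the TE could be unloaded before the diffusion weights come in.
    emb = encode_fn(prompt)
    with open(path, "wb") as f:
        pickle.dump(emb, f)
    return emb
```

The point of keying by prompt hash is exactly the "fast re-gens" case above: repeated runs with the same prompt skip the encoder entirely, so TE and diffusion weights never need to coexist in memory.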