feat: optional sequential component loading (--sequential-load) by RapidMark · Pull Request #1628 · leejet/stable-diffusion.cpp

RapidMark · 2026-06-10T14:41:52Z

feat: optional sequential component loading to cut peak device memory

Summary

Adds an opt-in sequential load path: instead of allocating and loading the
conditioner, diffusion model, and VAE all up front, load the conditioner
first, run it, free it, and only then allocate and load the diffusion model.

This lowers peak device memory from roughly

sum(conditioner, diffusion, VAE)   ->   ~max(conditioner, diffusion + VAE)
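
As a rough illustration of that estimate, the sketch below computes both peaks from per-component sizes. The struct, the function names, and any sizes passed in are hypothetical and chosen only for illustration; they are not part of the patch. The sequential estimate follows the formula exactly as stated above. If the VAE is also resident while the conditioner runs, the first term would be closer to conditioner + VAE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-component sizes in MiB (illustrative, not measured).
struct ComponentSizes {
    std::uint64_t conditioner;
    std::uint64_t diffusion;
    std::uint64_t vae;
};

// Default path: all three components are resident at the same time.
inline std::uint64_t peak_default(const ComponentSizes& s) {
    return s.conditioner + s.diffusion + s.vae;
}

// Sequential path, per the estimate above: the conditioner is freed
// before the diffusion model is allocated, so the two never overlap.
inline std::uint64_t peak_sequential(const ComponentSizes& s) {
    return std::max(s.conditioner, s.diffusion + s.vae);
}

// Estimated saving; never negative because max(a, b + c) <= a + b + c.
inline std::uint64_t peak_savings(const ComponentSizes& s) {
    assert(peak_sequential(s) <= peak_default(s));
    return peak_default(s) - peak_sequential(s);
}
```

For example, with made-up sizes of 8000, 16000 and 1000 MiB, `peak_default` returns 25000 and `peak_sequential` returns 17000, so the estimated saving equals the conditioner size whenever the conditioner is the smaller side of the `max`.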

For pipelines with a large text encoder this is significant. On LTX-2 (Gemma-3
12B text encoder + a ~16.5 GB DiT + video VAE), it drops peak device memory from
~30 GB to ~18 GB, which lets the fast "text-encoder-on-GPU" path fit on cards
that otherwise can't hold all three modules at once — without falling back to
running the text encoder on CPU.

Default off; no behavior change unless explicitly enabled.

Motivation

On a memory-constrained GPU the choice today is binary:

Run the text encoder on the GPU → fast, but requires all modules resident
simultaneously (often won't fit), or
Run the text encoder on CPU (--clip-on-cpu / backend override) → fits, but
the text-encode step is dramatically slower.

Measured on a Strix Halo 8060S iGPU (Vulkan, LTX-2, 768×512 / 25 frames / 8
steps), the conditioning (text-encode) stage alone:

Text encoder placement	Conditioning time
CPU	65.4 s
GPU	7.2 s

That ~58 s delta is paid on every generation. Sequential loading makes the
GPU path reachable on hardware that can hold max(encoder, diffusion+VAE) but
not their sum — you free the encoder before the diffusion model is resident, so
its memory window doesn't overlap the diffusion model's.

What it does

At load time, when sequential load is enabled and there is a single diffusion
model, the diffusion-model param buffer is not allocated and its tensors
are excluded from the initial tensor load. The model loader is retained.
The conditioner (and VAE) load normally and the conditioner runs.
After conditioning — and after the existing "free params immediately" path
frees the conditioner — the diffusion model is allocated and loaded on
demand, just before the first denoise step.

The deferred load is wired in at the top of the common sample() path, so it
covers txt2img / img2img / video uniformly and is a no-op when not enabled.
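
The load order described above can be modelled with a small toy tracker. This is a sketch only: `ToyPipeline` and all of its members are hypothetical names, no real buffers are allocated, and the sizes are abstract units. It is meant to show the gating condition (flag on, single diffusion model) and the lazy load at the top of `sample()`, not to mirror the actual code in stable-diffusion.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy model of the load order. All names are hypothetical.
struct ToyPipeline {
    bool sequential_load  = false;  // stands in for --sequential-load
    bool has_second_model = false;  // high-noise/refiner model present
    std::uint64_t cond_size = 0, diff_size = 0, vae_size = 0;

    std::uint64_t resident = 0;     // units currently allocated
    std::uint64_t peak     = 0;     // high-water mark of `resident`
    bool conditioner_loaded = false;
    bool diffusion_loaded   = false;

    void alloc(std::uint64_t n) {
        resident += n;
        peak = std::max(peak, resident);
    }
    void release(std::uint64_t n) {
        assert(n <= resident);
        resident -= n;
    }

    // Deferral only engages with the flag on and a single diffusion model.
    bool defer_diffusion() const { return sequential_load && !has_second_model; }

    // Initial load: conditioner and VAE always, diffusion model unless deferred.
    void load() {
        alloc(cond_size);
        conditioner_loaded = true;
        alloc(vae_size);
        if (!defer_diffusion()) {
            alloc(diff_size);
            diffusion_loaded = true;
        }
    }

    // Conditioning would run here; afterwards the conditioner is freed.
    void condition_and_free() {
        assert(conditioner_loaded);
        release(cond_size);
        conditioner_loaded = false;
    }

    // Top of the common sample path: load the diffusion model on demand.
    // This is a no-op when the model is already resident.
    void sample() {
        if (!diffusion_loaded) {
            alloc(diff_size);
            diffusion_loaded = true;
        }
        // Denoise steps would run here.
    }
};
```

With illustrative sizes of 10, 15 and 3 units, calling `load()`, `condition_and_free()` and `sample()` in order gives a peak of 28 on the default path and 18 with `sequential_load` set. Setting `has_second_model` as well brings the peak back to 28, since the deferral is skipped.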

Interface

--sequential-load (default off), with a matching sd_ctx_params field. It
only engages when there is a single diffusion model (skipped when a
high-noise/refiner model is also present, to keep this change small).

Scope / limitations

Single diffusion model only for now (no high-noise/refiner second model).
Backend-agnostic — implemented entirely in stable-diffusion.cpp using
existing alloc_params_buffer() / ModelLoader::load_tensors(); no backend
patches, works on CPU/Vulkan/CUDA alike.
The deferred load re-reads the diffusion-model file once after conditioning
(the conditioner/VAE storages are ignored on that pass), costing a few seconds
of load time that previously happened up front. On hardware where everything
already fits, this is pure overhead with no benefit — the win is specifically
for memory-constrained devices.
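
The PR does not spell out how the conditioner/VAE tensors are ignored on the second pass. One plausible way to picture it is a name-based filter, sketched below. The helper `select_by_prefix` and the prefixes in the usage note are assumptions for illustration, not the loader's actual interface.

```cpp
#include <string>
#include <vector>

// Returns the tensor names that start with `prefix`, preserving order.
// Hypothetical helper; the real loader's filtering may work differently.
inline std::vector<std::string> select_by_prefix(
        const std::vector<std::string>& names,
        const std::string& prefix) {
    std::vector<std::string> out;
    for (const auto& name : names) {
        // compare() on the leading prefix.size() characters; a name
        // shorter than the prefix compares unequal and is skipped.
        if (name.compare(0, prefix.size(), prefix) == 0) {
            out.push_back(name);
        }
    }
    return out;
}
```

On the deferred pass, a filter like this would be called with an assumed diffusion-model prefix such as `"model.diffusion_model."`, so that names under other assumed prefixes (for example `"cond_stage_model."` or `"first_stage_model."`) are skipped rather than loaded again.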

Validation

Correctness (Strix Halo 8060S, Vulkan, LTX-2, seed-fixed):

Output is bit-identical to the non-sequential path at the same seed
(max |Δpixel| = 0 across frames). The deferred diffusion-model load does not
alter results.
With the flag off, output is bit-identical to that of the pre-change binary,
so the default path is unchanged.
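
A check of the max |Δpixel| = 0 kind can be sketched as below, assuming both outputs have been decoded to 8-bit pixel buffers of equal length. The function name and the buffer layout are assumptions; the PR does not say which tool was used for the comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Largest absolute per-byte difference between two equally sized
// 8-bit pixel buffers. A result of 0 means the buffers are identical.
inline int max_abs_pixel_delta(const std::vector<std::uint8_t>& a,
                               const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size());
    int max_delta = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Widen to int before subtracting to avoid unsigned wrap-around.
        const int d = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
        if (d > max_delta) {
            max_delta = d;
        }
    }
    return max_delta;
}
```

For video output the same function would be applied frame by frame, and the run counts as bit-identical only if every frame yields 0.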

Peak device memory (same config):

Load/conditioning phase: ~13.7 GB (diffusion model deferred), vs ~30 GB on the default path.
Sampling phase (conditioner freed): ~18 GB, vs ~30 GB on the default path.

Cross-arch (RDNA2 RX 6700 XT, 12 GB, Vulkan, no matrix cores; Flux Schnell Q4, 512², seed 42):

Default: peak 9.78 GB, 16.5 s wall.
--sequential-load: deferred 776 diffusion-model tensors; first-load 3.3 GB
(DiT deferred), DiT allocated + loaded 3.19 s after the conditioner freed;
peak 7.22 GB, 15.1 s wall.
Output sha256 bit-identical to default; no perf regression.
Composes cleanly with the existing VAE auto-tiling fallback on the same card.

So the peak-memory reduction (~2.5 GB here; ~12 GB on LTX-2) and bit-identical
output are confirmed on a second GPU architecture (RDNA2, no matrix cores), not
just the iGPU it was developed on.

How to test

# Baseline (encoder on CPU): slow conditioning, low peak memory
sd ... --clip-on-cpu

# Sequential load (encoder on GPU, freed before diffusion model loads):
sd ... --sequential-load

Compare conditioning time, peak device memory, and output (identical at a fixed
seed).

Related work

#1470 also defers diffusion-model loading, but as one piece of a larger,
CUDA-focused multi-GPU autofit (it ships a ggml-cuda.cu patch and a row/tensor
split mechanism). This PR is the minimal, backend-agnostic slice of that idea: a
single-device sequential load that needs no backend patch and benefits every
backend today. Offered standalone so the peak-memory win isn't gated on the
larger multi-GPU work.

Load the conditioner, run it, free it, then allocate and load the diffusion model -- instead of holding all components resident at once. Lowers peak device memory from ~sum(conditioner, diffusion, VAE) to ~max(conditioner, diffusion + VAE), so the fast "text encoder on GPU" path fits memory-constrained cards that otherwise cannot hold all three simultaneously.

Opt-in via --sequential-load (default off; no behavior change otherwise). Single diffusion model only (skipped when a high-noise/refiner model is also present). Backend-agnostic -- implemented in stable-diffusion.cpp using the existing alloc_params_buffer() / ModelLoader::load_tensors(), with no backend patches.

Validated on Strix Halo 8060S (Vulkan, LTX-2) and RX 6700 XT (RDNA2, 12 GB): bit-identical output to the default path at a fixed seed, and the flag-off path is byte-identical to before. Peak device memory on RDNA2 (Flux Schnell Q4, 512^2) drops 9.78 -> 7.22 GB with no perf regression.

RapidMark force-pushed the cloudhands/sequential-load branch from 2805e92 to bc46158 Compare June 10, 2026 14:44

RapidMark force-pushed the cloudhands/sequential-load branch from bc46158 to 82c7c1e Compare June 10, 2026 14:51

RapidMark wants to merge 1 commit into leejet:master from CloudhandsAI:cloudhands/sequential-load
