feat: optional sequential component loading (--sequential-load) #1628
Open
RapidMark wants to merge 1 commit into
Load the conditioner, run it, free it, then allocate and load the diffusion model -- instead of holding all components resident at once. Lowers peak device memory from ~sum(conditioner, diffusion, VAE) to ~max(conditioner, diffusion + VAE), so the fast "text encoder on GPU" path fits memory-constrained cards that otherwise cannot hold all three simultaneously. Opt-in via --sequential-load (default off; no behavior change otherwise). Single diffusion model only (skipped when a high-noise/refiner model is also present). Backend-agnostic -- implemented in stable-diffusion.cpp using the existing alloc_params_buffer() / ModelLoader::load_tensors(), with no backend patches. Validated on Strix Halo 8060S (Vulkan, LTX-2) and RX 6700 XT (RDNA2, 12GB): bit-identical output to the default path at a fixed seed, and the flag-off path is byte-identical to before. Peak device memory on RDNA2 (Flux Schnell Q4, 512^2) drops 9.78 -> 7.22 GB with no perf regression.
feat: optional sequential component loading to cut peak device memory
Summary
Adds an opt-in sequential load path: instead of allocating and loading the
conditioner, diffusion model, and VAE all up front, load the conditioner
first, run it, free it, and only then allocate + load the diffusion model.
This lowers peak device memory from roughly sum(conditioner, diffusion, VAE)
to roughly max(conditioner, diffusion + VAE).
For pipelines with a large text encoder this is significant. On LTX-2 (Gemma-3
12B text encoder + a ~16.5 GB DiT + video VAE), it drops peak device memory from
~30 GB to ~18 GB, which lets the fast "text-encoder-on-GPU" path fit on cards
that otherwise can't hold all three modules at once — without falling back to
running the text encoder on CPU.
Default off; no behavior change unless explicitly enabled.
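The sum-versus-max estimate from the commit message can be sketched as a small calculation. This is an illustration only, not code from the PR: `ComponentSizes`, `peak_all_resident`, and `peak_sequential` are hypothetical names, and the estimate ignores compute buffers and allocator overhead.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper type: parameter-buffer sizes in MB for the three modules.
struct ComponentSizes {
    std::uint64_t conditioner;
    std::uint64_t diffusion;
    std::uint64_t vae;
};

// Default path: all three modules are resident at the same time,
// so the peak is roughly their sum.
constexpr std::uint64_t peak_all_resident(const ComponentSizes& s) {
    return s.conditioner + s.diffusion + s.vae;
}

// Sequential path: the conditioner is freed before the diffusion model
// (and VAE) become resident, so the peak is the larger of the two phases.
constexpr std::uint64_t peak_sequential(const ComponentSizes& s) {
    return std::max(s.conditioner, s.diffusion + s.vae);
}
```

With an assumed split of 12000 MB (encoder), 16500 MB (DiT), and 1500 MB (VAE), the two functions give 30000 MB and 18000 MB. Only the ~16.5 GB DiT figure and the ~30 GB / ~18 GB totals come from the description; the encoder and VAE sizes are chosen purely to make the arithmetic visible.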
Motivation
On a memory-constrained GPU the choice today is binary:
text encoder on GPU → all three modules resident simultaneously (often won't fit), or
text encoder on CPU (--clip-on-cpu / backend override) → fits, but the text-encode step is dramatically slower.
Measured on a Strix Halo 8060S iGPU (Vulkan, LTX-2, 768×512 / 25 frames / 8
steps), the conditioning (text-encode) stage alone:
That ~58 s delta is paid on every generation. Sequential loading makes the
GPU path reachable on hardware that can hold
max(encoder, diffusion+VAE) but not their sum: the encoder is freed before the
diffusion model is resident, so its memory window doesn't overlap the diffusion model's.
What it does
With --sequential-load set and a single diffusion model, the diffusion-model
param buffer is not allocated and its tensors are excluded from the initial
tensor load. The model loader is retained.
Once conditioning has run and freed the conditioner, the diffusion model is
allocated and loaded on demand, just before the first denoise step.
The deferred load is wired in at the top of the common sample() path, so it
covers txt2img / img2img / video uniformly and is a no-op when not enabled.
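The deferred-load pattern described above can be sketched with a simulated device. Nothing here is the PR's actual code: `FakeDevice`, `Pipeline`, and `ensure_diffusion_loaded` are hypothetical names, sizes are arbitrary units, the VAE is left out for brevity, and freeing the conditioner after it runs in both modes is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a backend: only tracks simulated memory use.
struct FakeDevice {
    std::size_t in_use = 0;
    std::size_t peak = 0;
    void alloc(std::size_t n) {
        in_use += n;
        if (in_use > peak) peak = in_use;
    }
    void release(std::size_t n) {
        assert(n <= in_use);
        in_use -= n;
    }
};

// Hypothetical pipeline with a conditioner and one diffusion model.
struct Pipeline {
    FakeDevice& dev;
    bool sequential_load;
    std::size_t cond_size;
    std::size_t diff_size;
    bool cond_loaded = false;
    bool diff_loaded = false;

    Pipeline(FakeDevice& d, bool seq, std::size_t cond, std::size_t diff)
        : dev(d), sequential_load(seq), cond_size(cond), diff_size(diff) {
        dev.alloc(cond_size);  // the conditioner is always loaded up front
        cond_loaded = true;
        if (!sequential_load) ensure_diffusion_loaded();  // default path
    }

    // Allocates the diffusion model if it is not resident yet; otherwise a no-op.
    void ensure_diffusion_loaded() {
        if (diff_loaded) return;
        dev.alloc(diff_size);
        diff_loaded = true;
    }

    // Runs conditioning, then frees the conditioner (assumed in both modes).
    void run_conditioner() {
        assert(cond_loaded);
        dev.release(cond_size);
        cond_loaded = false;
    }

    // Common sampling entry point: the deferred load happens at the top.
    void sample() {
        ensure_diffusion_loaded();
        // ... denoise steps would run here ...
    }
};
```

Running `run_conditioner()` followed by `sample()` with sizes 12 and 16 gives a simulated peak of 28 on the default path and 16 with sequential loading. Putting the check at the top of `sample()` means every caller gets the deferred load without its own special case, and the call costs nothing once the model is resident.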
Interface
--sequential-load (default off), with a matching sd_ctx_params field. It
only engages when there is a single diffusion model (skipped when a
high-noise/refiner model is also present, to keep this change small).
Scope / limitations
Backend-agnostic: implemented in stable-diffusion.cpp using the existing
alloc_params_buffer() / ModelLoader::load_tensors(); no backend patches,
works on CPU/Vulkan/CUDA alike.
(the conditioner/VAE storages are ignored on that pass), costing a few seconds
of load time that previously happened up front. On hardware where everything
already fits, this is pure overhead with no benefit — the win is specifically
for memory-constrained devices.
Validation
Correctness (Strix Halo 8060S, Vulkan, LTX-2, seed-fixed):
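With --sequential-load, output is bit-identical to the default path
(max |Δpixel| = 0 across frames). The deferred diffusion-model load does not
alter results.
With the flag off, output is byte-identical to before; the default path is unchanged.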
Peak device memory (same config):
Cross-arch (RDNA2 RX 6700 XT, 12 GB, Vulkan, no matrix cores; Flux Schnell Q4, 512², seed 42):
--sequential-load: deferred 776 diffusion-model tensors; first-load 3.3 GB (DiT deferred), DiT allocated + loaded 3.19 s after the conditioner freed;
peak 7.22 GB, 15.1 s wall.
So the peak-memory reduction (~2.5 GB here; ~12 GB on LTX-2) and bit-identical
output are confirmed on a second GPU architecture (RDNA2, no matrix cores), not
just the iGPU it was developed on.
How to test
Compare conditioning time, peak device memory, and output (identical at a fixed
seed).
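For the output comparison, a check along the lines of the max |Δpixel| metric used under Validation could look like the sketch below. It assumes both images have already been decoded into raw 8-bit buffers of equal size; `max_abs_pixel_diff` is a hypothetical helper, not part of stable-diffusion.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Returns the largest absolute per-byte difference between two decoded
// images (or frames). A result of 0 means the buffers are bit-identical.
int max_abs_pixel_diff(const std::vector<std::uint8_t>& a,
                       const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size());  // both runs must use the same resolution
    int worst = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int d = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
        worst = std::max(worst, d);
    }
    return worst;
}
```

Comparing decoded pixels rather than file bytes avoids false mismatches from container metadata; for video, the same function can be applied per frame and the maximum taken across frames.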
Related work
#1470 also defers diffusion-model loading, but as one piece of a larger,
CUDA-focused multi-GPU autofit (it ships a
ggml-cuda.cu patch and a row/tensor split mechanism). This PR is the minimal, backend-agnostic slice of that idea: a
single-device sequential load that needs no backend patch and benefits every
backend today. Offered standalone so the peak-memory win isn't gated on the
larger multi-GPU work.