[Bug] OOM when using --uncond-diffusion-model with Ideogram 4 on multi-GPU with VRAM-constrained --max-vram #1624

@akwmw

Description

Git commit

stable-diffusion.cpp-master-682-b3d56d0

Operating System & Version

debian 12

GGML backends

CUDA

Command-line arguments used

sd-cli -p "a cat" \
  --diffusion-model ideogram4-Q6_K.gguf \
  --uncond-diffusion-model ideogram4_unconditional-iQ4_NL.gguf \
  --llm Qwen3-VL-8B-Instruct-Q5_K_M.gguf \
  --vae flux2-vae.safetensors \
  --offload-to-cpu \
  -H 1024 -W 1024 \
  --backend diffusion=CUDA1,clip=CUDA0,vae=CUDA1 \
  --max-vram 12

Steps to reproduce

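Run the command above with the diffusion model placed on a GPU that has at most 16 GB of VRAM. The failure only reproduces under that limit.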

What you expected to happen

Generation completes without an error.

What actually happened

Generation fails with a CUDA out-of-memory error.

Logs / error messages / stack trace

ideogram4 graph cut max_vram=12288.00 MB merged 35 segments -> 2 segments ← cond pass (correct)
ideogram4 graph cut max_vram=12288.00 MB merged 35 segments -> 1 segments ← uncond pass (wrong!)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 13205.90 MiB on device 1: cudaMalloc failed: out of memory

Additional context / environment details

When running Ideogram 4 with --uncond-diffusion-model on a two-GPU setup where the diffusion GPU has insufficient VRAM to hold both the main model and
the uncond model at once, generation fails with cudaMalloc failed: out of memory — even when --max-vram and --offload-to-cpu should allow the graph
cutter to split the work into segments.

Hardware:

  • GPU 0: RTX 3080 (10 GB) — text encoder
  • GPU 1: Tesla V100 (16 GB) — diffusion model

Root cause:

Ideogram4Runner stores both model and uncond_model weights in a single shared params_ctx. During CFG sampling, two different compute graphs are built:

  1. Cond graph (has text embeddings → uses model) — planner correctly splits into 2 segments
  2. Uncond graph (no text embeddings → uses uncond_model) — graph structure differs, so plan_matches_graph() fails, plan cache is invalidated, and the
    planner rebuilds. The uncond graph is smaller, so the planner merges everything into 1 segment.
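
To illustrate why the smaller uncond graph can collapse into a single segment, here is a minimal sketch of a greedy merge under a VRAM budget. It is not the planner in stable-diffusion.cpp: `count_merged_segments` is a hypothetical helper, and the per-segment costs (400 MiB and 190 MiB) are made-up values chosen only so that the totals land above and below the 12288 MB budget seen in the log.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of stable-diffusion.cpp): greedily merge
// adjacent segments so that each merged group stays within budget_mib.
// Returns the number of merged groups.
std::size_t count_merged_segments(const std::vector<std::uint64_t>& segment_mib,
                                  std::uint64_t budget_mib) {
    assert(budget_mib > 0);
    if (segment_mib.empty()) {
        return 0;
    }
    std::size_t merged = 1;     // at least one group once there is any segment
    std::uint64_t current = 0;  // cost of the group being built
    for (std::uint64_t cost : segment_mib) {
        if (current > 0 && current + cost > budget_mib) {
            // Adding this segment would exceed the budget: start a new group.
            ++merged;
            current = 0;
        }
        current += cost;
    }
    return merged;
}
```

Under these assumed costs, 35 segments of 400 MiB (14000 MiB total) merge into 2 groups, while 35 segments of 190 MiB (6650 MiB total) fit in 1 group, which mirrors the "-> 2 segments" and "-> 1 segments" lines in the log.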

When the plan has 1 segment, should_use_graph_cut_segmented_compute() returns false, and the code falls through to execute_graph() with an empty
runtime_param_tensors vector (file ggml_extend.hpp, ~line 3210). This triggers offload_all_params(), which allocates VRAM for all tensors in params_ctx
— both model AND uncond_model combined (~13.2 GB). The graph cutter's VRAM accounting (graph_cut_segment_vram_bytes) only counted params referenced in
the current graph, so the budget check passed, but the actual allocation far exceeds it.

The mismatch:

  • Plan-time (graph_cut_segment_vram_bytes): counts only params in the current graph (uncond model ~6-7 GB)
  • Execution-time (offload_all_params): allocates ALL params in params_ctx (both models ~13.2 GB)
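
The mismatch can be reproduced with a toy model. This is a sketch under stated assumptions, not the project's code: `ToyParam`, `plan_time_mib`, `offload_all_mib`, and `within_budget` are hypothetical names, and the 6606/6600 MiB split is an illustrative guess that only matches the roughly 13.2 GB total from the log.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Toy stand-in for a weight tensor living in a shared params context.
struct ToyParam {
    std::string name;
    std::uint64_t mib;
};

// Plan-time view: count only the params the current graph references.
std::uint64_t plan_time_mib(const std::vector<ToyParam>& params_ctx,
                            const std::unordered_set<std::string>& graph_refs) {
    std::uint64_t total = 0;
    for (const ToyParam& p : params_ctx) {
        if (graph_refs.count(p.name) != 0) {
            total += p.mib;
        }
    }
    return total;
}

// Execution-time view when every param in the shared context is offloaded.
std::uint64_t offload_all_mib(const std::vector<ToyParam>& params_ctx) {
    std::uint64_t total = 0;
    for (const ToyParam& p : params_ctx) {
        total += p.mib;
    }
    return total;
}

// Budget check as a plain comparison against the configured limit.
bool within_budget(std::uint64_t needed_mib, std::uint64_t budget_mib) {
    assert(budget_mib > 0);
    return needed_mib <= budget_mib;
}
```

With a context holding `{"model", 6606}` and `{"uncond_model", 6600}` and a graph that references only `uncond_model`, the plan-time figure (6600 MiB) passes a 12288 MiB budget while the offload-all figure (13206 MiB) does not.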

Suggested fix:

In the compute() template in ggml_extend.hpp, when the graph-cut plan resolves to 1 segment, extract the segment's param tensor list and pass it to
execute_graph() so it uses offload_partial_params() instead of offload_all_params():

    {
        std::vector<ggml_tensor*> graph_param_tensors;
        if (can_attempt_graph_cut_segmented_compute()) {
            GraphCutPlan plan;
            size_t effective_graph_vram_bytes = 0;
            if (!resolve_graph_cut_plan(gf, &plan, &effective_graph_vram_bytes)) {
                free_compute_ctx();
                return std::nullopt;
            }
            if (should_use_graph_cut_segmented_compute(plan)) {
                // ... existing segmented paths ...
            }
            // Use the plan's param list to avoid offloading unrelated
            // model weights that share the same params_ctx.
            if (plan.valid && plan.segments.size() == 1) {
                graph_param_tensors = sd::ggml_graph_cut::runtime_param_tensors(
                    gf, plan.segments[0], get_desc().c_str());
            }
        }
        if (!alloc_compute_buffer(gf)) { ... }
        return execute_graph(gf, n_threads,
                             free_compute_buffer_immediately,
                             graph_param_tensors,  // was: {}
                             false, no_return);
    }

This ensures that even with a single-segment plan, only the params actually referenced by the current graph are loaded to VRAM — matching what the
segmented path already does correctly via compute_with_graph_cuts().
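
The effect of passing a non-empty list can be sketched with a toy dispatch. This assumes, as described in this report, that an empty runtime list leads to offloading the whole shared context; `ToyTensor`, `tensors_to_offload`, `total_mib`, and `fits_budget` are hypothetical names and the sizes are illustrative, not measured.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for a weight tensor.
struct ToyTensor {
    std::string name;
    std::uint64_t mib;
};

// Toy dispatch: an empty runtime list means "offload everything in the
// shared context"; a non-empty list means "offload only these tensors".
std::vector<ToyTensor> tensors_to_offload(
    const std::vector<ToyTensor>& params_ctx,
    const std::vector<ToyTensor>& runtime_params) {
    return runtime_params.empty() ? params_ctx : runtime_params;
}

// Sum of the sizes of the given tensors, in MiB.
std::uint64_t total_mib(const std::vector<ToyTensor>& tensors) {
    std::uint64_t total = 0;
    for (const ToyTensor& t : tensors) {
        total += t.mib;
    }
    return total;
}

// True when the requested allocation fits the configured limit.
bool fits_budget(std::uint64_t needed_mib, std::uint64_t budget_mib) {
    assert(budget_mib > 0);
    return needed_mib <= budget_mib;
}
```

In this sketch, an empty list selects both toy models (13206 MiB, over a 12288 MiB budget), while a list holding only the uncond tensor selects 6600 MiB, which fits.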

Note: an LLM was used to write this fix. It has not been stress-tested, and no regression tests have been run.
