[Bug] OOM when using --uncond-diffusion-model with Ideogram 4 on multi-GPU with VRAM-constrained --max-vram #1624

### Git commit

stable-diffusion.cpp-master-682-b3d56d0

### Operating System & Version

debian 12

### GGML backends

CUDA

### Command-line arguments used

    sd-cli -p "a cat" \
        --diffusion-model ideogram4-Q6_K.gguf \
        --uncond-diffusion-model ideogram4_unconditional-iQ4_NL.gguf \
        --llm Qwen3-VL-8B-Instruct-Q5_K_M.gguf \
        --vae flux2-vae.safetensors \
        --offload-to-cpu \
        -H 1024 -W 1024 \
        --backend diffusion=CUDA1,clip=CUDA0,vae=CUDA1 \
        --max-vram 12

### Steps to reproduce

Run the command above with the diffusion model placed on a GPU that has at most 16 GB of VRAM.

### What you expected to happen

Generation completes without an out-of-memory error.

### What actually happened

Generation fails with `cudaMalloc failed: out of memory` on the diffusion GPU.

### Logs / error messages / stack trace

    ideogram4 graph cut max_vram=12288.00 MB merged 35 segments -> 2 segments   ← cond pass (correct)
    ideogram4 graph cut max_vram=12288.00 MB merged 35 segments -> 1 segments   ← uncond pass (wrong!)
    ggml_backend_cuda_buffer_type_alloc_buffer: allocating 13205.90 MiB on device 1: cudaMalloc failed: out of memory

### Additional context / environment details


  When running Ideogram 4 with --uncond-diffusion-model on a two-GPU setup where the diffusion GPU has insufficient VRAM to hold both the main model and
  the uncond model at once, generation fails with cudaMalloc failed: out of memory — even when --max-vram and --offload-to-cpu should allow the graph
  cutter to split the work into segments.

  Hardware:
  - GPU 0: RTX 3080 (10 GB) — text encoder
  - GPU 1: Tesla V100 (16 GB) — diffusion model


  Root cause:

  Ideogram4Runner stores both model and uncond_model weights in a single shared params_ctx. During CFG sampling, two different compute graphs are built:
  
  1. Cond graph (has text embeddings → uses model) — planner correctly splits into 2 segments
  2. Uncond graph (no text embeddings → uses uncond_model) — graph structure differs, so plan_matches_graph() fails, plan cache is invalidated, and the
  planner rebuilds. The uncond graph is smaller, so the planner merges everything into 1 segment.
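To illustrate why a smaller graph can collapse into a single segment, here is a minimal sketch of greedy segment merging under a VRAM budget. It is not the project's planner: `merge_segments` is a hypothetical helper, and it assumes that segment costs are simply additive, which the real accounting may not be.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of stable-diffusion.cpp): greedily merge
// consecutive segments while the merged segment still fits in the budget.
// Returns the sizes (in MB) of the merged segments.
std::vector<std::size_t> merge_segments(const std::vector<std::size_t>& segment_mb,
                                        std::size_t budget_mb) {
    assert(budget_mb > 0);
    std::vector<std::size_t> merged;
    for (std::size_t mb : segment_mb) {
        if (!merged.empty() && merged.back() + mb <= budget_mb) {
            merged.back() += mb;   // still fits: extend the current segment
        } else {
            merged.push_back(mb);  // first segment, or budget exceeded: start a new one
        }
    }
    return merged;
}
```

With illustrative numbers, 35 segments of 400 MB each against a 12288 MB budget merge into 2 segments, while 35 segments of 190 MB each merge into 1. The second case mirrors the uncond pass in the logs: the plan itself is reasonable for the graph it sees, so the problem has to be in what happens after planning.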

  When the plan has 1 segment, should_use_graph_cut_segmented_compute() returns false, and the code falls through to execute_graph() with an empty
  runtime_param_tensors vector (file ggml_extend.hpp, ~line 3210). This triggers offload_all_params(), which allocates VRAM for all tensors in params_ctx
  — both model AND uncond_model combined (~13.2 GB). The graph cutter's VRAM accounting (graph_cut_segment_vram_bytes) only counted params referenced in
  the current graph, so the budget check passed, but the actual allocation far exceeds it.

  The mismatch:
  - Plan-time (graph_cut_segment_vram_bytes): counts only params in the current graph (uncond model ~6-7 GB)
  - Execution-time (offload_all_params): allocates ALL params in params_ctx (both models ~13.2 GB)
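The mismatch can be reproduced in a toy model. The sketch below is an assumption-laden illustration, not code from ggml_extend.hpp: `Param`, `referenced_param_mb`, `all_param_mb`, and `fits` are hypothetical names, and the sizes used with it are rounded guesses based on the figures in this report.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a weight tensor stored in a shared params context.
struct Param {
    std::string name;
    std::size_t mb;
};

// Plan-time style accounting: count only params the current graph references.
std::size_t referenced_param_mb(const std::vector<Param>& ctx,
                                const std::unordered_set<std::string>& used) {
    std::size_t total = 0;
    for (const Param& p : ctx) {
        if (used.count(p.name) != 0) total += p.mb;
    }
    return total;
}

// Execution-time "offload everything": count every param in the shared context.
std::size_t all_param_mb(const std::vector<Param>& ctx) {
    std::size_t total = 0;
    for (const Param& p : ctx) total += p.mb;
    return total;
}

// Budget check shared by both views.
bool fits(std::size_t mb, std::size_t budget_mb) {
    assert(budget_mb > 0);
    return mb <= budget_mb;
}
```

With a context holding an illustrative 6700 MB `model` and 6500 MB `uncond_model`, a graph that references only `uncond_model`, and a 12288 MB budget, `referenced_param_mb` reports 6500 MB (fits) while `all_param_mb` reports 13200 MB (does not fit). The budget check and the allocation disagree because they measure different sets of tensors.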

  Suggested fix:

  In the compute() template in ggml_extend.hpp, when the graph-cut plan resolves to 1 segment, extract the segment's param tensor list and pass it to
  execute_graph() so it uses offload_partial_params() instead of offload_all_params():

  {
      std::vector<ggml_tensor*> graph_param_tensors;
      if (can_attempt_graph_cut_segmented_compute()) {
          GraphCutPlan plan;
          size_t effective_graph_vram_bytes = 0;
          if (!resolve_graph_cut_plan(gf, &plan, &effective_graph_vram_bytes)) {
              free_compute_ctx();
              return std::nullopt;
          }
          if (should_use_graph_cut_segmented_compute(plan)) {
              // ... existing segmented paths ...
          }
          // Use the plan's param list to avoid offloading unrelated
          // model weights that share the same params_ctx.
          if (plan.valid && plan.segments.size() == 1) {
              graph_param_tensors = sd::ggml_graph_cut::runtime_param_tensors(
                  gf, plan.segments[0], get_desc().c_str());
          }
      }
      if (!alloc_compute_buffer(gf)) { ... }
      return execute_graph<T>(gf, n_threads,
                              free_compute_buffer_immediately,
                              graph_param_tensors, // was: {}
                              false, no_return);
  }

  This ensures that even with a single-segment plan, only the params actually referenced by the current graph are loaded to VRAM — matching what the
  segmented path already does correctly via compute_with_graph_cuts().
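For readers unfamiliar with the idea of a per-graph parameter list, the following sketch shows one way such a list could be derived: walk the graph nodes and collect each distinct weight tensor that appears as a source. It is only a conceptual model under stated assumptions. `Tensor` and `collect_graph_params` are hypothetical, and nothing here is claimed to match how `runtime_param_tensors` is actually implemented.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified tensor: a name, a flag marking model weights,
// and pointers to the tensors it is computed from.
struct Tensor {
    std::string name;
    bool is_param;
    std::vector<const Tensor*> src;
};

// Collect each weight tensor referenced by the graph exactly once,
// in first-use order. Weights that no node reads are never returned.
std::vector<const Tensor*> collect_graph_params(const std::vector<const Tensor*>& nodes) {
    std::vector<const Tensor*> params;
    std::unordered_set<const Tensor*> seen;
    for (const Tensor* node : nodes) {
        assert(node != nullptr);
        for (const Tensor* s : node->src) {
            if (s != nullptr && s->is_param && seen.insert(s).second) {
                params.push_back(s);
            }
        }
    }
    return params;
}
```

In this model, a graph whose nodes read only the uncond weights yields a list without any of the main model's weights, even though both sets live in the same context. Offloading only that list keeps the allocation in line with what the planner counted.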

Note: an LLM was used to write the suggested fix. It has not been stress-tested, and no regression tests have been run.
