Git commit
stable-diffusion.cpp-master-682-b3d56d0
Operating System & Version
debian 12
GGML backends
CUDA
Command-line arguments used
sd-cli -p "a cat" \
  --diffusion-model ideogram4-Q6_K.gguf \
  --uncond-diffusion-model ideogram4_unconditional-iQ4_NL.gguf \
  --llm Qwen3-VL-8B-Instruct-Q5_K_M.gguf \
  --vae flux2-vae.safetensors \
  --offload-to-cpu \
  -H 1024 -W 1024 \
  --backend diffusion=CUDA1,clip=CUDA0,vae=CUDA1 \
  --max-vram 12
Steps to reproduce
Run the command above with the diffusion model placed on a GPU that has 16 GB of VRAM (or is limited to 16 GB).
What you expected to happen
Generation completes without an error.
What actually happened
Generation fails with a CUDA out-of-memory error.
Logs / error messages / stack trace
ideogram4 graph cut max_vram=12288.00 MB merged 35 segments -> 2 segments ← cond pass (correct)
ideogram4 graph cut max_vram=12288.00 MB merged 35 segments -> 1 segments ← uncond pass (wrong!)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 13205.90 MiB on device 1: cudaMalloc failed: out of memory
Additional context / environment details
When running Ideogram 4 with --uncond-diffusion-model on a two-GPU setup where the diffusion GPU has insufficient VRAM to hold both the main model and
the uncond model at once, generation fails with cudaMalloc failed: out of memory — even when --max-vram and --offload-to-cpu should allow the graph
cutter to split the work into segments.
Hardware:
- GPU 0: RTX 3080 (10 GB) — text encoder
- GPU 1: Tesla V100 (16 GB) — diffusion model
Root cause:
Ideogram4Runner stores both model and uncond_model weights in a single shared params_ctx. During CFG sampling, two different compute graphs are built:
- Cond graph (has text embeddings → uses model) — planner correctly splits into 2 segments
- Uncond graph (no text embeddings → uses uncond_model) — graph structure differs, so plan_matches_graph() fails, plan cache is invalidated, and the
planner rebuilds. The uncond graph is smaller, so the planner merges everything into 1 segment.
When the plan has 1 segment, should_use_graph_cut_segmented_compute() returns false, and the code falls through to execute_graph() with an empty
runtime_param_tensors vector (file ggml_extend.hpp, ~line 3210). This triggers offload_all_params(), which allocates VRAM for all tensors in params_ctx
— both model AND uncond_model combined (~13.2 GB). The graph cutter's VRAM accounting (graph_cut_segment_vram_bytes) only counted params referenced in
the current graph, so the budget check passed, but the actual allocation far exceeds it.
The mismatch:
- Plan-time (graph_cut_segment_vram_bytes): counts only params in the current graph (uncond model ~6-7 GB)
- Execution-time (offload_all_params): allocates ALL params in params_ctx (both models ~13.2 GB)
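The mismatch above can be reproduced with plain arithmetic. The following is a minimal sketch, not code from stable-diffusion.cpp: ToyParam, example_params_ctx, plan_time_mib and offload_all_mib are hypothetical names, and the sizes (6700 MiB and 6500 MiB) are illustrative values chosen only to roughly match the ~13.2 GB total reported in the log.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a weight tensor stored in a shared params context.
struct ToyParam {
    std::string name;
    std::size_t mib;  // size in MiB (illustrative, not measured)
};

// Assumed layout: both models live in the same context.
std::vector<ToyParam> example_params_ctx() {
    return {{"model", 6700}, {"uncond_model", 6500}};
}

// Plan-time view: count only the params the current graph references.
std::size_t plan_time_mib(const std::vector<ToyParam>& params_ctx,
                          const std::unordered_set<std::string>& referenced) {
    std::size_t total = 0;
    for (const ToyParam& p : params_ctx) {
        if (referenced.count(p.name) != 0) {
            total += p.mib;
        }
    }
    return total;
}

// Execution-time view when every tensor in the context is offloaded.
std::size_t offload_all_mib(const std::vector<ToyParam>& params_ctx) {
    std::size_t total = 0;
    for (const ToyParam& p : params_ctx) {
        total += p.mib;
    }
    return total;
}
```

With a 12288 MiB budget (--max-vram 12), the plan-time figure for a graph that references only uncond_model fits comfortably, while the offload-all figure exceeds the budget. That is the same shape of failure as in the log.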
Suggested fix:
In the compute() template in ggml_extend.hpp, when the graph-cut plan resolves to 1 segment, extract the segment's param tensor list and pass it to
execute_graph() so it uses offload_partial_params() instead of offload_all_params():
{
    std::vector<ggml_tensor*> graph_param_tensors;
    if (can_attempt_graph_cut_segmented_compute()) {
        GraphCutPlan plan;
        size_t effective_graph_vram_bytes = 0;
        if (!resolve_graph_cut_plan(gf, &plan, &effective_graph_vram_bytes)) {
            free_compute_ctx();
            return std::nullopt;
        }
        if (should_use_graph_cut_segmented_compute(plan)) {
            // ... existing segmented paths ...
        }
        // Use the plan's param list to avoid offloading unrelated
        // model weights that share the same params_ctx.
        if (plan.valid && plan.segments.size() == 1) {
            graph_param_tensors = sd::ggml_graph_cut::runtime_param_tensors(
                gf, plan.segments[0], get_desc().c_str());
        }
    }
    if (!alloc_compute_buffer(gf)) { ... }
    return execute_graph(gf, n_threads,
                         free_compute_buffer_immediately,
                         graph_param_tensors,  // was: {}
                         false, no_return);
}
This ensures that even with a single-segment plan, only the params actually referenced by the current graph are loaded to VRAM — matching what the
segmented path already does correctly via compute_with_graph_cuts().
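To illustrate why passing a non-empty list matters, here is a toy model of the dispatch described in this report. It is a sketch under the assumption that an empty list means "offload the whole context" and a non-empty list means "offload only the listed tensors". ToyTensor and mib_to_offload are hypothetical names, not the real execute_graph() or offload functions, and the sizes are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a weight tensor; only its size matters here.
struct ToyTensor {
    std::size_t mib;  // size in MiB (illustrative)
};

// Mirrors the behaviour described in this report, not the real implementation:
//  - empty runtime_params     -> every tensor in the shared context is offloaded
//  - non-empty runtime_params -> only the listed tensors are offloaded
std::size_t mib_to_offload(const std::vector<ToyTensor>& params_ctx,
                           const std::vector<const ToyTensor*>& runtime_params) {
    std::size_t total = 0;
    if (runtime_params.empty()) {
        for (const ToyTensor& t : params_ctx) {
            total += t.mib;
        }
        return total;
    }
    for (const ToyTensor* t : runtime_params) {
        assert(t != nullptr);  // a listed tensor must exist
        total += t->mib;
    }
    return total;
}
```

In this toy model, a context holding a 6700 MiB and a 6500 MiB tensor costs 13200 MiB with an empty list, but only 6500 MiB when the list names the second tensor alone.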
Note: an LLM was used to write this fix. It has not been stress-tested, and no regression tests have been run.