UPSTREAM PR #1261: refactor: move VAE tiling parameters to SDGenerationParams #63
Overview

Analysis of the stable-diffusion.cpp refactoring commit (2367bc7: "move VAE tiling parameters to SDGenerationParams") across 48,374 functions shows minimal performance impact. Modified: 78 functions; New: 80; Removed: 80; Unchanged: 48,136. Binaries Analyzed:
The refactoring moves VAE tiling parameters from context initialization to per-generation configuration, enabling flexible memory management with acceptable performance trade-offs.

Function Analysis

Configuration Parsing (Initialization Only): SDContextParams::get_options() improved in both binaries: response time -6.6% (sd-server: 279,572ns → 261,119ns; sd-cli: 280,187ns → 261,795ns) and throughput time -7.6% to -9.6%, due to the removal of 4 VAE tiling options. This simplification reduced branching and parsing overhead. SDGenerationParams::get_options() regressed consistently: response time +5.95% to +5.96% (sd-server: 306,582ns → 324,830ns; sd-cli: 307,317ns → 325,643ns) and throughput time +6.11%, due to the addition of the same 4 options with complex parsing logic. The ~200ns self-time increase reflects additional option registration overhead. SDGenerationParams::to_string() (sd-cli) regressed +17.4% in throughput time (1,714ns → 2,012ns) from serializing 6 additional vae_tiling_params fields, which is expected for a diagnostic function.

GGML Backend (Model Loading/Inference): make_block_q4_Kx8 (sd-server) regressed +7.9% (8,126ns → 8,768ns) in both response and throughput time, indicating intrinsic overhead in quantization repacking. This affects model loading, not the inference hot path. forward_mul_mat for block_iq4_nl (sd-server) shows a +5.38% response time regression (12,916ns → 13,611ns) while throughput time remains stable (2,390ns), indicating a slowdown in child functions rather than a change in the function's own implementation. This matrix multiplication function is inference-critical, though the stable self-time suggests the impact is indirect.

Standard Library Optimizations: Several functions improved significantly: std::make_move_iterator -58.6% response time (287ns → 119ns), __gnu_cxx::__normal_iterator::operator+ -42.1% (165ns → 95ns), std::swap -11% (112ns → 100ns), std::__unique -5.8% response time. These compiler optimizations partially offset the regressions.
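The percentages and the "child function slowdown" reasoning above can be reproduced with a little arithmetic. The following is a minimal sketch, assuming that response time covers a function plus everything it calls and that throughput time corresponds to self-time, as the report's wording implies. The `Timing` struct and both helper names are hypothetical and not part of the analyzed code base.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical record for one profiled function; all times in nanoseconds.
struct Timing {
    double response_ns;  // the function plus everything it calls (assumed meaning)
    double self_ns;      // the function's own body only (assumed to match "throughput time")
};

// Relative change from `before` to `after`, in percent.
// Negative values are improvements, positive values are regressions.
double percent_change(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// Time spent in callees: whatever part of the response time is not self-time.
double callee_ns(const Timing& t) {
    return t.response_ns - t.self_ns;
}
```

With the forward_mul_mat figures from the report (response 12,916ns → 13,611ns, self-time 2,390ns in both runs), `percent_change` gives roughly +5.38%, and `callee_ns` rises from 10,526ns to 11,221ns. Under the stated assumption, the full 695ns increase is attributable to callees.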
Other analyzed functions (JSON access, regex compilation, vector reallocation) showed minor self-time variations with negligible impact on total execution time.

Additional Findings

The architectural refactoring achieves its goal of enabling per-generation VAE tiling control with minimal cost. Configuration parsing improvements offset the regressions, resulting in balanced initialization performance. Most performance changes affect initialization rather than inference hot paths. The forward_mul_mat regression warrants monitoring in production, though its stable self-time suggests that the function's implementation is unchanged and that the slowdown originates in its GGML dependencies. Power consumption increases (<1%) are negligible for image generation workloads that take seconds to minutes per image.

🔎 Full breakdown: Loci Inspector.
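To illustrate what per-generation VAE tiling control means in practice, here is a minimal sketch. All type names, field names, and the pixel-budget policy are hypothetical stand-ins chosen for illustration; they are not the definitions used in stable-diffusion.cpp.

```cpp
#include <cassert>

// Hypothetical tiling settings; the real structure may have different fields.
struct TilingParams {
    bool enabled = false;
    int tile_size = 0;  // 0 = leave the choice to the implementation (assumption)
};

// Hypothetical context-level settings, fixed when the model is loaded.
// After the refactor, no tiling fields live here.
struct ContextParams {
    int n_threads = 1;
};

// Hypothetical per-request settings; tiling is now chosen per generation.
struct GenerationParams {
    int width = 512;
    int height = 512;
    TilingParams vae_tiling;
};

// Illustrative policy: enable tiling only when the image exceeds a pixel budget.
GenerationParams make_request(int width, int height, long long max_untiled_pixels) {
    assert(width > 0 && height > 0);
    GenerationParams p;
    p.width = width;
    p.height = height;
    p.vae_tiling.enabled =
        static_cast<long long>(width) * height > max_untiled_pixels;
    return p;
}
```

With a single loaded context, a 512×512 request can then run untiled while a 2048×2048 request turns tiling on, which is the kind of flexibility the move out of context initialization is meant to allow.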
Note
Source pull request: leejet/stable-diffusion.cpp#1261