Describe the bug
While reviewing the current main-branch implementation of pipeline_lumina2, I noticed a potential bug in the calculation of mu inside the pipeline's __call__ method.
In the following section of the code:
diffusers/src/diffusers/pipelines/lumina2/pipeline_lumina2.py
Lines 484 to 503 in 5ffb658
```python
def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None):
    # VAE applies 8x compression on images but we must also account for packing which requires
    # latent height and width to be divisible by 2.
    height = 2 * (int(height) // (self.vae_scale_factor * 2))
    width = 2 * (int(width) // (self.vae_scale_factor * 2))

    shape = (batch_size, num_channels_latents, height, width)

    if isinstance(generator, list) and len(generator) != batch_size:
        raise ValueError(
            f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
            f" size of {batch_size}. Make sure the batch size matches the length of the generators."
        )

    if latents is None:
        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
    else:
        latents = latents.to(device)

    return latents
```
The latent tensor therefore has the shape `(batch_size, num_channels_latents, height, width)`, so `latents.shape[1]` is the channel axis.
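To make the axis layout concrete, here is a small standalone sketch; the channel count of 16 is an assumption for illustration, the real value comes from the checkpoint's config:

```python
import torch

# Illustrative values; num_channels_latents = 16 is an assumption here.
batch_size, num_channels_latents = 1, 16
vae_scale_factor = 8
height, width = 1024, 1024

# Same arithmetic as prepare_latents above.
height = 2 * (int(height) // (vae_scale_factor * 2))  # -> 128
width = 2 * (int(width) // (vae_scale_factor * 2))    # -> 128

latents = torch.randn(batch_size, num_channels_latents, height, width)
print(latents.shape)     # torch.Size([1, 16, 128, 128])
print(latents.shape[1])  # 16 -- the channel axis, not a token count
```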
However, later in the same file:
diffusers/src/diffusers/pipelines/lumina2/pipeline_lumina2.py
Lines 699 to 706 in 5ffb658
```python
image_seq_len = latents.shape[1]
mu = calculate_shift(
    image_seq_len,
    self.scheduler.config.get("base_image_seq_len", 256),
    self.scheduler.config.get("max_image_seq_len", 4096),
    self.scheduler.config.get("base_shift", 0.5),
    self.scheduler.config.get("max_shift", 1.15),
)
```
the value `latents.shape[1]` (i.e., `num_channels_latents`) is passed as the `image_seq_len` argument when computing `mu`.
This seems incorrect: `image_seq_len` should represent the number of image tokens (the sequence length), not the number of latent channels.
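For comparison, the same `latents.shape[1]` indexing is correct in FLUX-style pipelines, because those pack latents into a `(batch, sequence, channels)` layout before this point. A minimal sketch of that 2x2 patch packing (mirroring what FLUX's `_pack_latents` does) shows the difference; the Lumina2 pipeline never performs this step, so its latents stay in `(B, C, H, W)`:

```python
import torch

# FLUX-style 2x2 patch packing: after this, dim 1 really is a token count.
b, c, h, w = 1, 16, 128, 128
latents = torch.randn(b, c, h, w)
packed = latents.view(b, c, h // 2, 2, w // 2, 2)
packed = packed.permute(0, 2, 4, 1, 3, 5).reshape(b, (h // 2) * (w // 2), c * 4)
print(packed.shape)      # torch.Size([1, 4096, 64])
print(packed.shape[1])   # 4096 -- a genuine sequence length

# Lumina2's unpacked latents, by contrast:
print(latents.shape[1])  # 16 -- channels
```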
Expected Behavior
`image_seq_len` should likely correspond to the number of spatial tokens derived from `(height, width)` (or another tokenization step), rather than to the number of latent channels.
Actual Behavior
The current implementation uses `latents.shape[1]` as `image_seq_len`, which likely leads to unintended behavior in the computation of `mu` and in subsequent sampling steps.
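To quantify the effect, here is a standalone re-implementation of `calculate_shift`'s linear interpolation (mirroring the helper used across diffusers' flow-matching pipelines, with the defaults from the snippet above), fed with both the wrong and the intended argument:

```python
def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096,
                    base_shift=0.5, max_shift=1.15):
    # Linear interpolation between base_shift and max_shift over the
    # sequence-length range.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b

# Bug: the channel count (16) is passed, so mu ignores image resolution.
print(calculate_shift(16))       # ~0.459, identical for every resolution
# Intended: token count of a 1024x1024 image (128x128 latent, 2x2 patches).
print(calculate_shift(64 * 64))  # 1.15
```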
Suggested Fix
Review the logic where `image_seq_len` is passed, and ensure it reflects the actual sequence-length dimension (derived from the spatial resolution or token count rather than the channel count), as sketched below.
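One possible direction, sketched under the assumption that Lumina2's transformer uses a patch size of 2; the actual value should be read from the transformer config rather than hard-coded:

```python
# Hypothetical fix: derive the token count from the latent's spatial axes.
# patch_size = 2 is an assumption; read it from the transformer config.
patch_size = 2
image_seq_len = (latents.shape[2] // patch_size) * (latents.shape[3] // patch_size)
mu = calculate_shift(
    image_seq_len,
    self.scheduler.config.get("base_image_seq_len", 256),
    self.scheduler.config.get("max_image_seq_len", 4096),
    self.scheduler.config.get("base_shift", 0.5),
    self.scheduler.config.get("max_shift", 1.15),
)
```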
Reproduction
At the moment, I don't have a copy/paste runnable MRE, since this was identified via manual logic review rather than by reproducing the behavior at runtime.
Logs
System Info
Diffusers==0.36.0
Python==3.13
Who can help?
No response