
Is Lumina2Pipeline's mu calculation correct? #12913

@hwangdonghyun

Describe the bug

While reviewing the current main-branch implementation of pipeline_lumina2, I noticed a potential bug in how mu is calculated inside the pipeline's __call__ method.

In the following section of the code:

def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None):
    # VAE applies 8x compression on images but we must also account for packing which requires
    # latent height and width to be divisible by 2.
    height = 2 * (int(height) // (self.vae_scale_factor * 2))
    width = 2 * (int(width) // (self.vae_scale_factor * 2))
    shape = (batch_size, num_channels_latents, height, width)
    if isinstance(generator, list) and len(generator) != batch_size:
        raise ValueError(
            f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
            f" size of {batch_size}. Make sure the batch size matches the length of the generators."
        )
    if latents is None:
        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
    else:
        latents = latents.to(device)
    return latents

The latent tensor appears to have the shape:

(batch_size, num_channels_latents, height, width)
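
For concreteness, here is what that works out to at 1024x1024; the 16-channel figure below is an assumption about Lumina2's VAE config, not something shown in the snippet above:

# Illustrative numbers only; num_channels_latents = 16 is an assumption.
vae_scale_factor = 8
num_channels_latents = 16
batch_size, height, width = 1, 1024, 1024

latent_height = 2 * (int(height) // (vae_scale_factor * 2))  # 128
latent_width = 2 * (int(width) // (vae_scale_factor * 2))    # 128
shape = (batch_size, num_channels_latents, latent_height, latent_width)
print(shape)  # (1, 16, 128, 128) -> shape[1] is the channel count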

However, later in the same file:

image_seq_len = latents.shape[1]
mu = calculate_shift(
    image_seq_len,
    self.scheduler.config.get("base_image_seq_len", 256),
    self.scheduler.config.get("max_image_seq_len", 4096),
    self.scheduler.config.get("base_shift", 0.5),
    self.scheduler.config.get("max_shift", 1.15),
)

the value latents.shape[1] (i.e., num_channels_latents) is passed as the argument for image_seq_len when computing mu.
This seems incorrect: image_seq_len should represent the number of image tokens (the sequence length), not the number of latent channels. Since prepare_latents never packs the latents into a token sequence, shape[1] is the channel count, which is constant across resolutions, so mu would not scale with image size as intended.

Expected Behavior

image_seq_len should likely correspond to the number of spatial tokens derived from (height, width) (or another tokenization step), rather than the number of latent channels.

Actual Behavior

The current implementation uses latents.shape[1] as image_seq_len, which likely leads to unintended behavior in the computation of mu and in the subsequent sampling steps.

Suggested Fix

Review the logic where image_seq_len is passed and ensure it reflects the actual sequence-length dimension (derived from the spatial resolution or token count rather than the channel count); one possible sketch is shown below.
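
A minimal sketch of one possible fix, assuming the transformer patchifies latents with a patch size of 2 (an assumption about Lumina2's architecture, not confirmed here):

# Sketch only; patch_size = 2 is an assumption about Lumina2's patchify step.
patch_size = 2
latent_height, latent_width = latents.shape[2], latents.shape[3]
image_seq_len = (latent_height // patch_size) * (latent_width // patch_size)
mu = calculate_shift(
    image_seq_len,
    self.scheduler.config.get("base_image_seq_len", 256),
    self.scheduler.config.get("max_image_seq_len", 4096),
    self.scheduler.config.get("base_shift", 0.5),
    self.scheduler.config.get("max_shift", 1.15),
)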

Reproduction

At the moment I don't have a copy/paste runnable MRE, since this was identified through manual code review rather than by reproducing the behavior at runtime. The mismatch itself, however, can be illustrated with a small standalone calculation (see the sketch below).
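
A minimal sketch, assuming a 16-channel VAE and a patch size of 2 (both assumptions for illustration; no model weights needed):

# Standalone illustration; 16 channels and patch size 2 are assumptions.
num_channels_latents, vae_scale_factor, patch_size = 16, 8, 2

for res in (512, 1024, 2048):
    latent = 2 * (res // (vae_scale_factor * 2))
    tokens = (latent // patch_size) ** 2
    # shape[1] would always be 16 here, while the token count scales:
    print(f"{res}px -> shape[1]={num_channels_latents}, tokens={tokens}")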

Logs

System Info

diffusers==0.36.0
Python 3.13

Who can help?

No response
