Describe the bug
While reviewing the current main-branch implementation of pipeline_lumina2, I noticed a potential bug in the calculation of mu inside the pipeline's __call__ method.
In the following section of the code:
diffusers/src/diffusers/pipelines/lumina2/pipeline_lumina2.py
Lines 484 to 503 in 5ffb658
```python
def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None):
    # VAE applies 8x compression on images but we must also account for packing which requires
    # latent height and width to be divisible by 2.
    height = 2 * (int(height) // (self.vae_scale_factor * 2))
    width = 2 * (int(width) // (self.vae_scale_factor * 2))

    shape = (batch_size, num_channels_latents, height, width)

    if isinstance(generator, list) and len(generator) != batch_size:
        raise ValueError(
            f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
            f" size of {batch_size}. Make sure the batch size matches the length of the generators."
        )

    if latents is None:
        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
    else:
        latents = latents.to(device)

    return latents
```
The latent tensor therefore has the shape `(batch_size, num_channels_latents, height, width)`, so `latents.shape[1]` is the channel axis.
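To make the axis layout concrete, here is a small standalone sketch; the channel count of 16 is an assumption for illustration, the real value comes from the checkpoint's config:

```python
import torch

# Illustrative values; num_channels_latents = 16 is an assumption here.
batch_size, num_channels_latents = 1, 16
vae_scale_factor = 8
height, width = 1024, 1024

# Same arithmetic as prepare_latents above.
height = 2 * (int(height) // (vae_scale_factor * 2))  # -> 128
width = 2 * (int(width) // (vae_scale_factor * 2))    # -> 128

latents = torch.randn(batch_size, num_channels_latents, height, width)
print(latents.shape)     # torch.Size([1, 16, 128, 128])
print(latents.shape[1])  # 16 -- the channel axis, not a token count
```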
However, later in the same file:
diffusers/src/diffusers/pipelines/lumina2/pipeline_lumina2.py
Lines 699 to 706 in 5ffb658
```python
image_seq_len = latents.shape[1]
mu = calculate_shift(
    image_seq_len,
    self.scheduler.config.get("base_image_seq_len", 256),
    self.scheduler.config.get("max_image_seq_len", 4096),
    self.scheduler.config.get("base_shift", 0.5),
    self.scheduler.config.get("max_shift", 1.15),
)
```
the value `latents.shape[1]` (i.e., `num_channels_latents`) is passed as the `image_seq_len` argument when computing `mu`.
This seems incorrect: `image_seq_len` should represent the number of image tokens (the sequence length), not the number of latent channels.
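For comparison, the same `latents.shape[1]` indexing is correct in FLUX-style pipelines, because those pack latents into a `(batch, sequence, channels)` layout before this point. A minimal sketch of that 2x2 patch packing (mirroring what FLUX's `_pack_latents` does) shows the difference; the Lumina2 pipeline never performs this step, so its latents stay in `(B, C, H, W)`:

```python
import torch

# FLUX-style 2x2 patch packing: after this, dim 1 really is a token count.
b, c, h, w = 1, 16, 128, 128
latents = torch.randn(b, c, h, w)
packed = latents.view(b, c, h // 2, 2, w // 2, 2)
packed = packed.permute(0, 2, 4, 1, 3, 5).reshape(b, (h // 2) * (w // 2), c * 4)
print(packed.shape)      # torch.Size([1, 4096, 64])
print(packed.shape[1])   # 4096 -- a genuine sequence length

# Lumina2's unpacked latents, by contrast:
print(latents.shape[1])  # 16 -- channels
```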
Expected Behavior
`image_seq_len` should likely correspond to the number of spatial tokens derived from `(height, width)` (or another tokenization step), rather than to the number of latent channels.
Actual Behavior
The current implementation uses `latents.shape[1]` as `image_seq_len`, which likely leads to unintended behavior in the computation of `mu` and in subsequent sampling steps.
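To quantify the effect, here is a standalone re-implementation of `calculate_shift`'s linear interpolation (mirroring the helper used across diffusers' flow-matching pipelines, with the defaults from the snippet above), fed with both the wrong and the intended argument:

```python
def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096,
                    base_shift=0.5, max_shift=1.15):
    # Linear interpolation between base_shift and max_shift over the
    # sequence-length range.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b

# Bug: the channel count (16) is passed, so mu ignores image resolution.
print(calculate_shift(16))       # ~0.459, identical for every resolution
# Intended: token count of a 1024x1024 image (128x128 latent, 2x2 patches).
print(calculate_shift(64 * 64))  # 1.15
```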
Suggested Fix
Review the logic where `image_seq_len` is passed, and ensure it reflects the actual sequence-length dimension (derived from the spatial resolution or token count rather than the channel count), as sketched below.
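One possible direction, sketched under the assumption that Lumina2's transformer uses a patch size of 2; the actual value should be read from the transformer config rather than hard-coded:

```python
# Hypothetical fix: derive the token count from the latent's spatial axes.
# patch_size = 2 is an assumption; read it from the transformer config.
patch_size = 2
image_seq_len = (latents.shape[2] // patch_size) * (latents.shape[3] // patch_size)
mu = calculate_shift(
    image_seq_len,
    self.scheduler.config.get("base_image_seq_len", 256),
    self.scheduler.config.get("max_image_seq_len", 4096),
    self.scheduler.config.get("base_shift", 0.5),
    self.scheduler.config.get("max_shift", 1.15),
)
```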
Reproduction
At the moment, I don't have a copy/paste runnable MRE, since this was identified via manual logic review rather than by reproducing the behavior at runtime.
Logs
System Info
Diffusers==0.36.0
Python==3.13
Who can help?
No response