15 changes: 9 additions & 6 deletions docs/source/en/api/models/anyflow_far_transformer3d.md
@@ -13,19 +13,22 @@ specific language governing permissions and limitations under the License.
# AnyFlowFARTransformer3DModel

The causal (FAR) 3D Transformer used by [`AnyFlowFARPipeline`](../pipelines/anyflow#anyflowfarpipeline) —
the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS
ShowLab × NVIDIA). It extends the v0.35.1 Wan2.1 backbone with three additions:
the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724). See the
[`AnyFlowFARPipeline`](../pipelines/anyflow) page for paper, authors, and released checkpoints. It extends
the v0.35.1 Wan2.1 backbone with three additions:

1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting frame-level autoregressive
generation as introduced in [FAR (Gu et al., 2025)](https://arxiv.org/abs/2503.19325).
1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting chunk-wise autoregressive
generation as introduced in [FAR](https://huggingface.co/papers/2503.19325).
2. **Compressed-frame patch embedding** (`far_patch_embedding`) for context (already-generated) frames,
warm-started from the full-resolution `patch_embedding` at construction time via trilinear interpolation.
3. **Dual-timestep flow-map embedding** (same as
[`AnyFlowTransformer3DModel`](anyflow_transformer3d)) — every forward call conditions on both the source
timestep ``t`` and the target timestep ``r``.
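
The chunk-wise causal pattern from the first item can be illustrated at the frame level. The following is a minimal sketch, not the model's implementation: it assumes every latent frame may attend to frames in its own chunk and in all earlier chunks, and it ignores the token-level layout that `flex_attention` operates on as well as any special handling of compressed context frames. The helper names `chunk_ids` and `chunk_causal_mask` are hypothetical.

```python
def chunk_ids(chunk_partition):
    """Map each latent frame index to the index of the chunk that contains it."""
    ids = []
    for chunk_index, size in enumerate(chunk_partition):
        ids.extend([chunk_index] * size)
    return ids


def chunk_causal_mask(chunk_partition):
    """mask[q][k] is True when query frame q may attend to key frame k.

    Assumption: attention is allowed within a chunk and towards earlier chunks only.
    """
    ids = chunk_ids(chunk_partition)
    return [[k_chunk <= q_chunk for k_chunk in ids] for q_chunk in ids]


# Print the mask for a small partition; "x" marks an allowed (query, key) pair.
for row in chunk_causal_mask([1, 2, 2]):
    print("".join("x" if allowed else "." for allowed in row))
```

Each row is a query frame and each column a key frame, so the block-lower-triangular shape shows that frames inside one chunk see each other but never a later chunk.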

The chunk schedule (`chunk_partition`) is **not** baked into the model config. It is a per-call argument to
`forward`, so the same checkpoint handles different `num_frames` configurations without retraining.
The default chunk schedule (`chunk_partition`) is stored in the model config; the released NVIDIA AnyFlow-FAR
checkpoints use `[1, 3, 3, 3, 3, 3, 3, 2]` for the canonical 81-frame setting. `forward` accepts a per-call
`chunk_partition` override, so the same checkpoint also handles other `num_frames` configurations without
retraining.
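
To make the per-call override concrete, here is a small sketch of how a schedule for another latent length could be assembled. `build_chunk_partition` is a hypothetical helper, not part of diffusers, and the "first chunk, then equal chunks, then a remainder" layout is only an assumption that happens to reproduce the released schedule. Whether a given checkpoint performs well on an arbitrary schedule is not stated here.

```python
def build_chunk_partition(num_latent_frames, first_chunk=1, chunk_size=3):
    """Split latent frames into a first chunk, equal-size chunks, and a remainder.

    Hypothetical helper for illustration; the layout rule is an assumption.
    """
    if first_chunk < 1 or chunk_size < 1:
        raise ValueError("chunk sizes must be positive")
    if num_latent_frames < first_chunk:
        raise ValueError("num_latent_frames is smaller than first_chunk")
    partition = [first_chunk]
    remaining = num_latent_frames - first_chunk
    while remaining > 0:
        size = min(chunk_size, remaining)
        partition.append(size)
        remaining -= size
    return partition


print(build_chunk_partition(21))                 # [1, 3, 3, 3, 3, 3, 3, 2]
print(build_chunk_partition(21, first_chunk=3))  # [3, 3, 3, 3, 3, 3, 3]
```

The resulting list always sums to `num_latent_frames`, which is the property a per-call `chunk_partition` needs to satisfy.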

```python
from diffusers import AnyFlowFARTransformer3DModel
7 changes: 4 additions & 3 deletions docs/source/en/api/models/anyflow_transformer3d.md
@@ -16,10 +16,11 @@ The bidirectional 3D Transformer used by [`AnyFlowPipeline`](../pipelines/anyflo
v0.35.1 Wan2.1 backbone with one structural change: the timestep embedder is replaced by
``AnyFlowDualTimestepTextImageEmbedding``, so every forward call conditions on both the source timestep
``t`` and the target timestep ``r``. This is the embedding required to learn the flow map
:math:`\Phi_{r\leftarrow t}` introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS ShowLab × NVIDIA).
$\Phi_{r\leftarrow t}$ introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724). See the [`AnyFlowPipeline`](../pipelines/anyflow) page
for paper, authors, and released checkpoints.
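
As an illustration of what conditioning on both ``t`` and ``r`` means for sampling, the sketch below lists the (source, target) timestep pairs for an N-step rollout. It assumes noise at `t = 1`, data at `t = 0`, and a uniform grid; the actual scheduler may space timesteps differently, and `flow_map_steps` is a hypothetical helper rather than a diffusers API.

```python
def flow_map_steps(num_steps, t_start=1.0, t_end=0.0):
    """Return (t, r) pairs that walk from t_start to t_end in equal intervals.

    Assumption: a uniform grid; real schedulers may shift or warp the timesteps.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    grid = [t_start + (t_end - t_start) * i / num_steps for i in range(num_steps + 1)]
    return list(zip(grid[:-1], grid[1:]))


# Each pair is one forward call: the model maps the state at t to the state at r.
for t, r in flow_map_steps(4):
    print(f"t={t:.2f} -> r={r:.2f}")
```

With `num_steps=1` the single pair spans the whole interval, which is why one flow-map model can serve different step budgets.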

For frame-level autoregressive (FAR causal) generation, use
For chunk-wise autoregressive (FAR causal) generation, use
[`AnyFlowFARTransformer3DModel`](anyflow_far_transformer3d) instead.

```python
124 changes: 49 additions & 75 deletions docs/source/en/api/pipelines/anyflow.md
@@ -20,68 +20,28 @@ specific language governing permissions and limitations under the License.

# AnyFlow

[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA.
[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) from NVIDIA, National University of Singapore, and Massachusetts Institute of Technology, by Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou.

> **TL;DR:** AnyFlow is the first any-step video diffusion framework built on flow maps, which enables a single model (bidirectional or causal) to adapt to arbitrary inference budgets.

*Few-step video generation has been significantly advanced by consistency models. However, their performance often degrades in any-step video diffusion models due to the fixed-point formulation. To address this limitation, we present AnyFlow, the first any-step video diffusion distillation framework built on flow maps. Instead of learning only the mapping z_t → z_0, AnyFlow learns transitions z_t → z_r over arbitrary time intervals, enabling a single model to adapt to different inference budgets. We design an improved forward flow map training recipe that fine-tunes pretrained video diffusion models into flow map models, and introduce Flow Map Backward Simulation to enable on-policy distillation for flow map models. Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B, on text-to-video and image-to-video tasks demonstrate that AnyFlow outperforms consistency-based baselines while preserving high fidelity and flexible sampling under varying step budgets.*

The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow). The project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow).
The AnyFlow pipelines were contributed by the AnyFlow Team. The original code is available on [GitHub](https://github.com/NVlabs/AnyFlow), the project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow), and pretrained models can be found in the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow) collection on Hugging Face.

The following AnyFlow checkpoints are supported:
Available Models:

| Checkpoint | Backbone | Description |
|------------|----------|-------------|
| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V, lightweight |
| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V, full quality |
|---|---|---|
| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V |
| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V |
| [`nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers) | FAR + Wan2.1 1.3B | Causal T2V / I2V / V2V |
| [`nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers) | FAR + Wan2.1 14B | Causal T2V / I2V / V2V |

All four are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.

> [!TIP]
> Choose `AnyFlowPipeline` for traditional bidirectional text-to-video generation. Choose `AnyFlowFARPipeline` for streaming I2V, video continuation (V2V), or any setup that benefits from frame-by-frame autoregressive sampling.

> [!TIP]
> AnyFlow supports any-step sampling: a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without retraining. Quality scales monotonically with steps in our benchmarks.

### Optimizing Memory and Inference Speed

<hfoptions id="optimization">
<hfoption id="memory">

```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading

pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

</hfoption>
<hfoption id="inference speed">

```py
import torch
from diffusers import AnyFlowPipeline

pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```

</hfoption>
</hfoptions>
> `AnyFlowPipeline` is designed for bidirectional diffusion models in text-to-video (T2V) generation. `AnyFlowFARPipeline` is designed for chunk-wise causal diffusion models and supports text-to-video (T2V) generation, image-to-video (I2V) generation, and video continuation (V2V).

### Generation with AnyFlow (Bidirectional T2V)

<hfoptions id="anyflow-bidi">
<hfoption id="usage">

```py
import torch
from diffusers import AnyFlowPipeline
@@ -91,14 +51,16 @@ pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A red panda eating bamboo in a forest, cinematic lighting"
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
export_to_video(video, "out.mp4", fps=16)
prompt = (
"An astronaut runs smoothly and appears almost weightless on the lunar surface, "
"as seen from a low-angle shot that highlights the vast, desolate background of the moon. "
"The moon's craters and rocky terrain are clearly visible, creating a stark contrast against "
"the running astronaut who moves with graceful, fluid motions."
)
video = pipe(prompt, num_inference_steps=4, num_frames=81).frames[0]
export_to_video(video, "anyflow_t2v.mp4", fps=16)
```

</hfoption>
</hfoptions>

### Generation with AnyFlow (FAR Causal)

The causal pipeline selects between T2V / I2V / V2V via the ``video`` (or ``video_latents``) argument:
@@ -108,10 +70,10 @@ clip for V2V continuation. If you already have pre-encoded latents in the model
``video_latents=<tensor>`` to skip VAE encoding. ``video`` and ``video_latents`` are mutually exclusive.
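
The relationship between the context length and the task mode can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's logic: it assumes the mode follows from the number of raw context frames (none for T2V, one for I2V, more for V2V), that context lengths must have the form `4n + 1`, and that the VAE temporal stride is 4. `describe_context` is a hypothetical helper.

```python
def describe_context(num_context_frames=None):
    """Classify a call by its raw context frames and count the latent frames they occupy.

    Assumptions: temporal stride 4, and context lengths of the form 4n + 1.
    """
    if not num_context_frames:
        return "t2v", 0
    if (num_context_frames - 1) % 4 != 0:
        raise ValueError("context length must be 4n + 1 raw frames")
    latent_frames = (num_context_frames - 1) // 4 + 1
    mode = "i2v" if num_context_frames == 1 else "v2v"
    return mode, latent_frames


print(describe_context())   # ('t2v', 0)
print(describe_context(1))  # ('i2v', 1)
print(describe_context(9))  # ('v2v', 3)
```

The latent count matters because the first chunk of the schedule has to cover the context frames, for example a first chunk of 3 for a 9-frame clip.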

> [!IMPORTANT]
> `AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) is matched to the
> released checkpoints' canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
> you change `num_frames`, you must also pass a matching `chunk_partition` summing to
> `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises an `AssertionError`.
> The released checkpoints bake `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) into the transformer
> config, matched to the canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
> you change `num_frames`, pass a matching `chunk_partition` summing to `(num_frames - 1) // 4 + 1`,
> otherwise the pipeline raises a `ValueError`.
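
The sum rule can be checked before calling the pipeline. The sketch below mirrors the formula stated above; it is not the pipeline's own validation code, and `latent_frame_count` and `check_chunk_partition` are hypothetical helper names.

```python
def latent_frame_count(num_frames, temporal_stride=4):
    """Number of latent frames for a raw frame count at the given VAE temporal stride."""
    return (num_frames - 1) // temporal_stride + 1


def check_chunk_partition(num_frames, chunk_partition):
    """Raise ValueError unless chunk_partition sums to the latent frame count."""
    expected = latent_frame_count(num_frames)
    actual = sum(chunk_partition)
    if actual != expected:
        raise ValueError(
            f"chunk_partition sums to {actual}, but num_frames={num_frames} "
            f"needs {expected} latent frames"
        )
    return expected


print(check_chunk_partition(81, [1, 3, 3, 3, 3, 3, 3, 2]))  # 21
print(check_chunk_partition(33, [1, 4, 4]))                 # 9
```

Running such a check up front surfaces a mismatched schedule before any model weights are loaded.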

<hfoptions id="anyflow-far">
<hfoption id="t2v">
@@ -125,12 +87,12 @@ pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
prompt="A cat surfing a wave, sunset",
num_inference_steps=4,
num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
prompt = (
"An astronaut runs smoothly and appears almost weightless on the lunar surface, "
"as seen from a low-angle shot that highlights the vast, desolate background of the moon."
)
video = pipe(prompt, num_inference_steps=4, num_frames=81).frames[0]
export_to_video(video, "anyflow_far_t2v.mp4", fps=16)
```

</hfoption>
@@ -146,18 +108,25 @@ pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Wrap the conditioning image as a one-frame video tensor: (1, 1, 3, H, W) in [0, 1].
first_frame = load_image("path/to/first_frame.png").resize((832, 480))
# Example conditioning image from the AnyFlow repo.
first_frame = load_image(
"https://raw.githubusercontent.com/NVlabs/AnyFlow/main/assets/evaluation/example/images/1.jpg"
).resize((832, 480))
arr = np.asarray(first_frame).astype("float32") / 255.0 # (480, 832, 3)
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda") # (1, 1, 3, 480, 832)

prompt = (
"A towering, battle-scarred humanoid robot, reminiscent of a Transformer with powerful, segmented armor "
"and glowing red optics, walking through the skeletal remains of a city ruin. Twisted metal and shattered "
"concrete crunch under its heavy steps, as the robot scans the desolate, dust-choked skyline under a dark sky."
)
video = pipe(
prompt="a cat walks across a sunlit lawn",
prompt=prompt,
video=context_tensor,
num_inference_steps=4,
num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
export_to_video(video, "anyflow_far_i2v.mp4", fps=16)
```

</hfoption>
@@ -173,21 +142,26 @@ pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Context clip — 9 raw frames map to 3 latent frames (9 = 4·2 + 1, 3 = 2 + 1).
context_frames = load_video("path/to/context.mp4")[:9]
# Example conditioning clip from the AnyFlow repo — take the first 9 frames (3 latent frames at VAE temporal stride 4).
context_frames = load_video(
"https://raw.githubusercontent.com/NVlabs/AnyFlow/main/assets/evaluation/example/videos/2.mp4"
)[:9]
arr = np.stack([np.asarray(f.resize((832, 480))) for f in context_frames]).astype("float32") / 255.0
# np.stack gives (T, H, W, C) = (9, 480, 832, 3) → permute to (T, C, H, W) then add batch.
context_tensor = torch.from_numpy(arr).permute(0, 3, 1, 2).unsqueeze(0).to("cuda") # (1, 9, 3, 480, 832)

prompt = (
"A focused trail runner's powerful strides through a dense, sun-dappled forest. "
"The camera tracks alongside, highlighting muscular exertion, sweat, and determined facial expression."
)
video = pipe(
prompt="continue the story",
prompt=prompt,
video=context_tensor,
num_inference_steps=4,
num_frames=81,
# Override chunk_partition so the first chunk covers exactly the 3 latent context frames.
chunk_partition=[3, 3, 3, 3, 3, 3, 3],
).frames[0]
export_to_video(video, "out.mp4", fps=16)
export_to_video(video, "anyflow_far_v2v.mp4", fps=16)
```

</hfoption>
41 changes: 7 additions & 34 deletions docs/source/zh/using-diffusers/anyflow.md
@@ -22,7 +22,7 @@ increasing NFE often makes scores drop instead.
re-noising between sampling steps; the on-policy distillation stage additionally uses **DMD reverse-divergence supervision** + **Flow-Map backward simulation**
(a 3-segment shortcut) to close the exposure-bias gap left by consistency distillation.

AnyFlow was developed by Yuchao Gu, Guian Fang, and others at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA. The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow), the project page is [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow), and the 4 released checkpoints are grouped in the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
AnyFlow is a collaboration between NVIDIA, the National University of Singapore (NUS), and MIT, by Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, and Mike Zheng Shou. The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow), the project page is [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow), and the 4 released checkpoints are grouped in the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.

This guide covers the practical essentials: how to choose a pipeline, how to use any-step sampling, and how to integrate AnyFlow into T2V / I2V / V2V workflows.

@@ -100,7 +100,7 @@ prompt = "森林里一只小熊猫在啃竹子,电影感光照"
for nfe in [1, 2, 4, 8, 16, 32]:
    # Re-create the generator on every iteration so that NFE is the only variable when comparing step counts.
    generator = torch.Generator("cuda").manual_seed(0)
    video = pipe(prompt, num_inference_steps=nfe, num_frames=33, generator=generator).frames[0]
    video = pipe(prompt, num_inference_steps=nfe, num_frames=81, generator=generator).frames[0]
    export_to_video(video, f"out_nfe{nfe}.mp4", fps=16)
```

Expand All @@ -125,11 +125,11 @@ Causal pipeline 用同一个蒸馏模型支持三种任务模式,**通过 `vid
Context tensor 的帧数必须满足 `T = 4n + 1`,跟 VAE 时间步长对齐。

> [!IMPORTANT]
> FAR pipeline 是分块 (chunk) rollout,`num_frames` 必须配合 chunk 调度。默认
> `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]`(求和 21)对应发布 checkpoint 的标准 `num_frames=81`
> (21 = (81 − 1) // 4 + 1)。改 `num_frames` 时**必须**显式传匹配的 `chunk_partition`,使其求和等于
> `(num_frames - 1) // 4 + 1`,否则 pipeline 会抛 `AssertionError`。比如 `num_frames=33` 对应 9 个 latent
> 帧,可用 `chunk_partition=[1, 4, 4]`。
> FAR pipeline 是分块 (chunk) rollout,`num_frames` 必须配合 chunk 调度。发布的 checkpoint 在
> transformer config 里写入 `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]`(求和 21),对应标准
> `num_frames=81`(21 = (81 − 1) // 4 + 1)。改 `num_frames` 时**必须**显式传匹配的 `chunk_partition`,
> 使其求和等于 `(num_frames - 1) // 4 + 1`,否则 pipeline 会抛 `ValueError`。比如 `num_frames=33` 对应
> 9 个 latent 帧,可用 `chunk_partition=[1, 4, 4]`。

```py
import numpy as np
@@ -183,33 +183,6 @@ export_to_video(video, "v2v.mp4", fps=16)
If you already have VAE-encoded latents, you can pass `video_latents=<tensor>` directly to skip the `vae_encode` step
(mutually exclusive with `video`).

## Memory and Inference Speed

With group offloading + VAE slicing, the 14B AnyFlow model runs on a single 40 GB GPU:

```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading

pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

For latency, `torch.compile` works well on the transformer (the heaviest module):

```py
pipe = pipe.to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```

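The compilation overhead is amortized after a few steps; combined with AnyFlow's low NFE (4-8 steps), `torch.compile` gives a
noticeable speedup over eager mode on the 14B model.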

## LoRA Fine-Tuning

Both pipelines reuse [`WanLoraLoaderMixin`](../api/loaders/lora), so LoRAs trained for the corresponding Wan2.1 backbone
12 changes: 10 additions & 2 deletions scripts/convert_anyflow_to_diffusers.py
@@ -57,13 +57,21 @@
"AnyFlow-FAR-Wan2.1-1.3B-Diffusers": {
"base_model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
"transformer_cls": AnyFlowFARTransformer3DModel,
"transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
"transformer_kwargs": {
"full_chunk_limit": 3,
"compressed_patch_size": [1, 4, 4],
"chunk_partition": [1, 3, 3, 3, 3, 3, 3, 2],
},
"pipeline_cls": AnyFlowFARPipeline,
},
"AnyFlow-FAR-Wan2.1-14B-Diffusers": {
"base_model": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
"transformer_cls": AnyFlowFARTransformer3DModel,
"transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
"transformer_kwargs": {
"full_chunk_limit": 3,
"compressed_patch_size": [1, 4, 4],
"chunk_partition": [1, 3, 3, 3, 3, 3, 3, 2],
},
"pipeline_cls": AnyFlowFARPipeline,
},
"AnyFlow-Wan2.1-T2V-1.3B-Diffusers": {