Add AnyFlow Any-Step Video Diffusion Pipelines (Bidirectional + FAR Causal) by Enderfga · Pull Request #13745 · huggingface/diffusers

Enderfga · 2026-05-14T03:39:15Z

What does this PR do?

This PR adds pipelines for AnyFlow (paper, project page, official code, model weights), an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16, 32 NFE without retraining, and quality scales monotonically with steps — unlike consistency-based distillation, which often degrades as NFE grows.

Two new pipelines are added, both on top of a new FlowMapEulerDiscreteScheduler and reusing WanLoraLoaderMixin:

AnyFlowPipeline → AnyFlowTransformer3DModel: bidirectional text-to-video built on the Wan2.1 backbone with an AnyFlowDualTimestepTextImageEmbedding conditioning on the source/target timestep pair (t, r).
AnyFlowFARPipeline → AnyFlowFARTransformer3DModel: frame-level autoregressive variant (block-sparse causal flex_attention + KV cache + compressed-frame patch embedding) jointly handling T2V / I2V / V2V through one context_sequence argument.

Four checkpoints are released under the nvidia/anyflow collection (Wan2.1-T2V-{1.3B,14B} bidi + FAR-Wan2.1-{1.3B,14B} causal). All four have been validated bit-exact against the official NVlabs/AnyFlow reference on H200: forward L2 = 0.00e+00 for scheduler / transformer / bidi pipeline / FAR pipeline; backward grad delta is 4.88e-04, attributable to bf16 kernel non-determinism only (PR-vs-PR = PR-vs-reference, ratio 1.000); inference latency matches the reference at ±0.0% on both pipelines.

T2V inference example:

import torch
from diffusers import AnyFlowPipeline
from diffusers.utils import export_to_video

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A red panda eating bamboo in a forest, cinematic lighting"
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
export_to_video(video, "anyflow_t2v.mp4", fps=16)

I2V inference example with the FAR pipeline (single conditioning frame → autoregressive rollout):

import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_image

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

first_frame = load_image("path/to/first_frame.png").resize((832, 480))
arr = np.asarray(first_frame).astype("float32") / 255.0
context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(2).to("cuda")

video = pipe(
    prompt="a cat walks across a sunlit lawn",
    context_sequence={"raw": context},
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "anyflow_i2v.mp4", fps=16)

Documentation: EN tutorial at docs/source/en/using-diffusers/anyflow.md, ZH tutorial at docs/source/zh/using-diffusers/anyflow.md, and three API pages (pipelines + two transformer model pages). Tests: 22 fast tests (transformer + scheduler, CPU) plus four pipeline test files, with slow integration tests gated on RUN_SLOW=1 @require_torch_accelerator for the released checkpoints.

anyflow-pr-presentation.mp4

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@yiyixuxu @asomoza

…vel imports

This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py, pipeline_anyflow_causal.py, transformer_anyflow.py, scheduling_flow_map_euler_discrete.py) come in subsequent commits.

The flow-map scheduler advances samples from timestep t to a caller-provided target r in a single Euler step, supporting any-step sampling on flow-map-distilled checkpoints. It is a general-purpose scheduler, not specific to the AnyFlow checkpoints.

Tests: 12 standalone tests covering instantiation, set_timesteps endpoints, shift identity/monotonicity, step shape preservation, zero-interval identity, one-shot sampling, train weight schemes, scale_noise endpoints.

Docs: api/schedulers/flow_map_euler_discrete.md
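The single-step update described in this commit can be sketched in a few lines. This is an illustration only, not the PR's `FlowMapEulerDiscreteScheduler`: it assumes the rectified-flow convention (the sample moves linearly in sigma along a predicted velocity), `flow_map_euler_step` and `shift_sigma` are hypothetical helper names, and the shift formula is the one used by diffusers' existing flow-matching schedulers, assumed here to carry over.

```python
from typing import List


def shift_sigma(sigma: float, shift: float) -> float:
    """Flow-matching time shift (assumed form); shift=1.0 is the identity."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def flow_map_euler_step(
    sample: List[float], velocity: List[float], sigma_t: float, sigma_r: float
) -> List[float]:
    """Jump from noise level sigma_t to sigma_r in a single Euler step."""
    dt = sigma_r - sigma_t
    return [x + dt * v for x, v in zip(sample, velocity)]


# Toy setup: x_t = (1 - sigma) * x0 + sigma * noise, whose exact velocity is noise - x0.
x0, noise = [0.5, -1.0], [1.5, 2.0]
sigma_t = 0.8
x_t = [(1 - sigma_t) * a + sigma_t * n for a, n in zip(x0, noise)]
velocity = [n - a for a, n in zip(x0, noise)]

one_shot = flow_map_euler_step(x_t, velocity, sigma_t, 0.0)  # t -> 0 in one step
unchanged = flow_map_euler_step(x_t, velocity, sigma_t, sigma_t)  # zero-length interval
```

With an exact velocity, the one-shot step recovers `x0` up to floating-point error, and a zero-length interval leaves the sample untouched, which mirrors the "one-shot sampling" and "zero-interval identity" properties the tests are said to cover.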

A 3D DiT extending the v0.35.1 Wan2.1 backbone with two config-toggled modules:
* FAR causal blocks (init_far_model=True): block-sparse causal attention via flex_attention + compressed-frame patch embedding for frame-level autoregressive generation (Gu et al., 2025, arXiv:2503.19325).
* Dual-timestep flow-map embedding (init_flowmap_model=True): adds a delta timestep embedder enabling flow-map sampling z_t -> z_r over arbitrary intervals (AnyFlow).

With both flags off, the model reduces to stock Wan2.1.

The class is intentionally self-contained rather than annotated with '# Copied from diffusers.models.transformers.transformer_wan' because upstream Wan has been refactored extensively since v0.35.1 (new WanAttention class, different processor architecture).

Tests: 9 unit tests covering construction in 3 modes, bidi forward shape and determinism, return_dict variants, save/load round-trip with and without init_far_model, gradient checkpointing toggle.

Docs: api/models/anyflow_transformer3d.md

* AnyFlowPipeline (pipeline_anyflow.py, ~590 LOC): bidirectional T2V using flow-map sampling. Loads checkpoints from nvidia/AnyFlow-Wan2.1-T2V-{1.3B,14B}.
* AnyFlowCausalPipeline (pipeline_anyflow_causal.py, ~700 LOC): FAR-based causal pipeline supporting T2V/I2V/TV2V via task_type kwarg. Loads checkpoints from nvidia/AnyFlow-FAR-Wan2.1-{1.3B,14B}-Diffusers.

Both pipelines reuse stock WanLoraLoaderMixin, AutoencoderKLWan, UMT5EncoderModel, and AutoTokenizer from upstream. The transformer is the AnyFlowTransformer3DModel introduced in the previous commit. The scheduler is FlowMapEulerDiscreteScheduler.

Tests:
* tests/pipelines/anyflow/test_anyflow.py: PipelineTesterMixin fast tests + slow integration test against nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers.
* tests/pipelines/anyflow/test_anyflow_causal.py: same structure for FAR variant.

Reference slices for slow integration tests are deferred to Phase 7 (Final quality pass) where the user runs them on a real GPU.

Modeled on the Helios pipeline doc (PR huggingface#13208). Sections: paper link + abstract, supported checkpoints table, memory/speed optimization tabs, T2V/I2V/TV2V examples for both bidirectional and causal variants, autodoc trailers.

…ersion script

* Register AnyFlowPipeline in AUTO_TEXT2VIDEO_PIPELINES_MAPPING.
* AnyFlowCausalPipeline is intentionally NOT registered for AutoPipeline because its task switch (t2v / i2v / tv2v) is too rich for a single auto-resolve key.
* scripts/convert_anyflow_to_diffusers.py: convert .pt training checkpoints (with 'ema' state dict) into a diffusers save_pretrained layout. Supports all 4 released NVIDIA AnyFlow variants. Replaces the omegaconf-based config in the upstream repo with argparse to match other diffusers conversion scripts.

* ruff format pass on all 5 source files (long lines + trailing comma fixes)
* check_dummies.py --fix_and_overwrite regenerated:
  - dummy_pt_objects.py: AnyFlowTransformer3DModel + FlowMapEulerDiscreteScheduler
  - dummy_torch_and_transformers_objects.py: AnyFlowPipeline + AnyFlowCausalPipeline

Local fast tests: 21/21 passed
  - 12 scheduler tests (FlowMapEulerDiscreteScheduler)
  - 9 transformer tests (AnyFlowTransformer3DModel construction + bidi forward + save/load)

The pipeline fast tests in tests/pipelines/anyflow/ require a local dev install that matches the diffusers main branch's transformers >= compatibility floor. The reference slices for slow integration tests (real GPU + 1.3B/14B checkpoints) are intentionally left as TODO stubs to be captured by the user on a real GPU machine before opening the PR.

…torials

Critical bug fixes (verified against precision-validation review):
* pipeline_anyflow.py / pipeline_anyflow_causal.py: replace hardcoded transformer_dtype = torch.bfloat16 with self.transformer.dtype, so pipe.to("cpu") and PipelineTesterMixin save/load tests do not crash on a dtype mismatch in the patch_embedding conv3d.
* transformer_anyflow.py: drop the duplicate `base = base = ...` assignment in _build_causal_mask (was a copy-paste typo carried over from FAR-Dev).
* transformer_anyflow.py: drop unused `q_is_context` / `k_is_context` locals and the `# noqa: F841` markers that were silencing the dead-store warning.
* transformer_anyflow.py: remove `CacheMixin` from the inheritance list; the pipeline manages KV cache directly, so the mixin's interface is unused.
* transformer_anyflow.py: guard the module-level `torch.compile(flex_attention)` with try/except so the file imports cleanly on CPU CI / no-Triton machines.
* convert_anyflow_to_diffusers.py: replace ad-hoc print warnings with the stdlib logger (warning_once-style) and a module-level basicConfig.

Documentation accuracy:
* AnyFlowCausalPipeline class docstring + main pipeline doc + EN/ZH tutorial: drop the fictitious `task_type` / `image` / `video` arguments and document the real API: pass `context_sequence={"raw": tensor}` (or `{"latent": ...}`) to switch between T2V (None) / I2V (1-frame) / TV2V (4n+1-frame) modes.
* Pipeline class docstrings + main doc: explicitly describe AnyFlow's two-stage LoRA distillation including DMD reverse-divergence supervision with Flow-Map backward simulation in stage 2 (was previously implicit).
* training_rollout: add detailed docstring explaining its role as the 3-segment Flow-Map backward simulation entry point used during DMD training.
* Long-form tutorial doc `using-diffusers/anyflow.md` (EN, 239 LOC) and Chinese mirror `docs/source/zh/using-diffusers/anyflow.md` (224 LOC) added and registered in both `_toctree.yml` files.

Tests:
* Skip `test_attention_slicing_forward_pass` in both pipeline test classes with a clear rationale (custom attention processor does not support slicing).
* All 21 standalone tests still pass (12 scheduler + 9 transformer).

Quality gates:
* `ruff check` clean across all AnyFlow files.
* `ruff format --check` reports 6 files already formatted.
* `python utils/check_copies.py` reports no diff.

Out of scope for this commit (deferred until reviewer feedback):
* Splitting AnyFlowTransformer3DModel into bidi + causal subclasses
* Unifying _forward_inference / _forward_cache return types
* Migrating model tests from plain unittest to BaseModelTesterConfig + mixins
* HF model card / config.json metadata updates on the nvidia/* repos (push to Hub manually before opening the PR)

… output

Round 2 of review feedback. Three groups of changes; transformer state-dict keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact validation remains valid.

A. Pipeline rename (mechanical, no behavior change):
* Class: AnyFlowCausalPipeline -> AnyFlowFARPipeline (Causal in diffusers usually means an attention mask; AnyFlow's variant is FAR autoregressive, so the FAR name is more specific and matches the paper).
* File: pipeline_anyflow_causal.py -> pipeline_anyflow_far.py (git mv).
* Test file: test_anyflow_causal.py -> test_anyflow_far.py (git mv).
* All references updated in src/, tests/, docs/, scripts/, plus stale anyflowcausalpipeline anchor links in tutorial markdown.

B. Pipeline test bug fixes (closes 19 fast-test failures reported by precision-validation reviewer):
* pipeline_anyflow.py / pipeline_anyflow_far.py: __call__ now sets self._num_timesteps = num_inference_steps before the rollout, so the PipelineTesterMixin callback tests can read pipe.num_timesteps.
* tests/pipelines/anyflow/test_anyflow_far.py: drop the fictitious task_type="t2v" kwarg that crashed every causal fast test (the FAR pipeline selects mode via context_sequence, not a task_type arg).

C. Transformer architecture cleanups (review-driven, no tensor changes):
* Replace forward(*args, **kwargs) dispatcher with an explicit signature listing every supported kwarg (hidden_states, timestep, r_timestep, encoder_hidden_states, encoder_hidden_states_image, chunk_partition, clean_hidden_states, clean_timestep, kv_cache, kv_cache_flag, is_causal, attention_kwargs, return_dict). Helps IDE / type-checker / torch.compile tracing.
* Drop SimpleNamespace returns. Add AnyFlowFARTransformerOutput (BaseOutput dataclass with sample + kv_cache fields) for the two causal paths that need to also propagate kv_cache (_forward_inference and the newly return_dict-aware _forward_cache). _forward_train and _forward_bidirection now consistently return Transformer2DModelOutput. Pipeline call sites already use return_dict=False with positional unpacking, so the fix is transparent there.

Out of scope (deferred until canonical-org HF metadata sync):
* Splitting AnyFlowTransformer3DModel into a bidi class plus an AnyFlowFARTransformer3DModel subclass: touches register_to_config keys and would require updating model_index.json on every released checkpoint.
* Promoting chunk_partition from register_to_config to a forward-time argument (same reason).
* Renaming training_rollout to _denoise: would break callers in the FAR-Dev on-policy trainer that produced the released checkpoints.

Local fast tests: 21/21 still pass (12 scheduler + 9 transformer). ruff check, ruff format, and check_copies.py are all clean.

…nk_partition to FAR fast-test fixture

Two root causes for the 19 remaining PipelineTesterMixin failures, identified by the H200 reviewer:

1. callback_on_step_end was accepted by __call__ but never invoked. Both pipelines pass it through to training_rollout (and FAR additionally through inference()), and inference_range now fires it after scheduler.step in the standard inference branch:

       if callback_on_step_end is not None:
           callback_kwargs = {k: locals()[k] for k in callback_on_step_end_tensor_inputs}
           callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
           latents = callback_outputs.pop("latents", latents)
           prompt_embeds = ...
           negative_prompt_embeds = ...

   `nonlocal prompt_embeds, negative_prompt_embeds` lets the callback rewrite the closure-captured embeddings, matching upstream WanPipeline semantics. The 3-segment grad_timestep training rollout does not invoke the callback; it is intentionally training-only.

2. tests/pipelines/anyflow/test_anyflow_far.py::get_dummy_components built the dummy transformer without a `chunk_partition`, leaving it None on the model config and crashing the pipeline at `sum(self.transformer.config.chunk_partition)`. Set `chunk_partition=[1, 1, 1]` in the fixture (3 chunks of 1 latent frame each, matching the test's num_frames=9 -> 3 latent frames).

Local fast tests: 21/21 still pass. ruff check, ruff format, and check_copies.py are all clean.
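From the caller's side, the callback contract in item 1 looks roughly as follows. This is a dependency-free sketch: `run_denoise_loop` is a hypothetical stand-in for the pipeline's inner loop (latents are a single float and `scheduler.step` is replaced by a placeholder), and only the callback signature `(pipe, step_index, timestep, callback_kwargs) -> dict` follows the diffusers convention quoted in the commit message.

```python
from typing import Any, Callable, Dict, List, Optional

StepCallback = Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]]


def run_denoise_loop(
    latents: float,
    timesteps: List[int],
    callback_on_step_end: Optional[StepCallback] = None,
    callback_on_step_end_tensor_inputs: Optional[List[str]] = None,
) -> float:
    """Hypothetical stand-in for a pipeline's denoising loop."""
    tensor_inputs = callback_on_step_end_tensor_inputs or ["latents"]
    for i, t in enumerate(timesteps):
        latents = latents * 0.5  # placeholder for scheduler.step(...)
        if callback_on_step_end is not None:
            available = {"latents": latents}
            callback_kwargs = {k: available[k] for k in tensor_inputs}
            callback_outputs = callback_on_step_end(None, i, t, callback_kwargs)
            # The callback may return a modified value; otherwise keep the current one.
            latents = callback_outputs.pop("latents", latents)
    return latents


seen_steps: List[int] = []


def log_and_zero_last(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record every step and overwrite the latents on the final step."""
    seen_steps.append(step)
    if step == 3:  # last step of the 4-step run below
        callback_kwargs["latents"] = 0.0
    return callback_kwargs


result = run_denoise_loop(8.0, [999, 750, 500, 250], callback_on_step_end=log_and_zero_last)
```

If the loop accepted the callback but never called it, `seen_steps` would stay empty and `result` would be the un-zeroed value, which is the kind of discrepancy a callback test can detect.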

…ig + rename helpers

Major architectural refactor that aligns the integration with diffusers conventions ahead of the canonical-org Hub upload. State-dict keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact validation remains valid; only the on-disk transformer/config.json fields move.

Changes:

1. **Sibling transformer classes** replace the flag-driven single class:
   * AnyFlowTransformer3DModel: bidirectional only. Drops compressed_patch_size / full_chunk_limit / init_far_model / init_flowmap_model / chunk_partition kwargs (always-on for AnyFlow distilled checkpoints).
   * AnyFlowFARTransformer3DModel: adds far_patch_embedding + the 3 FAR forward paths (train / cache-prefill / autoregressive inference).
   * AnyFlowTimeTextImageEmbedding (the legacy single-time embedder used only by the old setup_flowmap_model bootstrap) is removed; both classes now build AnyFlowDualTimestepTextImageEmbedding directly in __init__.
   * setup_flowmap_model / setup_far_model methods are removed; weight warm-start for far_patch_embedding (trilinear interpolation from patch_embedding) moves into AnyFlowFARTransformer3DModel.__init__.

2. **chunk_partition** is no longer a model config field. The FAR pipeline owns the schedule:
   * AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2] matches the released 81-frame NVIDIA checkpoints.
   * AnyFlowFARPipeline.__call__ / _denoise_rollout accept a chunk_partition argument that overrides the default for non-default num_frames.

3. **training_rollout -> _denoise_rollout** rename across both pipelines and all English / Chinese docs that referenced it. Signals the method is internal to the pipeline driver, not a public training API.

4. **Conversion script + tests + docs + registries**:
   * scripts/convert_anyflow_to_diffusers.py: VARIANTS dict picks the right transformer class per variant; init_far_model / init_flowmap_model / chunk_partition kwargs are removed from the from_pretrained call.
   * Transformer test file split into AnyFlowTransformer3DModelTest and AnyFlowFARTransformer3DModelTest classes.
   * Pipeline test fixtures use the right class and pass chunk_partition via get_dummy_inputs (3-frame schedule [1, 1, 1] for the 9-frame test).
   * New docs page docs/source/en/api/models/anyflow_far_transformer3d.md; anyflow_transformer3d.md rewritten for the bidi-only class.
   * AnyFlowFARTransformer3DModel registered in src/diffusers/__init__.py, src/diffusers/models/__init__.py, models/transformers/__init__.py and the dummy_pt_objects.py stubs.
   * docs/source/en/_toctree.yml: new entry for the FAR transformer page.

5. **Cleanups**:
   * Pipeline __call__ no longer passes is_causal=False to the bidi forward (the bidi class doesn't accept it).
   * Pipeline class docstrings drop stale references to init_*_model flags.

Local tests: 22/22 pass (12 scheduler + 10 transformer covering both classes). ruff check / format / check_copies clean.

Hub artifacts (model_index.json, transformer/config.json, scheduler config) need to be regenerated for the released checkpoints; the HF update guide will be delivered separately.
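The relationship between `num_frames` and a chunk partition can be checked with simple arithmetic. The sketch below assumes a temporal compression factor of 4 in the VAE, which is consistent with the two data points in these commit messages (81 frames with the default partition, 9 frames with `[1, 1, 1]`) but is not stated explicitly; `num_latent_frames` and `check_chunk_partition` are hypothetical helpers, not functions from the PR.

```python
from typing import List

# Default schedule quoted in the commit message above.
DEFAULT_CHUNK_PARTITION = [1, 3, 3, 3, 3, 3, 3, 2]


def num_latent_frames(num_frames: int, temporal_compression: int = 4) -> int:
    """Latent frame count for a 4n+1-frame video (compression factor is an assumption)."""
    if (num_frames - 1) % temporal_compression != 0:
        raise ValueError(f"num_frames must be of the form {temporal_compression}n + 1, got {num_frames}")
    return (num_frames - 1) // temporal_compression + 1


def check_chunk_partition(chunk_partition: List[int], num_frames: int) -> None:
    """Raise if the chunks do not cover exactly the latent frames of the video."""
    expected = num_latent_frames(num_frames)
    if sum(chunk_partition) != expected:
        raise ValueError(
            f"chunk_partition covers {sum(chunk_partition)} latent frames, "
            f"but num_frames={num_frames} needs {expected}"
        )


check_chunk_partition(DEFAULT_CHUNK_PARTITION, num_frames=81)  # 21 latent frames
check_chunk_partition([1, 1, 1], num_frames=9)  # 3 latent frames
```

Under this assumption, a caller who changes `num_frames` has to supply a partition whose entries sum to the new latent frame count, which is presumably why the `chunk_partition` override exists.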

…models.md

Hard violations (per official diffusers guidelines):
* drop einops dependency: replace 25+ rearrange() calls with native permute/reshape/unflatten in transformer + both pipelines
* device-gate torch.float64: apply_rotary_emb and AnyFlowRotaryPosEmbed now fall back to float32 / complex64 on MPS / NPU; freqs are lazily rebuilt per-device via _build_freqs (matches transformer_wan / transformer_flux pattern)
* migrate attention to dispatch_attention_fn: replace direct F.scaled_dot_product_attention calls with dispatch_attention_fn (works with sage / flash / native backends); introduce AnyFlowAttention(AttentionModuleMixin) with _default_processor_cls / _available_processors; rename processors to AnyFlowAttnProcessor / AnyFlowCrossAttnProcessor and declare _attention_backend / _parallel_config class attrs
* drop dead config fields: qk_norm and added_kv_proj_dim are pruned from both transformer __init__ signatures and AnyFlowTransformerBlock; AnyFlowAttention is hardcoded to rms-norm-across-heads (the only scheme the released checkpoints use) and has no add_k_proj path (T2V only)
* add _repeated_blocks = ["AnyFlowTransformerBlock"] to both transformer classes for compile_repeated_blocks() support (matches Wan)
* annotate prepare_latents with `# Copied from diffusers.pipelines.wan.pipeline_wan.WanPipeline.prepare_latents`; the pipeline-side rearrange to (B, T, C, H, W) layout is moved to the call site

State-dict keys are preserved (legacy Attention had identical to_q / to_k / to_v / to_out / norm_q / norm_k naming), so existing AnyFlow checkpoints load bit-exactly into the new AnyFlowAttention class.

The HF Hub config-update guide is updated correspondingly: transformer/config.json now drops qk_norm and added_kv_proj_dim alongside the previous init_far_model / init_flowmap_model / chunk_partition removals.

22 fast CPU tests still pass; ruff format / ruff check / check_copies all clean.

…/head-dim fallbacks + KV-cache dtype + num_timesteps

Phase 3 migrated bidi + cross-attention to dispatch_attention_fn but the FAR causal path still calls flex_attention directly, which has hard requirements (CPU compile, head_dim >= 16) that fail on PipelineTesterMixin's tiny dummy components. Real ckpts (head_dim=128, CUDA) never hit these branches; bit-exact numerical equivalence with FAR-Dev preserved on all 4 released ckpts (forward 0.00e+00, backward kernel-nondet only, ratio 1.000).

Code fixes:

1. AnyFlowRotaryPosEmbed._forward_compressed_frame / _forward_full_frame now short-circuit to an empty tensor when num_frames / height / width is 0. PipelineTesterMixin's dummy VAE has scale_factor_spatial=8, so a 16x16 raw spatial input becomes a 2x2 latent which then floors to 0 against compressed_patch_size=(1, 4, 4); the original `freqs[:0].view(0, k, 1, -1)` reshape was ambiguous in that regime.

2. flex_attention dispatch: split the module-load `torch.compile(flex_attention, dynamic=True)` into `_flex_attention_eager` (always available) plus `_flex_attention_compiled`, with a tiny wrapper that picks compiled for CUDA tensors and eager for CPU. Avoids torch._inductor C++ codegen failures that broke fast tests after `pipe.to("cpu")`. CUDA performance unchanged (L10 benchmark: 0.0% delta on bidi 1.3B fwd, 0.0% delta on FAR causal 1.3B fwd).

3. AnyFlowAttnProcessor (FAR causal branch): when head_dim < 16 (flex_attention's hard minimum) zero-pad q/k/v's last dim to 16 and pass `scale=1/sqrt(original_head_dim)` to flex_attention. Padded value rows contribute 0, so trimming the output back is mathematically equivalent. Released ckpts use head_dim=128 so the branch is never taken in production.

4. pipeline_anyflow_far.encode_kv_cache: replace the hardcoded `latents.to(torch.bfloat16)` with `self.transformer.dtype`. The hardcoded bf16 crashed conv3d on dummy fp32 components ("Input type (BFloat16) and bias type (float) should be the same"); real bf16 ckpts are unaffected.

5. pipeline_anyflow_far._denoise_rollout sets `self._num_timesteps = (len(chunk_partition) - num_context_chunks) * num_inference_steps` before the chunk loop, so PipelineTesterMixin.test_callback_cfg's `pipe.num_timesteps`-based assertion matches the actual number of callback fires (chunks * NFE) instead of the previous hardcoded num_inference_steps.

Tests:
* test_callback_inputs cannot pass without changing FAR's chunk-wise output semantics: it zeroes latents on the final step and asserts the *entire* output buffer is zero, but only the active chunk's slice is overwritten in a chunk-wise rollout. Marked `@unittest.skip` with a detailed rationale; callback functionality itself is still covered by test_callback_cfg.
* Full pytest run on tests/pipelines/anyflow/ + tests/models/transformers/test_models_transformer_anyflow.py + tests/schedulers/test_scheduler_flow_map_euler_discrete.py: 81 passed, 0 failed, 11 skipped.

Quality gates:
* `ruff check` and `ruff format --check` clean across all AnyFlow files.
* `python utils/check_copies.py` clean.
* `python utils/check_dummies.py` clean.
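The padding argument in code fix 3 can be checked numerically without flex_attention. The sketch below is a plain-Python softmax attention on nested lists, not the PR's processor; `attention` and `pad_last_dim` are hypothetical helpers, and the sizes (head_dim 4 padded to 16) are chosen only for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(q: Matrix, k: Matrix, v: Matrix, scale: float) -> Matrix:
    """softmax(q @ k^T * scale) @ v, computed row by row."""
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) / total for j in range(len(v[0]))])
    return out


def pad_last_dim(x: Matrix, target: int) -> Matrix:
    """Zero-pad every row on the right up to `target` entries."""
    return [row + [0.0] * (target - len(row)) for row in x]


head_dim, padded_dim = 4, 16
q = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.0, -0.1, 0.2]]
k = [[0.3, 0.1, -0.4, 0.2], [-0.2, 0.6, 0.1, 0.0], [0.0, 0.2, 0.2, -0.3]]
v = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0], [-2.0, 0.0, 1.0, 1.0]]
scale = 1.0 / math.sqrt(head_dim)  # scale uses the ORIGINAL head_dim, not the padded one

reference = attention(q, k, v, scale)
padded = attention(pad_last_dim(q, padded_dim), pad_last_dim(k, padded_dim), pad_last_dim(v, padded_dim), scale)
trimmed = [row[:head_dim] for row in padded]
```

Zero padding adds only zero terms to each q·k dot product, so the attention weights are unchanged as long as the scale is computed from the original head dimension; the extra output channels are all zero and can be sliced off.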

User-facing alignment with the official HF Hub model card and the day-of-announcement materials at https://huggingface.co/collections/nvidia/anyflow.

* Fill in the arXiv identifier 2605.13724 (5 paper links + 2 BibTeX entries).
* Rename TV2V → V2V across docs + pipeline_anyflow{,_far}.py so the diffusers copy uses the same Video-to-Video terminology as the official model card.
* Add the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow) HF collection link to the three tutorial intros.
* Drop the temporary "guyuchao/* staging" tip from the EN tutorial / API page / ZH tutorial; the nvidia/AnyFlow-*-Diffusers repos are now live.
* Wire up NVlabs/AnyFlow (training code) and nvlabs.github.io/AnyFlow (project page) in place of the prior <github-org> / <project-page-url> placeholders.
* Cite the authors (Yuchao Gu, Guian Fang et al.) and NUS ShowLab × NVIDIA affiliation in the main tutorial, API pipeline page, and both transformer model pages; BibTeX uses the standard `and others` to elide the full list until the next pass.

Working tree, CI gates, and tests after the change:
  ruff format --check ✓
  ruff check ✓
  python utils/check_copies.py ✓
  python utils/check_dummies.py ✓
  pytest tests/models + tests/schedulers (22 fast) ✓

No production code logic changes; only docstring wording inside pipeline files (TV2V → V2V).

Replace the placeholder ``@article{gu2026anyflow, author = {Gu, Yuchao and Fang, Guian and others}, ...}`` block in both the English and Chinese tutorials with the canonical ``@misc{gu2026anyflowanystepvideodiffusion, ...}`` form from arxiv.org/abs/2605.13724, which lists all seven authors: Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou. Docs-only.

Scheduler
- FlowMapEulerDiscreteScheduler.step now returns a FlowMapEulerDiscreteSchedulerOutput dataclass (or tuple with return_dict=False) and uses the conventional positional order (model_output, timestep, sample, r_timestep).
- Drop training-only helpers: adaptive_weighting, set_train_weight, get_train_weight, linear_timesteps_weights, and the weight_type config field.
- Add scale_model_input no-op for API parity; raise ValueError on missing r_timestep.

Transformer
- Remove gate_track debug write inside AnyFlowDualTimestepTextImageEmbedding.forward_timestep.
- Compile flex_attention lazily on first CUDA call instead of at import time.
- Replace assert with ValueError in build_block_mask.
- Resolve <arxiv-id> placeholders to 2605.13724.

Pipelines (AnyFlowPipeline + AnyFlowFARPipeline)
- Add EXAMPLE_DOC_STRING + @replace_example_docstring and full __call__ docstrings covering every argument.
- Move use_mean_velocity from __init__ to __call__ so save/load round-trips.
- Drop _denoise_rollout's grad_timestep branch (DMD on-policy training rollout), the inner inference_range closure, and the redundant negative-prompt concat.
- Replace asserts with ValueError; wire show_progress to tqdm; rename inference -> _inference; remove dead current_timestep property.
- Update scheduler.step call sites to the new signature.
- Trim class docstrings to inference-only language.

Pipeline output
- Add Apache 2.0 license header; switch to relative import.

Auto pipeline / conversion script
- Register AnyFlowFARPipeline in AUTO_IMAGE2VIDEO_PIPELINES_MAPPING and AUTO_VIDEO2VIDEO_PIPELINES_MAPPING.
- Document the weights_only=False requirement in the conversion script.

Tests
- Scheduler tests use the new step signature and verify the Output dataclass contract.
- Drop the four obsolete training-weight tests; drop weight_type kwarg from pipeline test fixtures; remove internal milestone names from TODO comments.

Docs
- Resolve <arxiv-id> in the scheduler docs page.
- Trim DMD / on-policy distillation language in EN/ZH tutorials and the pipelines page; the paper abstract quote is preserved verbatim.

# Conflicts:
#	docs/source/en/_toctree.yml

Enderfga · 2026-05-21T01:46:06Z

Hi @dg845 @sayakpaul — checking in on this one. Quick status:

Full review pass from 5/19 addressed across 4 commits (transformer split, pipeline cleanup, scheduler refactor, regenerated tests). Bit-exact replay (L2 ≈ 0) verified against NVlabs/AnyFlow on H200 for both the bidi and FAR pipelines.
Hub-metadata mismatch reported 5/20 is acknowledged — workaround script (gist) shared so the checkpoints are usable in the interim, and I'll update the nvidia/AnyFlow-*-Diffusers repos to the diffusers class names as soon as this merges (kept it as a follow-up so the metadata move and the code land in one direction).
CI is green, merge conflict with main is resolved, # Copied from / make fix-copies is clean.

Is there anything else you'd like changed before approval, or are we good to merge? Happy to address any remaining items.

Thanks again for the thorough review.

HuggingFaceDocBuilderDev · 2026-05-21T02:01:05Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…n scheduler tests with N-length timesteps

- TestAnyFlowFARTransformer3DTraining: skip test_training / test_training_with_ema / test_gradient_checkpointing_equivalence on CPU. FAR causal self-attn uses torch.nn.attention.flex_attention whose backward kernel is GPU-only.
- test_scheduler_flow_map_euler_discrete: assert timesteps is N-length (not N+1) and the sigma=0 r-endpoint lives in self.sigmas[-1]; test_step_one_shot_sampling now exercises r_timestep=None (resolved from sigmas) since N=1 has no timesteps[1].
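The N-length timesteps layout asserted by the scheduler test can be pictured as below. This is a sketch under assumptions, not the scheduler's code: it uses linearly spaced sigmas with no shift, a scale of 1000 training timesteps, and resolves the target as a sigma rather than as an `r_timestep`; `make_schedule` and `resolve_target_sigma` are hypothetical names.

```python
from typing import List, Optional, Tuple


def make_schedule(num_inference_steps: int, num_train_timesteps: int = 1000) -> Tuple[List[float], List[float]]:
    """Return (timesteps, sigmas): N timesteps, N + 1 sigmas ending at 0."""
    n = num_inference_steps
    sigmas = [1.0 - i / n for i in range(n + 1)]  # last entry is the sigma=0 endpoint
    timesteps = [s * num_train_timesteps for s in sigmas[:-1]]  # one entry per step
    return timesteps, sigmas


def resolve_target_sigma(step_index: int, sigmas: List[float], r_sigma: Optional[float] = None) -> float:
    """Use the caller's target if given, otherwise the next entry in sigmas."""
    return sigmas[step_index + 1] if r_sigma is None else r_sigma


timesteps, sigmas = make_schedule(num_inference_steps=1)
target = resolve_target_sigma(0, sigmas)  # N=1: there is no timesteps[1], so look at sigmas
```

With a single step, `timesteps` holds one entry and the endpoint is only reachable through `sigmas[-1]`, which is the situation the one-shot test is described as exercising.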

…all_docstrings

main huggingface#13758 added utils/check_forward_call_docstrings.py which requires every signature arg to appear as its own `name (...):` entry under Args:. Expand the bidi and FAR transformer forward docstrings to list each parameter individually.

dg845 · 2026-05-21T05:09:22Z

+    _attention_backend = None
+    _parallel_config = None
+
+    _SUPPORTED_BACKENDS = (None, "flex", "_native_flex")


Suggested change

-    _attention_backend = None
-    _parallel_config = None
-
-    _SUPPORTED_BACKENDS = (None, "flex", "_native_flex")
+    _attention_backend = "flex"
+    _parallel_config = None
+
+    _SUPPORTED_BACKENDS = ("flex", "_native_flex")

I think setting the default _attention_backend to "flex" rather than None and removing None from the _SUPPORTED_BACKENDS is cleaner, as only Flex Attention backends are compatible with AnyFlowCausalAttnProcessor. (Using None would generally default to the "native" backend, which isn't compatible.)

Done in ffdc969 — _attention_backend = "flex" default and _SUPPORTED_BACKENDS = ("flex", "_native_flex"). Caught a real bug while verifying: the previous None default would silently fall through to SDPA on backends that ignore BlockMask (visible on mps locally — now raises loudly instead of returning wrong outputs).
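The "raise loudly" behavior mentioned in this reply can be illustrated with a small guard. This is a hypothetical sketch, not the code in ffdc969: `CausalAttnProcessorSketch` and `resolve_backend` are made-up names, and only the two class attributes mirror the suggested change.

```python
from typing import Optional, Tuple


class CausalAttnProcessorSketch:
    """Hypothetical processor that only works with Flex Attention backends."""

    _attention_backend: Optional[str] = "flex"
    _SUPPORTED_BACKENDS: Tuple[str, ...] = ("flex", "_native_flex")

    def resolve_backend(self) -> str:
        backend = self._attention_backend
        if backend not in self._SUPPORTED_BACKENDS:
            # Failing here avoids silently running a backend that ignores the block mask.
            raise ValueError(
                f"Unsupported attention backend {backend!r}; expected one of {self._SUPPORTED_BACKENDS}."
            )
        return backend


processor = CausalAttnProcessorSketch()
default_backend = processor.resolve_backend()
```

Because `None` is no longer in the supported tuple, resetting the attribute to `None` trips the check instead of falling through to a default backend.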

dg845 · 2026-05-21T05:12:00Z

+            dropout_p=0.0,
+            is_causal=False,
+            scale=scale,
+            backend="flex",


Suggested change

-            backend="flex",
+            backend=self._attention_backend,

Follow up to #13745 (comment): using self._attention_backend instead of hardcoding flex here allows us to use other supported backends such as _native_flex.

Done in ffdc969 — backend=self._attention_backend so _native_flex can be selected explicitly.

dg845 · 2026-05-21T05:13:54Z

+        # complex128, so we downcast to complex64 there.
+        self._freqs_cache: Optional[Tuple[Any, torch.Tensor]] = None
+
+    def _build_freqs(self, device: torch.device) -> torch.Tensor:


Suggested change

-    def _build_freqs(self, device: torch.device) -> torch.Tensor:
+    # Copied from diffusers.models.transformers.transformer_anyflow.AnyFlowRotaryPosEmbed._build_freqs
+    def _build_freqs(self, device: torch.device) -> torch.Tensor:

I think _build_freqs should be the same for both the causal and non-causal RoPE embedding modules, so sync their implementations.

Done in ffdc969 — added # Copied from diffusers.models.transformers.transformer_anyflow.AnyFlowRotaryPosEmbed._build_freqs. make fix-copies runs clean.

dg845 · 2026-05-21T05:14:56Z

+        freqs = torch.cat([freqs_f, freqs_h, freqs_w], dim=-1)
+        return freqs
+
+    def _forward_full_frame(self, num_frames, height, width, device) -> torch.Tensor:


Suggested change

-    def _forward_full_frame(self, num_frames, height, width, device) -> torch.Tensor:
+    # Copied from diffusers.models.transformers.transformer_anyflow.AnyFlowRotaryPosEmbed._forward_full_frame
+    def _forward_full_frame(self, num_frames, height, width, device) -> torch.Tensor:

Similarly, I think _forward_full_frame should be the same between the causal and non-causal RoPE modules.

Done in ffdc969 — same pattern (# Copied from for _forward_full_frame).

dg845 · 2026-05-21T05:17:40Z

+                Pre-VAE conditioning frames of shape `(B, C, T, H, W)` in `[0, 1]`. When provided, the pipeline
+                VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive
+                with `video_latents`.


Suggested change

-                Pre-VAE conditioning frames of shape `(B, C, T, H, W)` in `[0, 1]`. When provided, the pipeline
-                VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive
-                with `video_latents`.
+                Pre-VAE conditioning frames of shape `(B, T, C, H, W)` in `[0, 1]`. When provided, the pipeline
+                VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive
+                with `video_latents`.

I think this needs to be updated as VideoProcessor.preprocess_video expects 5D torch.Tensor inputs to have shape [B, T, C, H, W] instead of [B, C, T, H, W].

Done in ffdc969 — docstring + EXAMPLE_DOC_STRING flipped to (B, T, C, H, W) everywhere in both pipelines. Good catch — video_processor.preprocess_video's 5D contract is (B, T, C, H, W), so the previous (B, C, T, H, W) doc would have silently broken users who followed it literally.

dg845 · 2026-05-21T05:19:07Z

+        # 6. Encode conditioning frames (or accept pre-encoded latents).
+        if video is not None and video_latents is not None:
+            raise ValueError("Provide either `video` or `video_latents`, not both.")
+        if video is not None:


Can we move this check to check_inputs so that we fail earlier?

Done in 7a6643b — both bidi and FAR pipelines now do the video / video_latents mutual-exclusion check inside check_inputs. The FAR-specific (num_frames - 1) % 4 == 0 constraint moved there too, so both fail before any work runs.
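A minimal sketch of the kind of early validation described above, assuming a standalone helper rather than the pipeline's actual `check_inputs` method. The function name `check_video_inputs` and the `require_4n_plus_1` flag are hypothetical; the two conditions are the ones named in this thread.

```python
def check_video_inputs(video=None, video_latents=None, num_frames=None, require_4n_plus_1=False):
    """Hypothetical pre-flight check; raises before any encoding work is done."""
    # Mutual exclusion: the caller may pass raw frames or pre-encoded latents, not both.
    if video is not None and video_latents is not None:
        raise ValueError("Provide either `video` or `video_latents`, not both.")
    # FAR-specific constraint from this thread: (num_frames - 1) % 4 == 0.
    if require_4n_plus_1 and num_frames is not None and (num_frames - 1) % 4 != 0:
        raise ValueError(
            f"`num_frames - 1` must be divisible by 4, got num_frames={num_frames}."
        )
```

Calling it first thing means an invalid combination fails before prompt encoding or VAE encoding has started.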

dg845 · 2026-05-21T05:20:10Z

+    @torch.no_grad()
+    @torch.no_grad()
+    def encode_video(self, video: torch.Tensor, height: int, width: int) -> torch.Tensor:


Suggested change

-    @torch.no_grad()
-    @torch.no_grad()
-    def encode_video(self, video: torch.Tensor, height: int, width: int) -> torch.Tensor:
+    @torch.no_grad()
+    def encode_video(self, video: torch.Tensor, height: int, width: int) -> torch.Tensor:

nit: remove extra @torch.no_grad() decorator.

Done in ffdc969 — duplicate @torch.no_grad() removed. (Per the bot follow-up, the non-duplicate @torch.no_grad() was also dropped from bidi encode_video and FAR encode_kv_cache since __call__ already wraps the no-grad scope.)

dg845 · 2026-05-21T05:22:18Z

+        >>> # Single-frame I2V: wrap the conditioning image as a (1, 3, 1, H, W) tensor in [0, 1].
+        >>> first_frame = load_image("path/to/first_frame.png").resize((832, 480))
+        >>> arr = np.asarray(first_frame).astype("float32") / 255.0
+        >>> context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(2).to("cuda")


Suggested change

-        >>> # Single-frame I2V: wrap the conditioning image as a (1, 3, 1, H, W) tensor in [0, 1].
-        >>> first_frame = load_image("path/to/first_frame.png").resize((832, 480))
-        >>> arr = np.asarray(first_frame).astype("float32") / 255.0
-        >>> context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(2).to("cuda")
+        >>> # Single-frame I2V: wrap the conditioning image as a (1, 1, 3, H, W) tensor in [0, 1].
+        >>> first_frame = load_image("path/to/first_frame.png").resize((832, 480))
+        >>> arr = np.asarray(first_frame).astype("float32") / 255.0
+        >>> context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")

For the same reason as #13745 (comment), we should input a [B, T, C, H, W] rather than a [B, C, T, H, W] video tensor.

Done in ffdc969 — unsqueeze(0).unsqueeze(1) to produce the (1, 1, 3, H, W) shape per the suggestion.

dg845 · 2026-05-21T05:23:41Z

+            video (`torch.Tensor`, *optional*):
+                Pre-VAE conditioning frames of shape `(B, C, T, H, W)` in `[0, 1]` (`T = 4n + 1`). When provided, the
+                pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually
+                exclusive with `video_latents`.


Suggested change

-            video (`torch.Tensor`, *optional*):
-                Pre-VAE conditioning frames of shape `(B, C, T, H, W)` in `[0, 1]` (`T = 4n + 1`). When provided, the
-                pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually
-                exclusive with `video_latents`.
+            video (`torch.Tensor`, *optional*):
+                Pre-VAE conditioning frames of shape `(B, T, C, H, W)` in `[0, 1]` (`T = 4n + 1`). When provided, the
+                pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually
+                exclusive with `video_latents`.

Analogous suggestion to #13745 (comment) for the FAR causal pipeline.

Done in ffdc969 — same fix as the bidi side.

dg845 · 2026-05-21T05:31:17Z

+        if show_progress:
+            chunk_iter = tqdm(chunk_iter)


Instead of using a show_progress argument here, I think we should use nested progress bars like LLaDA-2 does. We can define an outer progress bar:

diffusers/src/diffusers/pipelines/llada2/pipeline_llada2.py

Lines 426 to 428 in f502538

outer_progress_bar_config = getattr(self, "_progress_bar_config", {}).copy()
block_progress_bar_config = {**outer_progress_bar_config, "position": 0, "desc": "Blocks"}
for num_block in tqdm(range(prefill_blocks, num_blocks), **block_progress_bar_config):

and an inner progress bar:

diffusers/src/diffusers/pipelines/llada2/pipeline_llada2.py

Lines 444 to 450 in f502538

inner_progress_bar_config = {
    **outer_progress_bar_config,
    "position": 1,
    "leave": False,
    "desc": f"Block {num_block} Inference Steps",
}
progress_bar = tqdm(total=num_inference_steps, **inner_progress_bar_config)

using the pipeline's _progress_bar_config and appropriate arguments to make sure that the inner progress bars don't stack up. This should respect any configuration set through DiffusionPipeline.set_progress_bar_config better (for example, using pipe.set_progress_bar_config(disable=None) to disable the progress bars).

Done in 7a6643b — show_progress argument removed; replaced with nested tqdm bars in the LLaDA-2 pattern (outer Chunks at position=0, inner Inference Steps per chunk at position=1, leave=False). Both pick up DiffusionPipeline._progress_bar_config, so set_progress_bar_config(disable=None) etc. now work as expected.
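The nested-bar arrangement can be sketched without tqdm itself by looking only at how the two keyword dictionaries are derived from the pipeline's shared config. The helper name `nested_bar_configs` and the exact inner `desc` string are assumptions for illustration; the `position` and `leave` values follow the description in this thread.

```python
def nested_bar_configs(progress_bar_config, chunk_index):
    """Hypothetical helper: build tqdm kwargs for an outer chunk bar and an inner step bar."""
    # Copy so the pipeline-level config is never mutated.
    base = dict(progress_bar_config or {})
    # Outer bar: one tick per chunk, pinned to the first terminal row.
    outer = {**base, "position": 0, "desc": "Chunks"}
    # Inner bar: one tick per inference step, cleared when the chunk finishes
    # so finished bars do not stack up.
    inner = {
        **base,
        "position": 1,
        "leave": False,
        "desc": f"Chunk {chunk_index} Inference Steps",  # assumed wording
    }
    return outer, inner
```

Because both dictionaries start from the same base, any option set once on the pipeline (for instance `disable`) reaches both bars.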

dg845 · 2026-05-21T05:38:57Z

+        timestep = timestep / self.config.num_train_timesteps
+        r_timestep = r_timestep / self.config.num_train_timesteps


nit: I think getting the underlying t_sigma and r_sigma corresponding to timestep and r_timestep via something like the logic in _resolve_next_timestep or an internal step_idx like FlowMatchEulerDiscreteScheduler uses:

diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py

Lines 501 to 503 in f502538

sigma_idx = self.step_index
sigma = self.sigmas[sigma_idx]
sigma_next = self.sigmas[sigma_idx + 1]

would be slightly better here to cover cases where the timesteps and sigmas aren't related through scaling by self.config.num_train_timesteps.

Done across 7a6643b + 84605d5 + 89128cf:

7a6643b: introduced index_for_timestep() and rewrote step() to resolve both t_sigma and r_sigma via self.sigmas[idx] lookups. For the shipped linspace + shift schedule this is bit-identical to the previous t / num_train_timesteps formulation (max abs diff = 0.0 on an 8-step replay), but it stays correct under any future schedule whose timestep / sigma mapping isn't strictly linear.

84605d5: added the full FlowMatchEulerDiscreteScheduler-style state machine — _step_index, _begin_index, step_index / begin_index properties, set_begin_index, _init_step_index. step() lazily initializes and advances the counter on every call so downstream callbacks / composable schedulers observe it. Sigma resolution stays a pure function of the passed-in (timestep, r_timestep) so step() is idempotent (calling it twice with the same args returns identical prev_sample).

89128cf: audit fix — the earlier _init_step_index raised when the first timestep was off-schedule, which contradicted step()'s documented any-step support. _init_step_index now falls back to 0 for off-schedule starts (still a valid observable counter); _resolve_next_timestep was removed since its callers were all inlined.

Bit-exact replay on H200 (random-init bidi + FAR forward, fp32, comparing d0181ea baseline to 84605d5): state_dict missing=0 / unexpected=0, L2 = 0.0e+00, max|Δ| = 0.0e+00.
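To make the lookup-versus-division point concrete, here is a small pure-Python sketch. It assumes `num_train_timesteps = 1000`, the common flow-matching shift formula `shift * s / (1 + (shift - 1) * s)`, and timesteps defined as `sigma * num_train_timesteps`; none of these are confirmed by this thread as the scheduler's exact internals, and the function names are illustrative only.

```python
NUM_TRAIN_TIMESTEPS = 1000  # assumed value for illustration


def make_schedule(num_steps, shift):
    """Assumed linspace + shift schedule: sigmas run from 1.0 down to 0.0."""
    raw = [1.0 - i / num_steps for i in range(num_steps + 1)]
    sigmas = [shift * s / (1.0 + (shift - 1.0) * s) for s in raw]
    timesteps = [s * NUM_TRAIN_TIMESTEPS for s in sigmas]
    return sigmas, timesteps


def index_for_timestep(timesteps, t, tol=1e-6):
    """Return the schedule index matching t, or None when t is off-schedule."""
    for i, ts in enumerate(timesteps):
        if abs(ts - t) <= tol:
            return i
    return None


def resolve_sigma(sigmas, timesteps, t):
    """Prefer the sigma table; fall back to scaling only for off-schedule values."""
    idx = index_for_timestep(timesteps, t)
    if idx is not None:
        return sigmas[idx]
    return t / NUM_TRAIN_TIMESTEPS
```

On a schedule where timesteps are a pure rescaling of sigmas, the two routes agree up to floating-point rounding; the table lookup only starts to matter once that relationship is no longer a simple scale factor.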

dg845 · 2026-05-21T05:40:17Z

+    @unittest.skip("AnyFlow has no optional components.")
+    def test_save_load_optional_components(self):
+        pass
+


Suggested change

-    @unittest.skip("AnyFlow has no optional components.")
-    def test_save_load_optional_components(self):
-        pass

I think we don't need to skip test_save_load_optional_components here as this test actually passes.

Done in ffdc969 — skip removed; test passes on CPU.

dg845 · 2026-05-21T05:41:19Z

+    @unittest.skip("AnyFlow has no optional components.")
+    def test_save_load_optional_components(self):
+        pass
+


Suggested change

-    @unittest.skip("AnyFlow has no optional components.")
-    def test_save_load_optional_components(self):
-        pass

Analogous comment as #13745 (comment).

Done in ffdc969 — same as the bidi side.

dg845 · 2026-05-21T05:43:54Z

Hi @claude, can you do a review to see if there is anything I missed?

github-actions · 2026-05-21T05:44:12Z

Claude finished @dg845's task in 5m 22s.

Review Complete

Hi @dg845 — here's my review of the remaining items. Your review was thorough; I'll focus on things I found that you didn't already flag, and then include a summary of your open items for tracking.

New findings (beyond your existing review comments)

1. Unused attention_kwargs parameter in both transformers

Both AnyFlowTransformer3DModel.forward() (transformer_anyflow.py:633) and AnyFlowFARTransformer3DModel.forward() (transformer_anyflow_far.py:923) declare attention_kwargs: Optional[Dict[str, Any]] = None but never use or forward it. In the FAR transformer, it's propagated through the common dict to _forward_cache/_forward_inference/_forward_train, where it's accepted as a parameter but never referenced.

Per models.md: "Don't declare attention_mask (or encoder_hidden_states_mask) in the forward signature if you ignore it." The same principle applies to attention_kwargs. Either use it (pass to the attention processors) or remove it.

2. Misleading far_cfg variable name in bidirectional transformer

The bidirectional transformer (transformer_anyflow.py:667) constructs a far_cfg dict, but "FAR" (Frame-level AutoRegressive) is the causal variant specifically. This name is confusing in the bidi context. Consider renaming to something like layout_cfg or frame_cfg in the bidi transformer. The causal transformer's far_cfg is fine since it actually is FAR-specific.

3. FlowMapEulerDiscreteScheduler is missing _step_index / _begin_index infrastructure

Beyond the _resolve_next_timestep improvement you flagged, the FlowMatchEulerDiscreteScheduler (which this scheduler is modeled after) implements a full step-tracking API: _step_index, _begin_index, step_index property, set_begin_index(), and _init_step_index(). The new scheduler has none of this. While not strictly needed for the current pipelines (since they resolve r from sigmas[i + 1] directly), downstream code may expect schedulers to expose step_index — e.g., IP-Adapter scheduled timesteps, certain callbacks, or composable schedulers.

4. Redundant @torch.no_grad() on encode_video (bidi pipeline) and encode_kv_cache (FAR pipeline)

You flagged the duplicate @torch.no_grad() on the FAR encode_video (line 378). I'd extend this: the non-duplicate @torch.no_grad() on the bidi pipeline's encode_video (pipeline_anyflow.py:354) and the FAR's encode_kv_cache (pipeline_anyflow_far.py:396) are also redundant per pipelines.md: "Redundant inside helpers that __call__ already covers. The decorator puts every descendant in no-grad, so an inner with torch.no_grad(): is noise — and worse, it forecloses callers who want to invoke pipe.encode_prompt(...) with grads enabled."

5. Dead temb.ndim == 2 branch in bidirectional transformer output norm

In transformer_anyflow.py:702-707, the else branch (for temb.ndim != 3) is unreachable. temb comes from condition_embedder.forward() → forward_timestep() which always returns a 3D tensor (B, total_tokens, dim) because of the unflatten + repeat_interleave at line 279. The else branch handling a 2D temb is dead code.

6. Public methods forward_far_patchify and forward_far_patchify_inference should be private

In transformer_anyflow_far.py, forward_far_patchify (line 1000) and forward_far_patchify_inference (line 1024) are public methods only called internally by _forward_train/_forward_cache and _forward_inference respectively. Per models.md, module calls should be visible in forward directly, and methods that are only called internally should be private. These should be _forward_far_patchify / _forward_far_patchify_inference.

7. Minor: comment numbering skips step 4 in both pipelines

Both pipeline_anyflow.py and pipeline_anyflow_far.py jump from comment # 3. Encode input prompt to # 5. Prepare latent variables — step 4 is missing. (Very minor readability nit.)

8. Minor: _build_causal_mask comment numbering

In transformer_anyflow_far.py:1076-1082, the mask_mod function comments skip from # 1) to # 3) (no # 2)).

Suggestions / additional info: dead code trace

I traced the call paths from both pipelines into the transformers:

Bidirectional pipeline → AnyFlowTransformer3DModel:

The pipeline never passes encoder_hidden_states_image (always None). The image embedding path at line 689 is therefore unreachable via the current pipeline — this is fine as advisory since I2V may be added later or users may call the model directly.
attention_kwargs is never consumed (see finding #1 above).
The temb.ndim == 2 branch is unreachable (see finding #5 above).

FAR pipeline → AnyFlowFARTransformer3DModel:

The pipeline only uses _forward_cache (via encode_kv_cache) and _forward_inference (via the denoising loop). _forward_train is never called by the pipeline. This is expected — it's for training use.
clean_hidden_states and clean_timestep are only used in _forward_train and _forward_cache. The pipeline only passes them as None from the denoising loop (and through encode_kv_cache where they default to None), so the clean_hidden_states is not None branches in _forward_cache are unreachable during pipeline inference. Again expected for a training code path.
attention_kwargs is propagated but never consumed (see finding #1).

Summary of dg845's open review items (for tracking)

These are your existing review comments that appear to still be open based on the latest code:

| # | File | Issue | Status |
|---|------|-------|--------|
| 1 | `transformer_anyflow_far.py:87` | `_attention_backend` should default to `"flex"`, remove `None` from `_SUPPORTED_BACKENDS` | Open |
| 2 | `transformer_anyflow_far.py:197` | Use `self._attention_backend` instead of hardcoded `"flex"` | Open |
| 3 | `transformer_anyflow_far.py:644` | `_build_freqs` needs `# Copied from` annotation | Open |
| 4 | `transformer_anyflow_far.py:720` | `_forward_full_frame` needs `# Copied from` annotation | Open |
| 5 | `pipeline_anyflow.py:412` | Video shape docstring: `(B, T, C, H, W)` not `(B, C, T, H, W)` | Open |
| 6 | `pipeline_anyflow.py:540` | Move `video`/`video_latents` mutual-exclusion check to `check_inputs` | Open |
| 7 | `pipeline_anyflow_far.py:379` | Remove duplicate `@torch.no_grad()` | Open |
| 8 | `pipeline_anyflow_far.py:58` | Example: `unsqueeze(1)` not `unsqueeze(2)` for `(B, T, C, H, W)` | Open |
| 9 | `pipeline_anyflow_far.py:475` | FAR docstring video shape correction | Open |
| 10 | `pipeline_anyflow_far.py:692` | Use nested progress bars instead of `show_progress`/`tqdm` | Open |
| 11 | `scheduling_flow_map_euler_discrete.py:189` | Use `_step_index` or `_resolve_next_timestep` for sigma lookup | Open |
| 12 | `test_anyflow.py:143` | Remove unnecessary `test_save_load_optional_components` skip | Open |
| 13 | `test_anyflow_far.py:154` | Remove unnecessary `test_save_load_optional_components` skip | Open |

Overall the PR is well-structured — the transformer split is clean, the scheduler API is sensible, and the # Copied from annotations are correctly used. The main remaining work is the items above and dg845's open comments.

Enderfga · 2026-05-21T05:50:41Z

Thanks for the second pass @dg845 — really appreciate the careful read. Working through all 13 now. A quick map of how I'm going to handle them:

One-click suggestions (FAR processor backend defaults + # Copied from syncs, the video shape fix in both pipelines + EXAMPLE_DOC_STRING, the duplicate @torch.no_grad(), lifting the two test_save_load_optional_components skips) — applying as-is.
Architectural moves — also taking all three: hoist the video/video_latents mutual-exclusion check into check_inputs, switch the FAR pipeline to nested progress bars in the LLaDA-2 pattern (outer chunk / inner timestep, respecting _progress_bar_config), and rework _resolve_next_timestep to index via step_idx like FlowMatchEulerDiscreteScheduler does (no timestep/num_train reverse-derivation).

I'll re-run the bit-exact replay against NVlabs/AnyFlow on H200 after the scheduler rework lands, since that one touches the numeric path. Pushing in a few hours; will reply per-thread as each is done.

FAR transformer:
- AnyFlowCausalAttnProcessor: default _attention_backend = 'flex' (was None); remove None from _SUPPORTED_BACKENDS. None previously fell through to SDPA, which silently ignored the BlockMask; failing loudly is the right default.
- dispatch_attention_fn call: read self._attention_backend instead of hardcoded 'flex', so '_native_flex' selection works.
- _build_freqs / _forward_full_frame: add '# Copied from' to bidi RoPE.

Pipelines:
- bidi + FAR docstrings: video shape (B, C, T, H, W) -> (B, T, C, H, W) to match VideoProcessor.preprocess_video.
- FAR EXAMPLE_DOC_STRING: single-frame I2V tensor wrap uses unsqueeze(1) for the T axis instead of unsqueeze(2).
- FAR encode_video: drop duplicated @torch.no_grad() decorator.

Tests:
- test_anyflow / test_anyflow_far: lift the test_save_load_optional_components skip (the test actually passes).
- FAR processor smoke test: assert default backend is 'flex' (was 'None').

Pipelines:
- check_inputs accepts video / video_latents and raises early on: (a) mutual exclusion (was checked late in __call__); (b) FAR's (num_frames - 1) % 4 == 0 constraint. __call__ no longer carries duplicate validation.
- FAR pipeline: drop the show_progress kwarg and replace the single tqdm with nested progress bars in the LLaDA-2 pattern: outer 'Chunks' (position=0) and per-chunk inner 'Inference Steps' (position=1, leave=False), both picking up DiffusionPipeline._progress_bar_config (so set_progress_bar_config controls them, including disable=None).

Scheduler:
- step() resolves source and target sigmas by indexing self.sigmas via the new index_for_timestep(), instead of dividing the input timesteps by num_train_timesteps. This keeps the math correct for any future schedule whose timestep/sigma relationship is non-linear. For an off-schedule r_timestep the code falls back to r / num_train_timesteps, so explicit any-step sampling outside the schedule still works (and t off-schedule with r=None still raises a clear ValueError, as before).

Numerical equivalence: for the shipped linspace+shift schedule the two formulations are bit-identical (verified: max abs diff = 0.0 over an N=8, shift=5 schedule).

Finding #1 (attention_kwargs plumbing): Both transformers now decorate forward() with @apply_lora_scale('attention_kwargs') (matches Wan); pipelines forward attention_kwargs to the transformer + encode_kv_cache, and the unused parameter is dropped from the inner _forward_train / _forward_cache / _forward_inference signatures. Pipeline docstrings updated to the standard wording.

Finding #2 (naming): Rename far_cfg -> layout_cfg in the bidi transformer (the bidi path is not FAR; the FAR transformer keeps far_cfg, which is accurate there).

Finding #3 (scheduler state machine): Add _step_index, _begin_index, step_index property, begin_index property, set_begin_index(), _init_step_index(). step() lazily initializes and advances the counter so downstream callbacks / composable schedulers can observe rollout progress. Sigma resolution remains a pure function of (timestep, r_timestep): calling step() twice with identical args still returns identical prev_sample (idempotent).

Finding #4 (redundant @torch.no_grad()): Drop the redundant decorators on bidi pipeline's encode_video and FAR pipeline's encode_kv_cache (callers are already in __call__'s no-grad scope).

Finding #5 (dead code): Remove the unreachable temb.ndim == 2 else branch from the bidi transformer's output-norm path (condition_embedder.forward always returns a 3D temb).

Finding #6 (private rename): forward_far_patchify[_inference] -> _forward_far_patchify[_inference] (only called internally by _forward_train / _forward_cache / _forward_inference).

Finding #7 (pipeline comment numbering): Bidi + FAR pipelines renumber steps so the # 4. slot is no longer skipped.

Finding #8 (mask-mod comment numbering): _build_causal_mask numbered comments now run 1) 2) 3) ... (was 1) 3) 4) ...).

Tests:
- New test_step_index_advances + test_set_begin_index_anchors_step_index in the scheduler test file exercise the new state machine.
- All existing pipeline / transformer / scheduler tests still pass (85 passed, 85 skipped on CPU).

Bit-exact: 8-step rollout vs the previous formulation, max abs diff = 0.0 (the new sigma-lookup is byte-identical to t/num_train_timesteps on this schedule).

…; drop dead _resolve_next_timestep

Audit caught two issues in the previous scheduler commit:

1. The new state machine raised in _init_step_index whenever the first timestep wasn't on the active schedule, contradicting the documented contract that step() falls back to t/num_train_timesteps for off-schedule any-step sampling. The fall-back numerics were intact but unreachable, because the init check fired first. Fix: _init_step_index now initializes _step_index to 0 when the timestep is off-schedule (still a valid observable counter for callbacks). step()'s sigma resolution is untouched, so on-schedule rollouts stay bit-exact and off-schedule any-step sampling actually runs again. Regression test: test_step_off_schedule_anystep_supported.
2. _resolve_next_timestep had no remaining callers after the step() rewrite inlined the same lookup. Removed (private helper, no external API).
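The counter behaviour described in this commit can be sketched in isolation. The class below is a hypothetical, stripped-down stand-in (named `StepCounterSketch` here) that tracks only the index; it does no sampling math and is not the scheduler's actual code. It assumes the same precedence as `FlowMatchEulerDiscreteScheduler`, where an explicit begin index wins over a timestep lookup.

```python
class StepCounterSketch:
    """Hypothetical step-index tracker with an off-schedule fallback to 0."""

    def __init__(self, timesteps):
        self.timesteps = list(timesteps)
        self._step_index = None
        self._begin_index = None

    @property
    def step_index(self):
        return self._step_index

    def set_begin_index(self, begin_index=0):
        self._begin_index = begin_index

    def _init_step_index(self, timestep):
        # An explicit begin index takes precedence over any lookup.
        if self._begin_index is not None:
            self._step_index = self._begin_index
            return
        matches = [i for i, t in enumerate(self.timesteps) if abs(t - timestep) < 1e-6]
        # Off-schedule start: fall back to 0 instead of raising.
        self._step_index = matches[0] if matches else 0

    def step(self, timestep):
        # Lazily initialise on the first call, then advance by one per call.
        if self._step_index is None:
            self._init_step_index(timestep)
        self._step_index += 1
        return self._step_index
```

The point of the fallback is that the counter stays observable for callbacks even when the caller steps from a timestep that is not in the table.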

- en api/pipelines/anyflow.md: video shape (B, C, T, H, W) -> (B, T, C, H, W); example tensor wrap uses unsqueeze(0).unsqueeze(1) and permute(0, 3, 1, 2) to match VideoProcessor.preprocess_video's 5D contract.
- zh using-diffusers/anyflow.md: same shape fixes; also flip the I2V / V2V examples from the obsolete context_sequence={...} dict to the current video= / video_latents= kwargs; helper to_video_tensor returns (1, T, C, H, W); add a note about mutual exclusion.

Enderfga · 2026-05-21T07:22:09Z

Hi @dg845 @sayakpaul — second-pass review fully addressed. Per-thread replies are inline; this is the high-level summary.

dg845's 13 review threads — all applied

| Area | Outcome |
|------|---------|
| FAR processor backend (1, 2) | Default `_attention_backend = "flex"`, `_SUPPORTED_BACKENDS = ("flex", "_native_flex")`, `backend=self._attention_backend` |
| FAR rope `# Copied from` (3, 4) | `_build_freqs` + `_forward_full_frame` synced to bidi; `make fix-copies` clean |
| `video` shape docs (5, 8, 9) | Pipelines + `EXAMPLE_DOC_STRING` + user guides flipped to `(B, T, C, H, W)` to match `video_processor.preprocess_video`'s 5D contract |
| Pre-flight validation (6) | `video` / `video_latents` mutual-exclusion + `(num_frames - 1) % 4 == 0` moved into `check_inputs` |
| FAR pipeline cleanup (7, 10) | Duplicate `@torch.no_grad()` removed; `show_progress` arg dropped in favour of nested tqdm bars (LLaDA-2 pattern, respects `set_progress_bar_config`) |
| Scheduler (11) | New `index_for_timestep`; `step()` resolves both `t_sigma` / `r_sigma` via sigma lookup; full `_step_index` / `_begin_index` state machine; off-schedule any-step path repaired (`89128cf` audit fix) |
| Test skip cleanup (12, 13) | `test_save_load_optional_components` un-skipped on both pipelines |

Claude bot follow-up — also applied

attention_kwargs: both transformer forward() methods now decorate with @apply_lora_scale("attention_kwargs") (matching WanTransformer3DModel); pipelines plumb the kwarg through to transformer(...) and encode_kv_cache(...); unused parameter is removed from the inner FAR _forward_* signatures.
far_cfg → layout_cfg in the bidirectional transformer (the bidi path isn't FAR; FAR file keeps far_cfg).
Full _step_index / _begin_index state machine on FlowMapEulerDiscreteScheduler — same as item 11 above.
Redundant @torch.no_grad() removed from bidi encode_video + FAR encode_kv_cache (already inside __call__'s no-grad scope).
Dead temb.ndim == 2 branch removed from the bidi transformer's output norm.
forward_far_patchify[_inference] → _forward_far_patchify[_inference] (only called internally).
Pipeline step comments renumbered so # 4. is no longer skipped.
_build_causal_mask mask-mod comments renumbered (1, 2, 3, …).

Bit-exact validation (H200, fp32, random-init bidi + FAR)

Replay comparing d0181ea (pre-second-pass) to current HEAD:

state_dict key match: missing = 0 / unexpected = 0 for both models
bidi forward: L2 = 0.0e+00, max|Δ| = 0.0e+00
FAR forward: L2 = 0.0e+00, max|Δ| = 0.0e+00

So the entire second-pass refactor is provably numerically equivalent on the shipped linspace + shift schedule. The earlier L2 = 0 guarantee against NVlabs/AnyFlow is preserved.

Commits this round

ffdc969 — A (1-click suggestions, 10 of dg845's)
7a6643b — B (refactors: check_inputs hoist, nested progress bars, scheduler sigma-lookup)
84605d5 — Claude bot's 8 findings
89128cf — audit fix: off-schedule any-step regression in _init_step_index
be2aefe — user-guide doc updates for the shape + kwarg fixes

Ready when you are. Happy to iterate further if anything's still off.

.ai/skills/model-integration/SKILL.md is explicit: 'No integration / slow tests in the initial PR — don't add anything gated on @slow / RUN_SLOW=1 yet.' Our two integration test classes were shape-only assertions with TODOs for a future numeric reference, so dropping them loses no actual coverage — the relevant rollouts are covered by H200 bit-exact replay outside the pytest suite. Can land a follow-up PR after merge with proper numeric reference slices once the maintainer is comfortable enabling slow tests.

Enderfga added 15 commits May 6, 2026 14:41

[Pipelines] AnyFlow: scaffold pipelines/anyflow + register all top-le…

507fd9b

…vel imports This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py, pipeline_anyflow_causal.py, transformer_anyflow.py, scheduling_flow_map_euler_discrete.py) come in subsequent commits.

github-actions Bot added size/L PR with diff > 200 LOC documentation Improvements or additions to documentation models tests utils pipelines schedulers and removed size/L PR with diff > 200 LOC labels May 14, 2026

Merge branch 'main' into add-anyflow-pipeline

8da3679

github-actions Bot added the size/L PR with diff > 200 LOC label May 14, 2026

Enderfga mentioned this pull request May 14, 2026

Load nvidia/AnyFlow-* checkpoints from the diffusers AnyFlow metadata layout NVlabs/AnyFlow#2

Draft

Enderfga and others added 2 commits May 14, 2026 20:57

Merge branch 'main' into add-anyflow-pipeline

76e91f8

dg845 requested review from dg845 and yiyixuxu May 16, 2026 00:16

Merge remote-tracking branch 'upstream/main' into add-anyflow-pipeline

19348ff

# Conflicts: # docs/source/en/_toctree.yml

Enderfga and others added 3 commits May 21, 2026 10:17

Merge branch 'main' into add-anyflow-pipeline

337b36a

dg845 reviewed May 21, 2026


Enderfga added 5 commits May 21, 2026 13:56

Enderfga and others added 2 commits May 21, 2026 15:24

Merge branch 'main' into add-anyflow-pipeline

27e16a2

