Wan 2.2 T2V-A14B


Wan 2.2 T2V-A14B is the successor to Wan 2.1 and, like it, is a text-to-video flow-matching DiT. Its defining feature is a two-stage denoising pipeline: a high-noise transformer (transformer) handles the early, noisy timesteps and a low-noise transformer (transformer_2) handles the later, cleaner timesteps. The switch happens at timestep boundary_ratio * num_train_timesteps, where boundary_ratio defaults to 0.875. Each transformer has ~14B parameters, for ~28B in total.
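
The switch point can be illustrated with a small sketch. It assumes num_train_timesteps = 1000 (a common scheduler setting that this page does not state) and that a timestep exactly at the boundary still goes to the high-noise transformer. select_transformer is a hypothetical helper written for illustration, not part of NeMo AutoModel or any other library.

```python
def boundary_timestep(boundary_ratio: float = 0.875,
                      num_train_timesteps: int = 1000) -> float:
    """Timestep at which denoising switches between the two transformers."""
    return boundary_ratio * num_train_timesteps


def select_transformer(timestep: float,
                       boundary_ratio: float = 0.875,
                       num_train_timesteps: int = 1000) -> str:
    """Return the name of the transformer assumed to handle this timestep."""
    # Assumption: timesteps at or above the boundary count as "high noise".
    if timestep >= boundary_timestep(boundary_ratio, num_train_timesteps):
        return "transformer"      # high-noise stage
    return "transformer_2"        # low-noise stage


for t in (999, 875, 874, 0):
    print(t, select_transformer(t))
```

With these assumed defaults the boundary falls at timestep 875, so only the first eighth of the schedule (the noisiest steps) is handled by the high-noise transformer.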

Task: Text-to-Video
Architecture: DiT (Flow Matching), two-stage
Parameters: 14B + 14B (high-noise + low-noise)
HF Org: Wan-AI

Available Models

  • Wan2.2-T2V-A14B: two transformers, ~14B params each, boundary_ratio 0.875

Task

  • Text-to-Video (T2V)

Example HF Models

Model | HF ID
Wan 2.2 T2V-A14B | Wan-AI/Wan2.2-T2V-A14B-Diffusers

Example Recipes

Recipe | Description
wan2_2_t2v_flow.yaml | Fine-tune: two-stage with model.stage knob
generate_wan22.yaml | Inference: loads both stage checkpoints

Two-stage finetuning workflow

Because each transformer is ~14B parameters, NeMo AutoModel finetunes them one at a time:

  1. Preprocess once — produces a single cached .meta set reusable across both stages:

    python -m tools.diffusion.preprocessing_multiprocess video \
      --video_dir /path/to/videos --output_dir /path/to/wan22_cache \
      --processor wan2.2 --caption_format meta_json --caption_field caption \
      --resolution_preset 512p --target_frames 81
  2. Finetune the high-noise stage (pipe.transformer, sigma ∈ [boundary_ratio, 1.0]):

    torchrun --nproc-per-node=8 \
      examples/diffusion/finetune/finetune.py \
      -c examples/diffusion/finetune/wan2_2_t2v_flow.yaml \
      --model.stage=high_noise \
      --data.dataloader.cache_dir=/path/to/wan22_cache \
      --checkpoint.checkpoint_dir=./WAN22_CKPT/wan22_high \
      --fsdp.cpu_offload=true
  3. Finetune the low-noise stage (pipe.transformer_2, sigma ∈ [0.0, boundary_ratio]):

    torchrun --nproc-per-node=8 \
      examples/diffusion/finetune/finetune.py \
      -c examples/diffusion/finetune/wan2_2_t2v_flow.yaml \
      --model.stage=low_noise \
      --data.dataloader.cache_dir=/path/to/wan22_cache \
      --checkpoint.checkpoint_dir=./WAN22_CKPT/wan22_low \
      --fsdp.cpu_offload=true
  4. Run inference loading both stage checkpoints:

    python examples/diffusion/generate/generate.py \
      -c examples/diffusion/generate/configs/generate_wan22.yaml
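
The sigma ranges quoted in steps 2 and 3 can be written down as a short sketch. The stage names and the 0.875 default come from this page; sigma_range and sample_sigma are hypothetical helpers, and the uniform draw is only an assumption made for illustration. The actual recipe may sample sigma from a different distribution.

```python
import random
from typing import Optional, Tuple


def sigma_range(stage: str, boundary_ratio: float = 0.875) -> Tuple[float, float]:
    """Map a model.stage value to the sigma interval that stage trains on."""
    if stage == "high_noise":
        return (boundary_ratio, 1.0)   # pipe.transformer
    if stage == "low_noise":
        return (0.0, boundary_ratio)   # pipe.transformer_2
    raise ValueError(f"unknown stage: {stage!r}")


def sample_sigma(stage: str,
                 boundary_ratio: float = 0.875,
                 rng: Optional[random.Random] = None) -> float:
    """Draw one sigma inside the stage's interval (uniform draw is an assumption)."""
    if rng is None:
        rng = random.Random()
    low, high = sigma_range(stage, boundary_ratio)
    return rng.uniform(low, high)


print(sigma_range("high_noise"))  # (0.875, 1.0)
print(sigma_range("low_noise"))   # (0.0, 0.875)
```

The two intervals meet at boundary_ratio, so together the stages cover the full noise range without a gap.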

Each finetuning run holds only one of the two transformers on GPU. The recipe drops the unused one before sharding, so an FSDP2 dp=8 setup on 8×80GB H100 fits a single 14B model plus its AdamW state. --fsdp.cpu_offload=true is recommended; it moves the sharded parameters and optimizer state to host RAM at step boundaries.
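
A back-of-the-envelope estimate shows why dropping the unused transformer matters. The sketch below assumes exactly 14B parameters, fp32 weights, fp32 gradients and two fp32 AdamW moments per parameter, all sharded evenly across 8 ranks. None of these precisions are confirmed by this page, and activations, buffers and fragmentation are ignored, so treat the numbers as rough lower bounds rather than measurements.

```python
def training_state_gb(num_params: int,
                      bytes_per_param: int = 4,       # assumed fp32 weights
                      bytes_per_grad: int = 4,        # assumed fp32 gradients
                      bytes_per_adam_state: int = 8   # two fp32 AdamW moments
                      ) -> float:
    """Total weight + gradient + optimizer memory in GB (1 GB = 1e9 bytes)."""
    per_param = bytes_per_param + bytes_per_grad + bytes_per_adam_state
    return num_params * per_param / 1e9


def per_gpu_gb(num_params: int, dp_size: int) -> float:
    """Per-rank share when the state is sharded evenly across dp_size GPUs."""
    return training_state_gb(num_params) / dp_size


one_stage = per_gpu_gb(14_000_000_000, dp_size=8)    # 28.0
both_stages = per_gpu_gb(28_000_000_000, dp_size=8)  # 56.0
print(f"one transformer:  {one_stage:.1f} GB per GPU")
print(f"both transformers: {both_stages:.1f} GB per GPU")
```

Under these assumptions a single stage leaves roughly 50 GB of an 80 GB GPU for activations, while keeping both transformers resident would leave less than half of that.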

Try with NeMo AutoModel

1. Install from PyPI (see the Installation Guide for full instructions):

pip install nemo-automodel

2. Clone the repo to get the example recipes:

git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel

3. Run the recipe from inside the repo:

torchrun --nproc-per-node=8 \
  examples/diffusion/finetune/finetune.py \
  -c examples/diffusion/finetune/wan2_2_t2v_flow.yaml \
  --model.stage=high_noise \
  --fsdp.cpu_offload=true

Alternatively, run inside the NeMo AutoModel container:

1. Pull the container and mount a checkpoint directory:

docker run --gpus all -it --rm \
  --shm-size=8g \
  -v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
  nvcr.io/nvidia/nemo-automodel:26.06.00

2. Navigate to the AutoModel directory (where the recipes are):

cd /opt/Automodel

3. Run the recipe:

torchrun --nproc-per-node=8 \
  examples/diffusion/finetune/finetune.py \
  -c examples/diffusion/finetune/wan2_2_t2v_flow.yaml \
  --model.stage=high_noise \
  --fsdp.cpu_offload=true

See the Installation Guide and Diffusion Fine-Tuning Guide.

Training

See the Diffusion Training and Fine-Tuning Guide and Dataset Preparation.

Hugging Face Model Cards