Wan 2.2 T2V-A14B

Wan 2.2 T2V-A14B is the successor to Wan 2.1 and, like it, a text-to-video flow-matching DiT. Its defining feature is a two-stage denoising pipeline: a high-noise transformer (transformer) handles the early, noisy timesteps and a low-noise transformer (transformer_2) handles the later, cleaner timesteps. The switch happens at timestep boundary_ratio * num_train_timesteps, where boundary_ratio defaults to 0.875. Each transformer has ~14B parameters, for ~28B in total.
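
The switching rule can be illustrated with a minimal sketch. It is not the library's implementation: the function name select_stage is hypothetical, num_train_timesteps = 1000 is an assumed scheduler setting, and assigning the boundary timestep itself to the high-noise stage is also an assumption.

```python
# Hypothetical sketch of stage selection by timestep; names are illustrative
# and do not come from NeMo AutoModel or Diffusers.
NUM_TRAIN_TIMESTEPS = 1000  # assumption: the scheduler's num_train_timesteps
BOUNDARY_RATIO = 0.875      # default quoted in this document


def select_stage(timestep: float,
                 boundary_ratio: float = BOUNDARY_RATIO,
                 num_train_timesteps: int = NUM_TRAIN_TIMESTEPS) -> str:
    """Return the name of the transformer that would denoise this timestep."""
    boundary_timestep = boundary_ratio * num_train_timesteps
    # Assumption: the boundary timestep itself is handled by the high-noise stage.
    if timestep >= boundary_timestep:
        return "transformer"    # high-noise stage
    return "transformer_2"      # low-noise stage


if __name__ == "__main__":
    # Walk from noisy to clean timesteps and show where the switch happens.
    for t in (999, 900, 875, 874, 500, 0):
        print(t, select_stage(t))
```

With these assumed values the boundary falls at timestep 875, so only the noisiest eighth of the schedule is routed to the high-noise transformer.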


Task	Text-to-Video
Architecture	DiT (Flow Matching), two-stage
Parameters	14B + 14B (high-noise + low-noise)
HF Org	Wan-AI

Available Models

Wan2.2-T2V-A14B: two transformers, ~14B params each, boundary_ratio 0.875

Task

Text-to-Video (T2V)

Example HF Models

Model	HF ID
Wan 2.2 T2V-A14B	`Wan-AI/Wan2.2-T2V-A14B-Diffusers`

Example Recipes

Recipe	Description
wan2_2_t2v_flow.yaml	Fine-tune — two-stage with `model.stage` knob
generate_wan22.yaml	Inference — loads both stage checkpoints

Two-stage finetuning workflow

Because each transformer is ~14B parameters, NeMo AutoModel finetunes them one at a time:

Preprocess once — produces a single cached .meta set reusable across both stages:

$ python -m tools.diffusion.preprocessing_multiprocess video \
>     --video_dir /path/to/videos --output_dir /path/to/wan22_cache \
>     --processor wan2.2 --caption_format meta_json --caption_field caption \
>     --resolution_preset 512p --target_frames 81

Finetune the high-noise stage (pipe.transformer, sigma ∈ [boundary_ratio, 1.0]):

$ torchrun --nproc-per-node=8 \
>     examples/diffusion/finetune/finetune.py \
>     -c examples/diffusion/finetune/wan2_2_t2v_flow.yaml \
>     --model.stage=high_noise \
>     --data.dataloader.cache_dir=/path/to/wan22_cache \
>     --checkpoint.checkpoint_dir=./WAN22_CKPT/wan22_high \
>     --fsdp.cpu_offload=true

Finetune the low-noise stage (pipe.transformer_2, sigma ∈ [0.0, boundary_ratio]):

$ torchrun --nproc-per-node=8 \
>     examples/diffusion/finetune/finetune.py \
>     -c examples/diffusion/finetune/wan2_2_t2v_flow.yaml \
>     --model.stage=low_noise \
>     --data.dataloader.cache_dir=/path/to/wan22_cache \
>     --checkpoint.checkpoint_dir=./WAN22_CKPT/wan22_low \
>     --fsdp.cpu_offload=true

Run inference loading both stage checkpoints:

$ python examples/diffusion/generate/generate.py \
>     -c examples/diffusion/generate/configs/generate_wan22.yaml
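
The sigma ranges quoted for the two finetuning stages above partition [0.0, 1.0] at boundary_ratio. The sketch below shows that partition and a stage-restricted sampler. The helper names sigma_range and sample_sigma are hypothetical, and uniform sampling is an assumption made for illustration; the recipe may draw sigmas from a different distribution.

```python
import random
from typing import Tuple

BOUNDARY_RATIO = 0.875  # default quoted in this document


def sigma_range(stage: str, boundary_ratio: float = BOUNDARY_RATIO) -> Tuple[float, float]:
    """Return the (low, high) sigma interval trained by the given stage."""
    if stage == "high_noise":
        return (boundary_ratio, 1.0)   # pipe.transformer
    if stage == "low_noise":
        return (0.0, boundary_ratio)   # pipe.transformer_2
    raise ValueError(f"unknown stage: {stage!r}")


def sample_sigma(stage: str, rng: random.Random) -> float:
    """Draw one sigma inside the stage's interval.

    Assumption: uniform sampling, chosen only to keep the sketch simple.
    """
    low, high = sigma_range(stage)
    return rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    for stage in ("high_noise", "low_noise"):
        print(stage, sigma_range(stage), sample_sigma(stage, rng))
```

Because both stages read the same preprocessed cache, the stage setting only changes which sigma interval (and which transformer) a run trains.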

Each finetuning run holds only one of the two transformers on GPU: the recipe drops the unused one before sharding, so an FSDP2 dp=8 setup on 8×80GB H100 GPUs fits a single 14B model plus its AdamW state. --fsdp.cpu_offload=true is recommended; it offloads the sharded parameters and optimizer state to host RAM between steps.
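
A back-of-envelope estimate shows why training one transformer at a time matters. The sketch assumes fp32 weights, gradients, and both AdamW moments (16 bytes per parameter), evenly sharded across GPUs; the precision actually used by the recipe is not stated here, and activations, the text encoder, and the VAE are not modeled. The helper name per_gpu_state_gb is hypothetical.

```python
def per_gpu_state_gb(num_params: float, num_gpus: int,
                     bytes_per_param: int = 16) -> float:
    """Estimate sharded training state per GPU in GB (1 GB = 1e9 bytes).

    Assumption: 16 bytes per parameter = fp32 weight (4) + fp32 gradient (4)
    + two fp32 AdamW moments (8), split evenly across num_gpus.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 1e9


if __name__ == "__main__":
    one_stage = per_gpu_state_gb(14e9, num_gpus=8)    # a single ~14B transformer
    both_stages = per_gpu_state_gb(28e9, num_gpus=8)  # both transformers at once
    print(f"one transformer:   {one_stage:.0f} GB per GPU")
    print(f"both transformers: {both_stages:.0f} GB per GPU")
```

Under these assumptions a single transformer needs roughly 28 GB of state per GPU, while training both together would need roughly 56 GB, leaving much less of an 80 GB device for activations and other buffers.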

Try with NeMo AutoModel

1. Install (see the Installation Guide for full instructions):

$ pip install nemo-automodel

2. Clone the repo to get the example recipes:

$ git clone https://github.com/NVIDIA-NeMo/Automodel.git
$ cd Automodel

3. Run the recipe from inside the repo:

$ torchrun --nproc-per-node=8 \
>   examples/diffusion/finetune/finetune.py \
>   -c examples/diffusion/finetune/wan2_2_t2v_flow.yaml \
>   --model.stage=high_noise \
>   --fsdp.cpu_offload=true

Run with Docker

1. Pull the container and mount a checkpoint directory:

$ docker run --gpus all -it --rm \
>   --shm-size=8g \
>   -v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
>   nvcr.io/nvidia/nemo-automodel:26.06.00

2. Navigate to the AutoModel directory (where the recipes are):

$ cd /opt/Automodel

3. Run the recipe:

$ torchrun --nproc-per-node=8 \
>   examples/diffusion/finetune/finetune.py \
>   -c examples/diffusion/finetune/wan2_2_t2v_flow.yaml \
>   --model.stage=high_noise \
>   --fsdp.cpu_offload=true

See the Installation Guide and Diffusion Fine-Tuning Guide.

Training

See the Diffusion Training and Fine-Tuning Guide and Dataset Preparation.

Hugging Face Model Cards

Wan-AI/Wan2.2-T2V-A14B-Diffusers