Wan 2.2 T2V-A14B
Wan 2.2 T2V-A14B is the successor to Wan 2.1 and, like it, a text-to-video flow-matching DiT. Its defining feature is a two-stage denoising pipeline: a high-noise transformer handles the early, noisy timesteps and a low-noise transformer_2 handles the later, cleaner timesteps. The switch happens at timestep boundary_ratio * num_train_timesteps, where boundary_ratio defaults to 0.875. Each transformer has ~14B parameters, for ~28B total.
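The hand-off between the two transformers can be illustrated with a minimal sketch. The helper names below are hypothetical, the value of 1000 for num_train_timesteps is an assumption (this page does not state it), and treating the boundary timestep itself as part of the high-noise stage is likewise an assumption rather than something the text confirms.

```python
# Illustrative only: helper names are hypothetical, not part of NeMo AutoModel.
NUM_TRAIN_TIMESTEPS = 1000  # assumption: not stated on this page
BOUNDARY_RATIO = 0.875      # default boundary_ratio from the description above


def boundary_timestep(boundary_ratio=BOUNDARY_RATIO,
                      num_train_timesteps=NUM_TRAIN_TIMESTEPS):
    """Timestep at which denoising switches between the two transformers."""
    return boundary_ratio * num_train_timesteps


def select_stage(timestep,
                 boundary_ratio=BOUNDARY_RATIO,
                 num_train_timesteps=NUM_TRAIN_TIMESTEPS):
    """Return which transformer would handle the given timestep.

    Timesteps at or above the boundary are noisy and go to the high-noise
    model ("transformer"); the rest go to the low-noise model ("transformer_2").
    """
    if timestep >= boundary_timestep(boundary_ratio, num_train_timesteps):
        return "transformer"
    return "transformer_2"


if __name__ == "__main__":
    # Denoising runs from high timesteps (noisy) down to low timesteps (clean).
    for t in (999, 900, 875, 874, 500, 0):
        print(t, select_stage(t))
```

Under these assumptions the boundary falls at timestep 875, so only the noisiest eighth of the schedule is routed to the high-noise transformer.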
Available Models
- Wan2.2-T2V-A14B: two transformers, ~14B params each, boundary_ratio 0.875
Task
- Text-to-Video (T2V)
Example HF Models
Example Recipes
Two-stage finetuning workflow
Because each transformer is ~14B parameters, NeMo AutoModel finetunes them one at a time:
1. Preprocess once. This produces a single cached .meta set reusable across both stages:
2. Finetune the high-noise stage (pipe.transformer, sigma ∈ [boundary_ratio, 1.0]):
3. Finetune the low-noise stage (pipe.transformer_2, sigma ∈ [0.0, boundary_ratio]):
4. Run inference loading both stage checkpoints:
Each finetuning run holds only one of the two transformers on GPU: the recipe drops the unused one before sharding, so an FSDP2 dp=8 setup on 8×80GB H100 GPUs fits a single 14B model plus its AdamW state. Setting --fsdp.cpu_offload=true is recommended; it moves the sharded parameters and optimizer state to host RAM at step boundaries.
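A rough back-of-the-envelope estimate shows why training one transformer at a time matters. The sketch below is not taken from the recipe: the function name is hypothetical, and the per-parameter byte counts (fp32 weights, fp32 gradients, two fp32 AdamW moments) are assumptions that will differ under mixed precision. It also ignores activations, the VAE, the text encoder, and allocator overhead.

```python
# Illustrative estimate only; byte counts are assumptions, not recipe settings.
def sharded_state_gb(num_params, dp_size,
                     bytes_param=4,   # assumption: fp32 weights
                     bytes_grad=4,    # assumption: fp32 gradients
                     bytes_optim=8):  # AdamW keeps two moments; fp32 assumed
    """Per-GPU memory (GB, 1e9 bytes) for params + grads + AdamW state
    when everything is sharded evenly across dp_size ranks."""
    total_bytes = num_params * (bytes_param + bytes_grad + bytes_optim)
    return total_bytes / dp_size / 1e9


if __name__ == "__main__":
    one_stage = sharded_state_gb(14e9, dp_size=8)    # single 14B transformer
    both_stages = sharded_state_gb(28e9, dp_size=8)  # both transformers at once
    print(f"one transformer:   {one_stage:.1f} GB per GPU")
    print(f"both transformers: {both_stages:.1f} GB per GPU")
```

With these assumed byte counts, a single transformer needs about 28 GB of sharded state per GPU, while keeping both resident would need about 56 GB, leaving much less of an 80 GB device for activations.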
Try with NeMo AutoModel
1. Install (full instructions):
2. Clone the repo to get the example recipes:
3. Run the recipe from inside the repo:
Run with Docker
1. Pull the container and mount a checkpoint directory:
2. Navigate to the AutoModel directory (where the recipes are):
3. Run the recipe:
See the Installation Guide and Diffusion Fine-Tuning Guide.
Training
See the Diffusion Training and Fine-Tuning Guide and Dataset Preparation.