Diffusion Models


Introduction

Diffusion models are a class of generative models that learn to produce images or videos by iteratively denoising samples from a noise distribution. NeMo AutoModel supports training diffusion models using flow matching, a framework in which the model learns to regress the velocity field along straight-line interpolation paths between noise and data.
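
The flow matching objective described above can be illustrated with a minimal sketch. This is not NeMo AutoModel's implementation: the function names are hypothetical, plain Python lists stand in for latent tensors, and the time convention (t = 0 is data, t = 1 is noise) is an assumption that varies between codebases.

```python
import random


def interpolate(data, noise, t):
    """Point on the straight path between data and noise.

    Assumed convention: t = 0 returns the data sample, t = 1 returns the noise.
    """
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]


def target_velocity(data, noise):
    """Velocity of the straight path, d(x_t)/dt = noise - data (constant in t)."""
    return [n - d for d, n in zip(data, noise)]


def flow_matching_loss(predict_velocity, data, rng):
    """Single-sample estimate of the flow matching loss.

    predict_velocity is a hypothetical callable (x_t, t) -> velocity that
    stands in for the diffusion model.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]  # sample from the noise distribution
    t = rng.random()                              # sample a time in [0, 1)
    x_t = interpolate(data, noise, t)
    predicted = predict_velocity(x_t, t)
    target = target_velocity(data, noise)
    # Mean squared error between predicted and target velocity
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(data)
```

In a real training loop the average of this loss over many data samples, noise samples, and times is minimized with respect to the model parameters.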

NeMo AutoModel integrates with Hugging Face Diffusers for model loading and generation, while providing its own distributed training infrastructure via the TrainDiffusionRecipe. This recipe handles FSDP2 parallelization, flow matching loss computation, multiresolution bucketed dataloading, and checkpoint management.
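
To give a sense of what multiresolution bucketed dataloading involves, the sketch below groups samples by resolution so that every batch contains tensors of a single shape. It only illustrates the general idea and makes no claim about how TrainDiffusionRecipe implements it; the function name and the "height" and "width" keys are hypothetical.

```python
from collections import defaultdict


def bucket_by_resolution(samples, batch_size):
    """Group samples by (height, width) and split each group into batches.

    samples is a list of dicts with hypothetical "height" and "width" keys.
    Returns a list of (resolution, batch) pairs; every batch holds samples of
    one resolution only, so its tensors can be stacked without padding.
    """
    buckets = defaultdict(list)
    for sample in samples:
        buckets[(sample["height"], sample["width"])].append(sample)

    batches = []
    for resolution in sorted(buckets):
        items = buckets[resolution]
        # The last batch of a bucket may be smaller than batch_size
        for start in range(0, len(items), batch_size):
            batches.append((resolution, items[start:start + batch_size]))
    return batches
```

A production dataloader would typically also shuffle within and across buckets and coordinate batches across distributed ranks, which this sketch omits.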

Supported Models

Owner                 Model             Task           Architecture
Wan AI                Wan 2.1 T2V       Text-to-Video  DiT (Flow Matching)
Black Forest Labs     FLUX.1-dev        Text-to-Image  DiT (Flow Matching)
Hunyuan Community     HunyuanVideo 1.5  Text-to-Video  DiT (Flow Matching)
Qwen / Alibaba Cloud  Qwen-Image        Text-to-Image  DiT (Flow Matching)

Supported Workflows

  • Pretraining: Train from randomly initialized weights on large-scale datasets
  • Fine-tuning: Adapt pretrained model weights to a specific dataset or style
  • Generation: Run inference with pretrained or fine-tuned checkpoints

Dataset

Diffusion training requires pre-encoded .meta files containing VAE latents and text embeddings. Raw videos or images must be preprocessed before training. See the Diffusion Dataset Preparation guide.

Train Diffusion Models

For a complete walkthrough of training configuration, model-specific settings, and launch commands, see the Diffusion Training and Fine-Tuning Guide.