Diffusion Model Fine-Tuning with NeMo AutoModel
Introduction
Diffusion models generate images and videos by learning to reverse a noise process: starting from random noise, they iteratively refine it into coherent visual output guided by a text prompt. Pretrained diffusion models (like FLUX.1-dev for images or Wan 2.1 for video) produce impressive general-purpose results, but they know nothing about your particular visual domain, style, or subject matter. Fine-tuning bridges that gap: you adapt the model on your own data so it produces outputs that match your requirements, without the cost of training from scratch.
Under the hood, NeMo AutoModel uses flow matching, a modern generative framework that learns to transform noise into data by regressing a velocity field along straight interpolation paths. It integrates with Hugging Face Diffusers to provide distributed fine-tuning for text-to-image and text-to-video models. This guide walks you through the process end to end, from installation through training and inference, using Wan 2.1 T2V 1.3B as a running example.
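To make the flow matching objective concrete, here is a minimal pure-Python sketch. It assumes the convention that t = 0 is clean data and t = 1 is pure noise, so the regression target is the constant velocity `noise - data`; NeMo AutoModel's actual implementation and time convention may differ, and every function name below is illustrative rather than part of the library.

```python
import random


def interpolate(data, noise, t):
    """Point on the straight path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]


def target_velocity(data, noise):
    """Velocity of the straight path: d(x_t)/dt = noise - data (assumed convention)."""
    return [n - d for d, n in zip(data, noise)]


def flow_matching_loss(predicted, data, noise):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(data, noise)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


rng = random.Random(0)
data = [0.5, -1.0, 2.0]                      # stand-in for a VAE latent
noise = [rng.gauss(0.0, 1.0) for _ in data]  # Gaussian noise sample
t = rng.random()                             # random timestep in [0, 1)
x_t = interpolate(data, noise, t)            # noisy input a model would receive

# A perfect prediction gives zero loss; predicting all zeros does not.
loss_perfect = flow_matching_loss(target_velocity(data, noise), data, noise)
loss_zero = flow_matching_loss([0.0] * len(data), data, noise)
```

In a real training step the prediction would come from the transformer evaluated at `x_t`, `t`, and the text embedding, rather than from the oracle used here.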
Workflow Overview
For model-specific configuration (FLUX.1-dev, HunyuanVideo), see Model-Specific Notes.
Supported Models
All models use FSDP2 for distributed training and flow matching for loss computation.
Install NeMo AutoModel
Alternatively, if you run into dependency or driver issues, use the pre-built Docker container:
Docker users: Checkpoints are lost when the container exits unless you bind-mount the checkpoint directory to the host. See Install with NeMo Docker Container and Save Checkpoints When Using Docker.
For the full set of installation methods, see the installation guide.
Prepare Your Dataset
Diffusion models operate in latent space (a compressed representation of visual data) rather than directly on raw images or videos. To avoid re-encoding data on every training step, the preprocessing pipeline encodes all inputs ahead of time and saves them as .meta files.
Each .meta file contains:
- Latent representations produced by a VAE (Variational Autoencoder) from the raw visual data
- Text embeddings produced by a text encoder from the associated captions/prompts
Fine-tuning then operates entirely on these pre-encoded .meta files, which is significantly faster than encoding on the fly.
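The encode-once, load-many idea can be sketched as follows. This is only an illustration: the real on-disk layout of .meta files is not described in this guide, so the pickle format, the dictionary keys, and the two `fake_*` encoders are all assumptions standing in for the actual VAE, text encoder, and file format.

```python
import pickle
import tempfile
from pathlib import Path


def fake_vae_encode(pixels):
    """Stand-in for a VAE encoder: averages adjacent pairs to 'compress' the input."""
    return [(a + b) / 2.0 for a, b in zip(pixels[0::2], pixels[1::2])]


def fake_text_encode(caption):
    """Stand-in for a text encoder: one number per word."""
    return [float(len(word)) for word in caption.split()]


def preprocess(sample, out_dir):
    """Encode a sample once and cache the result on disk (hypothetical layout)."""
    record = {
        "latents": fake_vae_encode(sample["pixels"]),
        "text_embeddings": fake_text_encode(sample["caption"]),
    }
    path = Path(out_dir) / (sample["name"] + ".meta")
    with open(path, "wb") as f:
        pickle.dump(record, f)
    return path


def load_cached(path):
    """Training-time read: no encoder is invoked, only a file load."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "name": "clip_0001",
        "pixels": [0.0, 1.0, 2.0, 3.0],
        "caption": "a red fox running",
    }
    meta_path = preprocess(sample, tmp)
    cached = load_cached(meta_path)
```

The point of the pattern is that the expensive encoders run once per sample during preprocessing, while every training epoch only pays for a file read.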
Preprocess your data using the built-in tool at tools/diffusion/preprocessing_multiprocess.py. The script provides image and video subcommands:
Video preprocessing (using Wan 2.1 as a running example):
Image preprocessing (FLUX):
Video preprocessing (HunyuanVideo):
For the full set of arguments and input format details, see the Diffusion Dataset Preparation guide.
Configure Your Training Recipe
Fine-tuning is driven by two components:
- A recipe script (e.g., train.py): the Python entry point that orchestrates the training loop, including loading the model, building the dataloader, running forward/backward passes, computing the flow matching loss, checkpointing, and logging.
- A YAML configuration file: specifies all settings the recipe uses, such as which model to fine-tune, where the data lives, optimizer hyperparameters, and the parallelism strategy. You customize training by editing this file rather than modifying code, so the same recipe scales from one GPU to hundreds.
Below is the annotated wan2_1_t2v_flow.yaml, with each section explained:
Config Field Reference
Fine-Tune the Model
Launch fine-tuning with torchrun:
Adjust --nproc-per-node to match the number of GPUs on your node, and ensure fsdp.dp_size in the YAML matches.
Multi-Node Training
When a single node doesn't provide enough GPUs or memory for your workload, you can scale training across multiple nodes. NeMo AutoModel handles multi-node distributed training through torchrun rendezvous and FSDP2, so the same recipe script works on one node or many.
YAML Configuration Changes
The main change is in the fsdp section. Set dp_size to the total number of GPUs across all nodes, and optionally increase dp_replicate_size for gradient replication across nodes.
For example, to train on 2 nodes with 8 GPUs each (16 GPUs total):
A complete multi-node config is provided at wan2_1_t2v_flow_multinode.yaml.
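The relationship between node count, GPUs per node, and `fsdp.dp_size` is simple arithmetic, and a small pre-launch check can catch mismatches early. The sketch below is a hypothetical helper, not part of NeMo AutoModel; it assumes the YAML has already been loaded into a nested dictionary and only checks the rule stated above (dp_size equals the total GPU count).

```python
def expected_dp_size(num_nodes, gpus_per_node):
    """Total number of GPUs participating in training."""
    return num_nodes * gpus_per_node


def check_fsdp_config(config, num_nodes, gpus_per_node):
    """Raise ValueError if fsdp.dp_size does not match the total GPU count."""
    expected = expected_dp_size(num_nodes, gpus_per_node)
    actual = config["fsdp"]["dp_size"]
    if actual != expected:
        raise ValueError(
            f"fsdp.dp_size is {actual}, but {num_nodes} node(s) x "
            f"{gpus_per_node} GPU(s) = {expected}"
        )
    return expected


# 2 nodes with 8 GPUs each, as in the example above.
config = {"fsdp": {"dp_size": 16}}
total_gpus = check_fsdp_config(config, num_nodes=2, gpus_per_node=8)
```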
Launch with torchrun
Run the following command on each node, setting NODE_RANK to 0 on the first node, 1 on the second, and so on:
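To see how a per-node rank turns into per-process ranks, the sketch below reproduces the usual numbering: global rank = node rank x processes per node + local rank. It is an illustration under that assumption. torchrun exposes these values to each worker through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables, but with some rendezvous backends the assignment of node ranks can differ from what you pass in. The function name is hypothetical.

```python
def worker_ranks(node_rank, nproc_per_node, nnodes):
    """Return the rank-related values each worker on one node would see."""
    world_size = nnodes * nproc_per_node
    return [
        {
            "RANK": node_rank * nproc_per_node + local_rank,  # global rank
            "LOCAL_RANK": local_rank,                         # GPU index on this node
            "WORLD_SIZE": world_size,                         # total processes
        }
        for local_rank in range(nproc_per_node)
    ]


# Two nodes with 8 GPUs each.
node0 = worker_ranks(node_rank=0, nproc_per_node=8, nnodes=2)
node1 = worker_ranks(node_rank=1, nproc_per_node=8, nnodes=2)
```

Under this layout the first node hosts global ranks 0 through 7 and the second hosts 8 through 15, which is why each node needs a distinct `NODE_RANK`.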
Model-Specific Notes
Use the table below to pick the right model for your use case:
Wan 2.1 T2V 1.3B
- Adapter type: simple
- Dataloader: build_video_multiresolution_dataloader with model_type: wan
- Config: wan2_1_t2v_flow.yaml
FLUX.1-dev (Text-to-Image)
- Adapter type: flux
- Dataloader: build_text_to_image_multiresolution_dataloader
- Key differences:
  - Uses pipeline_spec to specify the transformer architecture
  - Requires guidance_scale in adapter kwargs
  - Uses logit_normal timestep sampling instead of uniform
- Config: flux_t2i_flow.yaml
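The difference between uniform and logit-normal timestep sampling can be sketched in a few lines. A logit-normal sample is a Gaussian draw passed through a sigmoid, which concentrates timesteps around the middle of the [0, 1] range. The mean of 0 and standard deviation of 1 used here are assumptions for illustration, not values taken from flux_t2i_flow.yaml, and the function names are hypothetical.

```python
import math
import random


def sample_uniform(rng):
    """Timestep drawn uniformly from [0, 1)."""
    return rng.random()


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Timestep = sigmoid(z) with z ~ Normal(mean, std)."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def fraction_in_middle(timesteps, low=0.25, high=0.75):
    """Share of timesteps that fall in the central half of the range."""
    return sum(low <= t <= high for t in timesteps) / len(timesteps)


rng = random.Random(0)
uniform_ts = [sample_uniform(rng) for _ in range(10000)]
logit_normal_ts = [sample_logit_normal(rng) for _ in range(10000)]

uniform_mid = fraction_in_middle(uniform_ts)
logit_normal_mid = fraction_in_middle(logit_normal_ts)
```

With these settings the logit-normal sampler places noticeably more of its draws in the central half of the range than the uniform sampler does, so training spends more steps on intermediate noise levels.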
HunyuanVideo 1.5
- Adapter type: hunyuan
- Dataloader: build_video_multiresolution_dataloader with model_type: hunyuan
- Key differences:
  - Requires activation_checkpointing: true in FSDP config due to model size
  - Uses condition latents in adapter kwargs
  - Uses logit_normal timestep sampling
- Config: hunyuan_t2v_flow.yaml
Generation / Inference
Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference: training is where the model learns from data, and inference is where it produces new outputs.
In diffusion models, generation works by starting from random noise and iteratively denoising it, guided by your text prompt, until a clean image or video emerges.
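The iterative denoising loop can be illustrated with a simple Euler integrator. This is a toy sketch, not the sampler generate.py uses: it assumes the convention that t = 1 is pure noise and t = 0 is clean data, and it replaces the trained, prompt-conditioned network with an oracle velocity function that already knows the target.

```python
import random


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (data)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt                 # current time, from 1.0 down to dt
        v = velocity_fn(x, t)               # a trained model would be called here
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -1.0, 2.0]  # stand-in for the latent the prompt should produce


def oracle_velocity(x, t):
    """Ideal straight-path velocity: (x_t - data) / t, which equals noise - data."""
    return [(xi - ti) / t for xi, ti in zip(x, target)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle_velocity, noise, num_steps=20)
```

Because the oracle's path is exactly straight, the integrator lands on the target up to floating-point error; a learned velocity field is only approximately straight, which is why the number of sampling steps affects output quality in practice.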
The generation script (generate.py) handles this: it loads your model weights (pretrained or fine-tuned), configures the diffusion sampler, and produces outputs for one or more prompts.
Single-GPU (Wan 2.1 1.3B):
Multi-GPU (Wan 2.1 1.3B):
Wan 2.1 supports tensor parallelism for inference, which shards the transformer across GPUs to reduce per-GPU memory. Pass the distributed config via CLI overrides:
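As a rough picture of what sharding a layer across GPUs means, the sketch below splits the output columns of one weight matrix between shards, computes each partial result separately, and concatenates them. This column-wise split is one common form of tensor parallelism; the guide does not say which sharding layout NeMo AutoModel applies to Wan 2.1, so treat the layout and all names here as assumptions. Plain lists stand in for GPU tensors, and the column count is assumed to divide evenly by the shard count.

```python
def matvec(x, weight):
    """Compute x @ weight for a vector x and a matrix of shape (in_dim, out_dim)."""
    out_dim = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(out_dim)]


def split_columns(weight, num_shards):
    """Give each shard a contiguous block of output columns."""
    per_shard = len(weight[0]) // num_shards
    return [
        [row[s * per_shard:(s + 1) * per_shard] for row in weight]
        for s in range(num_shards)
    ]


def tensor_parallel_matvec(x, weight, num_shards):
    """Each shard multiplies by its own slice; the results are then concatenated."""
    shards = split_columns(weight, num_shards)
    partial_outputs = [matvec(x, shard) for shard in shards]  # one per device
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0]
weight = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]
full = matvec(x, weight)
sharded = tensor_parallel_matvec(x, weight, num_shards=2)
```

Each shard holds only half of the weight matrix, which is where the per-GPU memory saving comes from; the cost is the communication needed to gather the partial outputs.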
With a fine-tuned checkpoint:
FLUX image generation:
HunyuanVideo:
Available Generation Configs
You can use --model.checkpoint ./checkpoints/LATEST to automatically load the most recent checkpoint.
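Dotted overrides such as `--model.checkpoint ./checkpoints/LATEST` map naturally onto a nested configuration. The sketch below shows one way such an override could be applied to a dictionary loaded from YAML. It is a hypothetical parser written for illustration, not NeMo AutoModel's actual override logic, and it keeps every value as a string rather than guessing types.

```python
def apply_overrides(config, argv):
    """Apply ['--a.b', 'value', ...] pairs to a nested dict, in place."""
    if len(argv) % 2 != 0:
        raise ValueError("overrides must come in '--key value' pairs")
    for flag, value in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a '--dotted.key' flag, got {flag!r}")
        keys = flag[2:].split(".")
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # create missing sections on the way
        node[keys[-1]] = value
    return config


config = {"model": {"checkpoint": None}}
apply_overrides(config, ["--model.checkpoint", "./checkpoints/LATEST"])
```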