Diffusion Model Fine-Tuning with NeMo AutoModel


Introduction

Diffusion models generate images and videos by learning to reverse a noise process: they start from random noise and iteratively refine it into coherent visual output guided by a text prompt. Pretrained diffusion models (like FLUX.1-dev for images or Wan 2.1 for video) produce impressive general-purpose results, but they know nothing about your particular visual domain, style, or subject matter. Fine-tuning bridges that gap: you adapt the model on your own data so it produces outputs that match your requirements, without the cost of training from scratch.

Under the hood, NeMo AutoModel uses flow matching, a modern generative framework that learns to transform noise into data by regressing a velocity field along straight interpolation paths. It integrates with Hugging Face Diffusers to provide distributed fine-tuning for text-to-image and text-to-video models. This guide walks you through the process end to end, from installation through training and inference, using Wan 2.1 T2V 1.3B as a running example.
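
As a rough sketch of what "regressing a velocity field along straight interpolation paths" means, the block below writes the objective in one common flow matching convention. The symbols (x_0 for a clean latent, \epsilon for Gaussian noise, \sigma for the noise level, c for the text conditioning, v_\theta for the network) and the direction of time are assumptions chosen for illustration; the exact parameterization and loss weighting used by NeMo AutoModel may differ.

```latex
\begin{aligned}
x_\sigma &= (1 - \sigma)\, x_0 + \sigma\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I),\ \sigma \in [0, 1] \\
\frac{\mathrm{d} x_\sigma}{\mathrm{d} \sigma} &= \epsilon - x_0 \\
\mathcal{L}(\theta) &= \mathbb{E}_{x_0,\, \epsilon,\, \sigma}
  \left\| v_\theta(x_\sigma, \sigma, c) - (\epsilon - x_0) \right\|^2
\end{aligned}
```

Because the path from data to noise is a straight line, the regression target \epsilon - x_0 does not depend on \sigma, which is what makes the objective simple to compute.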

Workflow Overview

+--------------+    +--------------+    +--------------+    +--------------+    +--------------+
| 1. Install   |--->| 2. Prepare   |--->| 3. Configure |--->| 4. Train     |--->| 5. Generate  |
|              |    |    Data      |    |              |    |              |    |              |
| pip install  |    | Encode to    |    | YAML recipe  |    | torchrun     |    | Run inference|
| or Docker    |    | .meta files  |    |              |    |              |    | with ckpt    |
+--------------+    +--------------+    +--------------+    +--------------+    +--------------+

| Step | Section | What You Do |
|---|---|---|
| 1. Install | Install NeMo AutoModel | Install the package via pip or Docker |
| 2. Prepare Data | Prepare Your Dataset | Encode raw images/videos into .meta latent files |
| 3. Configure | Configure Your Training Recipe | Write a YAML config specifying model, data, and training settings |
| 4. Train | Fine-Tune the Model | Launch training with torchrun on a single node |
| 4b. Multi-Node | Multi-Node Training | Scale training across multiple nodes |
| 5. Generate | Generation / Inference | Run inference using the fine-tuned checkpoint |

For model-specific configuration (FLUX.1-dev, HunyuanVideo), see Model-Specific Notes.

Supported Models

| Model | HF Model ID | Task | Parameters | Example Config |
|---|---|---|---|---|
| Wan 2.1 T2V 1.3B | Wan-AI/Wan2.1-T2V-1.3B-Diffusers | Text-to-Video | 1.3B | wan2_1_t2v_flow.yaml |
| Wan 2.2 T2V-A14B (two-stage) | Wan-AI/Wan2.2-T2V-A14B-Diffusers | Text-to-Video | 14B + 14B | wan2_2_t2v_flow.yaml |
| FLUX.1-dev | black-forest-labs/FLUX.1-dev | Text-to-Image | 12B | flux_t2i_flow.yaml |
| HunyuanVideo 1.5 | hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v | Text-to-Video | Not listed | hunyuan_t2v_flow.yaml |

All models use FSDP2 for distributed training and flow matching for loss computation.

Install NeMo AutoModel

$ pip3 install nemo-automodel

Alternatively, if you run into dependency or driver issues, use the pre-built Docker container:

$ docker pull nvcr.io/nvidia/nemo-automodel:26.06.00
$ docker run --gpus all -it --rm --shm-size=8g nvcr.io/nvidia/nemo-automodel:26.06.00

Docker users: Checkpoints are lost when the container exits unless you bind-mount the checkpoint directory to the host. See Install with NeMo Docker Container and Save Checkpoints When Using Docker.

For the full set of installation methods, see the installation guide.

Prepare Your Dataset

Diffusion models operate in latent space, a compressed representation of visual data, rather than directly on raw images or videos. To avoid re-encoding data on every training step, the preprocessing pipeline encodes all inputs ahead of time and saves them as .meta files.

Each .meta file contains:

  • Latent representations produced by a VAE (Variational Autoencoder) from the raw visual data
  • Text embeddings produced by a text encoder from the associated captions/prompts

Fine-tuning then operates entirely on these pre-encoded .meta files, which is significantly faster than encoding on the fly.

Preprocess your data using the built-in tool at tools/diffusion/preprocessing_multiprocess.py. The script provides image and video subcommands:

Video preprocessing (using Wan 2.1 as a running example):

$ python -m tools.diffusion.preprocessing_multiprocess video \
    --video_dir /data/videos \
    --output_dir /cache \
    --processor wan \
    --resolution_preset 512p \
    --caption_format sidecar

Image preprocessing (FLUX):

$ python -m tools.diffusion.preprocessing_multiprocess image \
    --image_dir /data/images \
    --output_dir /cache \
    --processor flux

Video preprocessing (HunyuanVideo):

$ python -m tools.diffusion.preprocessing_multiprocess video \
    --video_dir /data/videos \
    --output_dir /cache \
    --processor hunyuan \
    --target_frames 121 \
    --caption_format meta_json

For the full set of arguments and input format details, see the Diffusion Dataset Preparation guide.
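
Before launching training, it can be worth checking that preprocessing actually produced .meta files in the output directory. The helper below is a minimal sketch using only the standard library. The function name summarize_cache is hypothetical and not part of NeMo AutoModel, and it assumes nothing about the internal layout of a .meta file beyond the file extension.

```python
from collections import Counter
from pathlib import Path


def summarize_cache(cache_dir):
    """Count files under cache_dir by suffix and total the size of .meta files.

    Hypothetical helper for a quick sanity check; it only inspects file
    names and sizes and never opens the files.
    """
    root = Path(cache_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"cache directory not found: {root}")

    suffix_counts = Counter()
    meta_bytes = 0
    for path in root.rglob("*"):  # walk subdirectories as well
        if path.is_file():
            suffix_counts[path.suffix] += 1
            if path.suffix == ".meta":
                meta_bytes += path.stat().st_size
    return suffix_counts, meta_bytes
```

For example, counts, size = summarize_cache("/cache") followed by counts[".meta"] gives the number of encoded samples; a count of zero usually means the preprocessing command wrote somewhere else or failed.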

Configure Your Training Recipe

Fine-tuning is driven by two components:

  1. A recipe script (e.g., train.py): the Python entry point that orchestrates the training loop, including loading the model, building the dataloader, running forward/backward passes, computing the flow matching loss, checkpointing, and logging.
  2. A YAML configuration file: specifies all settings the recipe uses, such as which model to fine-tune, where the data lives, optimizer hyperparameters, and the parallelism strategy. You customize training by editing this file rather than modifying code, so the same recipe can scale from one GPU to hundreds.

Below is the annotated wan2_1_t2v_flow.yaml, with each section explained:

seed: 42

# Weights & Biases experiment tracking
wandb:
  project: wan-t2v-flow-matching
  mode: online
  name: wan2_1_t2v_fm_v2

dist_env:
  backend: nccl
  timeout_minutes: 30

# Model configuration
# pretrained_model_name_or_path: Hugging Face model ID
# mode: "finetune" loads pretrained weights and adapts them to your dataset
model:
  pretrained_model_name_or_path: Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  mode: finetune

# Training schedule
step_scheduler:
  global_batch_size: 8      # Effective batch size across all GPUs
  local_batch_size: 1       # Per-GPU batch size (gradient accumulation = global/local/num_gpus)
  ckpt_every_steps: 1000    # Checkpoint frequency
  num_epochs: 100
  log_every: 2              # Log metrics every N steps

# Data: uses pre-encoded .meta files
data:
  dataloader:
    _target_: nemo_automodel.components.datasets.diffusion.build_video_multiresolution_dataloader
    cache_dir: PATH_TO_YOUR_DATA
    model_type: wan         # "wan" for Wan 2.1, "hunyuan" for HunyuanVideo
    base_resolution: [512, 512]
    dynamic_batch_size: false
    shuffle: true
    drop_last: false
    num_workers: 0

# Optimizer
optim:
  learning_rate: 5e-6
  optimizer:
    weight_decay: 0.01
    betas: [0.9, 0.999]

# Learning rate scheduler
lr_scheduler:
  lr_decay_style: cosine
  lr_warmup_steps: 0
  min_lr: 1e-6

# Flow matching configuration
flow_matching:
  adapter_type: "simple"        # Model-specific adapter (simple, flux, hunyuan)
  adapter_kwargs: {}
  timestep_sampling: "uniform"  # How timesteps are sampled during training
  logit_mean: 0.0
  logit_std: 1.0
  flow_shift: 3.0               # Shifts the flow schedule
  mix_uniform_ratio: 0.1
  sigma_min: 0.0
  sigma_max: 1.0
  num_train_timesteps: 1000
  i2v_prob: 0.3                 # Probability of image-to-video conditioning
  use_loss_weighting: true
  log_interval: 100
  summary_log_interval: 10

# FSDP2 distributed training
fsdp:
  tp_size: 1             # Tensor parallelism
  cp_size: 1             # Context parallelism
  pp_size: 1             # Pipeline parallelism
  dp_replicate_size: 1
  dp_size: 8             # Data parallelism (number of GPUs)

# Checkpointing
checkpoint:
  enabled: true
  checkpoint_dir: PATH_TO_YOUR_CKPT_DIR
  model_save_format: torch_save
  save_consolidated: false
  restore_from: null
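
The batch-size comment in step_scheduler implies a simple relationship between the global batch size, the per-GPU batch size, and the number of GPUs. The sketch below spells out that arithmetic. The function name is hypothetical, and it assumes pure data parallelism (tp_size, cp_size, and pp_size all equal to 1), so that every GPU processes its own micro-batch.

```python
def grad_accumulation_steps(global_batch_size, local_batch_size, num_gpus):
    """Micro-batches each GPU accumulates before one optimizer step.

    Assumes pure data parallelism, so one micro-step consumes
    local_batch_size * num_gpus samples in total.
    """
    samples_per_micro_step = local_batch_size * num_gpus
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of "
            f"local_batch_size * num_gpus = {samples_per_micro_step}"
        )
    return global_batch_size // samples_per_micro_step
```

With the values in the recipe above (global 8, local 1, 8 GPUs) the result is 1, meaning no accumulation. Raising global_batch_size to 32 on the same hardware would give 4 accumulation steps.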

Config Field Reference

| Section | Required? | What to Change |
|---|---|---|
| model | Yes | Set pretrained_model_name_or_path to the Hugging Face model ID. Set mode: finetune. |
| step_scheduler | Yes | global_batch_size is the effective batch size across all GPUs. ckpt_every_steps controls checkpoint frequency. |
| data | Yes | Set cache_dir to the path containing your preprocessed .meta files. Change model_type and _target_ for different models (see Model-Specific Notes). |
| optim | Yes | learning_rate: 5e-6 is a good default for fine-tuning. |
| flow_matching | Yes | adapter_type must match the model (simple for Wan, flux for FLUX, hunyuan for HunyuanVideo). |
| fsdp | Yes | Set dp_size to the number of GPUs on your node. |
| checkpoint | Recommended | Set checkpoint_dir to a persistent path, especially in Docker. |
| wandb | Optional | Configure to enable Weights & Biases logging. |
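
To get a feel for how the lr_scheduler settings interact with learning_rate, the sketch below implements a generic cosine decay with optional linear warmup. It is an illustration of the textbook schedule under the recipe's values (peak 5e-6, min_lr 1e-6, no warmup), not the scheduler code NeMo AutoModel runs; the function name and the exact warmup shape are assumptions.

```python
import math


def cosine_lr(step, total_steps, max_lr=5e-6, min_lr=1e-6, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches max_lr at the end of warmup.
        return max_lr * (step + 1) / warmup_steps

    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine
```

Under these assumptions the rate starts at max_lr, passes through the midpoint of max_lr and min_lr halfway through training, and settles at min_lr on the final step.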

Fine-Tune the Model

Launch fine-tuning with torchrun:

$ torchrun --nproc-per-node=8 \
    examples/diffusion/finetune/finetune.py \
    -c examples/diffusion/finetune/wan2_1_t2v_flow.yaml

Adjust --nproc-per-node to match the number of GPUs on your node, and ensure fsdp.dp_size in the YAML matches.

Multi-Node Training

When a single node doesn't provide enough GPUs or memory for your workload, you can scale training across multiple nodes. NeMo AutoModel handles multi-node distributed training through torchrun rendezvous and FSDP2, so the same recipe script works on one node or many.

YAML Configuration Changes

The main change is in the fsdp section. Set dp_size to the total number of GPUs across all nodes, and optionally increase dp_replicate_size for gradient replication across nodes.

For example, to train on 2 nodes with 8 GPUs each (16 GPUs total):

fsdp:
  tp_size: 1
  cp_size: 1
  pp_size: 1
  dp_replicate_size: 2   # Replicate across 2 nodes for robustness
  dp_size: 16            # Total GPUs: 2 nodes × 8 GPUs

A complete multi-node config is provided at wan2_1_t2v_flow_multinode.yaml.
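
A mismatch between the fsdp section and the torchrun launch arguments is an easy mistake to make when moving from one node to several. The sketch below is a hypothetical pre-flight check, not a NeMo AutoModel API. It assumes the usual convention that the product of the parallel sizes equals the world size, and that dp_size must be divisible by dp_replicate_size; the framework's own validation rules may differ.

```python
def check_parallelism(nnodes, nproc_per_node, dp_size,
                      tp_size=1, cp_size=1, pp_size=1, dp_replicate_size=1):
    """Validate fsdp sizes against the torchrun launch arguments.

    Assumed rules (not confirmed against the framework):
      * dp_size * tp_size * cp_size * pp_size == nnodes * nproc_per_node
      * dp_size is a multiple of dp_replicate_size
    """
    world_size = nnodes * nproc_per_node
    model_parallel = tp_size * cp_size * pp_size
    if dp_size * model_parallel != world_size:
        raise ValueError(
            f"dp_size * tp * cp * pp = {dp_size * model_parallel}, "
            f"but torchrun will start {world_size} processes"
        )
    if dp_size % dp_replicate_size != 0:
        raise ValueError("dp_size must be a multiple of dp_replicate_size")
    return {
        "world_size": world_size,
        # GPUs that share one sharded copy of the model
        "dp_shard_size": dp_size // dp_replicate_size,
    }
```

For the 2-node example above, check_parallelism(2, 8, dp_size=16, dp_replicate_size=2) reports a world size of 16 with 8 GPUs per shard group, which under these assumptions corresponds to one shard group per node.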

Launch with torchrun

Run the following command on each node, setting NODE_RANK to 0 on the first node, 1 on the second, and so on:

$ export MASTER_ADDR=node0.hostname   # hostname or IP of the first node
$ export MASTER_PORT=29500
$ export NODE_RANK=0                  # 0 on master, 1 on second node, etc.

$ torchrun \
    --nnodes=2 \
    --nproc-per-node=8 \
    --node_rank=${NODE_RANK} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    examples/diffusion/finetune/finetune.py \
    -c examples/diffusion/finetune/wan2_1_t2v_flow_multinode.yaml
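
If you launch from a job script rather than by hand, it may help to generate the per-node command programmatically so that only the node rank varies. The following is a small sketch that formats the same flags shown above into a shell-safe string; it does not start any processes, and the function name is hypothetical.

```python
import shlex


def torchrun_command(node_rank, nnodes, nproc_per_node,
                     master_addr, master_port, script, config):
    """Build the torchrun command line for one node as a single string."""
    args = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        f"--node_rank={node_rank}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={master_addr}:{master_port}",
        script,
        "-c",
        config,
    ]
    # shlex.join quotes any argument that contains shell metacharacters.
    return shlex.join(args)


if __name__ == "__main__":
    for rank in range(2):
        print(torchrun_command(
            rank, 2, 8, "node0.hostname", 29500,
            "examples/diffusion/finetune/finetune.py",
            "examples/diffusion/finetune/wan2_1_t2v_flow_multinode.yaml",
        ))
```

Running the script prints one command per node; every line is identical except for the --node_rank value.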

Model-Specific Notes

Use the table below to pick the right model for your use case:

| Use Case | Model | Why Choose It |
|---|---|---|
| Video generation on limited hardware | Wan 2.1 T2V 1.3B | Smallest model (1.3B params): fast iteration, fits on a single A100 40GB |
| High-quality image generation | FLUX.1-dev | State-of-the-art text-to-image with 12B params and guidance-based control |
| High-quality video generation | HunyuanVideo 1.5 | Larger video model with condition-latent support for richer motion and detail |

Wan 2.1 T2V 1.3B

  • Adapter type: simple
  • Dataloader: build_video_multiresolution_dataloader with model_type: wan
  • Config: wan2_1_t2v_flow.yaml

FLUX.1-dev (Text-to-Image)

  • Adapter type: flux
  • Dataloader: build_text_to_image_multiresolution_dataloader
  • Key differences:
    • Uses pipeline_spec to specify the transformer architecture:
      model:
        pipeline_spec:
          transformer_cls: "FluxTransformer2DModel"
          subfolder: "transformer"
          load_full_pipeline: false
    • Requires guidance_scale in adapter kwargs:
      flow_matching:
        adapter_type: "flux"
        adapter_kwargs:
          guidance_scale: 3.5
          use_guidance_embeds: true
    • Uses logit_normal timestep sampling instead of uniform
  • Config: flux_t2i_flow.yaml

HunyuanVideo 1.5

  • Adapter type: hunyuan
  • Dataloader: build_video_multiresolution_dataloader with model_type: hunyuan
  • Key differences:
    • Requires activation_checkpointing: true in FSDP config due to model size
    • Uses condition latents in adapter kwargs:
      flow_matching:
        adapter_type: "hunyuan"
        adapter_kwargs:
          use_condition_latents: true
          default_image_embed_shape: [729, 1152]
    • Uses logit_normal timestep sampling
  • Config: hunyuan_t2v_flow.yaml
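
The difference between uniform and logit_normal timestep sampling, and the effect of flow_shift, is easier to see in code. The sketch below follows the formulation commonly used for flow matching models: a logit-normal sample is the sigmoid of a Gaussian draw, and the shift maps sigma to shift * sigma / (1 + (shift - 1) * sigma). How NeMo AutoModel combines these options is not specified here, so the interpretation of mix_uniform_ratio (the fraction of samples drawn uniformly) and the choice to apply the shift to every sample are assumptions.

```python
import math
import random


def shift_sigma(sigma, flow_shift):
    """Push noise levels toward 1 when flow_shift > 1 (identity when it is 1)."""
    return flow_shift * sigma / (1.0 + (flow_shift - 1.0) * sigma)


def sample_sigma(rng, timestep_sampling="uniform", logit_mean=0.0,
                 logit_std=1.0, flow_shift=3.0, mix_uniform_ratio=0.1):
    """Draw one training noise level in [0, 1]."""
    if timestep_sampling not in ("uniform", "logit_normal"):
        raise ValueError(f"unknown timestep_sampling: {timestep_sampling}")

    # Assumption: a fraction mix_uniform_ratio of samples is drawn uniformly
    # even in logit_normal mode, to keep the extremes covered.
    if timestep_sampling == "uniform" or rng.random() < mix_uniform_ratio:
        sigma = rng.random()
    else:
        z = rng.gauss(logit_mean, logit_std)
        sigma = 1.0 / (1.0 + math.exp(-z))  # sigmoid concentrates mass mid-range

    return shift_sigma(sigma, flow_shift)
```

With flow_shift: 3.0, a mid-range level of 0.5 is mapped to 0.75, so under this formulation training spends more time on noisier inputs than an unshifted schedule would.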

Generation / Inference

Once training is complete, you can use the model to generate images or videos from text prompts. This step is called inference: during training the model learns from data, whereas during inference it produces new outputs.

In diffusion models, generation works by starting from random noise and iteratively denoising it, guided by your text prompt, until a clean image or video emerges.

The generation script (generate.py) handles this: it loads your model weights (pretrained or fine-tuned), configures the diffusion sampler, and produces outputs for one or more prompts.
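
To make "iteratively denoising" concrete, the toy sketch below integrates a velocity field with fixed-size Euler steps from pure noise (t = 1) to a clean sample (t = 0). It is not what generate.py does internally: the trained transformer is replaced by an oracle that knows the target, latents are plain Python lists, and the convention x_t = (1 - t) * data + t * noise is an assumption made for this example.

```python
import random


def euler_sample(velocity_fn, noise, num_steps=20):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt  # current noise level, always > 0 inside the loop
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # step toward the data
    return x


def make_oracle_velocity(target):
    """Toy stand-in for the trained network: exact velocity toward `target`."""
    def velocity(x, t):
        # On the straight path, noise - target == (x_t - target) / t.
        return [(xi - ti) / t for xi, ti in zip(x, target)]
    return velocity


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, -1.0, 2.0]
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    sample = euler_sample(make_oracle_velocity(target), noise, num_steps=10)
    print([round(value, 6) for value in sample])
```

Because the oracle velocity is exact and the path is a straight line, the Euler steps land on the target up to floating-point error. A real model only approximates the velocity, which is why the number of sampling steps affects output quality.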

Single-GPU (Wan 2.1 1.3B):

$ python examples/diffusion/generate/generate.py \
    -c examples/diffusion/generate/configs/generate_wan.yaml

Multi-GPU (Wan 2.1 1.3B):

Wan 2.1 supports tensor parallelism for inference, which shards the transformer across GPUs to reduce per-GPU memory. Pass the distributed config via CLI overrides:

$ torchrun --nproc-per-node=8 \
    examples/diffusion/generate/generate.py \
    -c examples/diffusion/generate/configs/generate_wan.yaml \
    --distributed.backend nccl \
    --distributed.parallel_scheme.transformer.tp_size 8

With a fine-tuned checkpoint:

$ python examples/diffusion/generate/generate.py \
    -c examples/diffusion/generate/configs/generate_wan.yaml \
    --model.checkpoint ./checkpoints/step_1000 \
    --inference.prompts '["A dog running on a beach"]'

FLUX image generation:

$ python examples/diffusion/generate/generate.py \
    -c examples/diffusion/generate/configs/generate_flux.yaml

HunyuanVideo:

$ python examples/diffusion/generate/generate.py \
    -c examples/diffusion/generate/configs/generate_hunyuan.yaml

Available Generation Configs

| Config | Model | Output | GPUs |
|---|---|---|---|
| generate_wan.yaml | Wan 2.1 1.3B | Video | 1 |
| generate_flux.yaml | FLUX.1-dev | Image | 1 |
| generate_hunyuan.yaml | HunyuanVideo | Video | 1 |

You can use --model.checkpoint ./checkpoints/LATEST to automatically load the most recent checkpoint.
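
If you would rather pass an explicit path, for example to log exactly which checkpoint produced a given output, a short script can pick the newest one for you. The sketch below assumes checkpoint directories are named step_<N>, as in the ./checkpoints/step_1000 example above; the actual directory layout may differ, and the function name is hypothetical.

```python
import re
from pathlib import Path

_STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_step_checkpoint(checkpoint_dir):
    """Return the step_<N> subdirectory with the largest N, or None."""
    best_step, best_path = -1, None
    for child in Path(checkpoint_dir).iterdir():
        match = _STEP_DIR.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))  # compare numerically, not as text
            if step > best_step:
                best_step, best_path = step, child
    return best_path
```

Comparing the step numbers as integers matters here: a plain string sort would rank step_200 after step_1000.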

Hardware Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU | A100 40GB | A100 80GB / H100 |
| GPUs | 4 | 8 |
| RAM | 128 GB | 256 GB+ |
| Storage | 500 GB SSD | 2 TB NVMe |