Diffusion Dataset Preparation
Introduction
Diffusion model training in NeMo AutoModel requires pre-encoded .meta files rather than raw images or videos. During preprocessing, a VAE encodes visual data into latent representations and a text encoder produces text embeddings. These are saved as .meta files so that training operates entirely in latent space, avoiding the need to load heavy encoder models during training.
Input Data Format
Images
Place your images in a directory. Supported formats: jpg, jpeg, png, webp, bmp.
Captions can be provided in several formats:
- Sidecar JSON (default for images): A .json file alongside each image with a caption field
- JSONL: A .jsonl file with internvl or usr caption fields
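As an illustration of the sidecar layout, the sketch below writes one .json file next to each image. It assumes the caption is stored under a top-level key named "caption" and that the sidecar shares the image's file stem. The helper name write_sidecar_captions is hypothetical and not part of NeMo AutoModel.

```python
import json
from pathlib import Path

# Image extensions listed as supported in this section.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def write_sidecar_captions(image_dir, captions):
    """Write <stem>.json next to each image named in `captions`.

    `captions` maps image file names to caption strings.
    Returns the list of sidecar paths that were written.
    """
    image_dir = Path(image_dir)
    written = []
    for name, caption in sorted(captions.items()):
        image_path = image_dir / name
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip files that are not in a supported image format
        sidecar = image_path.with_suffix(".json")
        # Assumption: the caption lives under a top-level "caption" key.
        sidecar.write_text(json.dumps({"caption": caption}), encoding="utf-8")
        written.append(sidecar)
    return written
```

For example, write_sidecar_captions("data/images", {"cat.png": "a cat on a sofa"}) would create data/images/cat.json.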
Videos
Place your videos in a directory. Supported formats: mp4, avi, mov, mkv, webm.
Captions can be provided in several formats:
- Sidecar JSON (--caption_format sidecar): A .json file alongside each video
- meta.json (--caption_format meta_json): A single meta.json manifest in the video directory
- JSONL (--caption_format jsonl): A .jsonl file with captions
If no caption is found for a sample, the filename (with underscores replaced by spaces) is used as a fallback.
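The fallback rule can be sketched in a few lines. This is an approximation rather than the tool's actual code: it assumes the file extension is dropped before underscores are replaced, and the function name fallback_caption is hypothetical.

```python
from pathlib import Path


def fallback_caption(path):
    """Derive a caption from a file name by replacing underscores with spaces.

    Assumption: the extension is stripped first (Path.stem).
    """
    return Path(path).stem.replace("_", " ")
```

Under these assumptions, a file named a_red_car.mp4 with no caption would be captioned "a red car".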
Preprocessing
NeMo AutoModel includes a unified preprocessing tool at tools/diffusion/preprocessing_multiprocess.py that encodes raw images and videos into cache files compatible with the multiresolution dataloader. It uses model-specific processors from tools/diffusion/processors/ to handle VAE encoding, text embedding, and cache data formatting for each supported model.
The tool automatically distributes work across all available GPUs using multiprocessing, with one worker per GPU.
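To give a sense of how work might be divided with one worker per GPU, the sketch below splits a file list into per-worker shards. The source does not describe the tool's partitioning strategy, so the round-robin split and the name shard_round_robin are assumptions made for illustration.

```python
def shard_round_robin(items, num_workers):
    """Split `items` into `num_workers` shards, one per GPU worker.

    Worker i receives items i, i + num_workers, i + 2 * num_workers, ...
    This round-robin policy is an assumption, not the tool's documented behavior.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [items[i::num_workers] for i in range(num_workers)]
```

Each shard would then be handed to a separate process bound to its own GPU, so shard sizes differ by at most one item.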
Available Processors
You can list all registered processors with:
Image Preprocessing (FLUX)
Video Preprocessing (Wan 2.1)
Video mode (encodes the full video as a single sample, recommended for training):
Frames mode (extracts evenly-spaced frames, each becomes a separate sample):
Video Preprocessing (HunyuanVideo)
Key Arguments
Common arguments:
Image-specific arguments:
Video-specific arguments:
Output Format
The preprocessing tool produces a cache directory organized by resolution bucket:
Each cache file (.meta or .pt) contains:
- Encoded latents: VAE latent representations of the image or video
- Text embeddings: Pre-computed from the model's text encoder
- First frame: Reference image for image-to-video conditioning (video mode only)
- Image embeddings: For models that support i2v conditioning (video mode only)
- Metadata: Original and bucket resolutions, caption, source path
Multiresolution Bucketing
NeMo AutoModel supports multiresolution training through bucketed sampling. This groups samples by their spatial resolution so that each batch contains samples of the same size, avoiding padding waste.
During preprocessing, the --resolution_preset argument controls the pixel budget used for bucketing. Available presets: 256p, 512p, 768p, 1024p, 1536p. Alternatively, use --max_pixels for a custom pixel budget, or --height/--width to disable bucketing and use a fixed resolution.
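The idea of a pixel budget can be illustrated with a small sketch that rescales a resolution so its area fits under the budget while roughly preserving aspect ratio. This is not the tool's bucketing algorithm. The rounding to a multiple of 16, the decision to scale smaller inputs up to the budget, and the name fit_to_pixel_budget are all assumptions.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels, multiple=16):
    """Return a (width, height) whose area does not exceed `max_pixels`.

    The aspect ratio is preserved up to rounding. Both sides are rounded
    down to a multiple of `multiple` (assumed alignment), with a floor of
    one `multiple` per side, so the budget holds whenever
    max_pixels >= multiple * multiple.
    """
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(multiple, int(width * scale) // multiple * multiple)
    new_h = max(multiple, int(height * scale) // multiple * multiple)
    return new_w, new_h
```

With a budget of 512 * 512 pixels, a 1024x1024 input maps to (512, 512), while a 1920x1080 input maps to a wider, shorter resolution with a similar total area.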
During training, the dataloader uses these key configuration parameters:
- base_resolution: The target resolution used for bucketing (for example, [512, 512])
- The SequentialBucketSampler groups samples by resolution bucket
- dynamic_batch_size: When true, adjusts batch size per resolution bucket to maintain constant memory usage
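The interaction between these parameters can be sketched as follows. This is a simplified stand-in, not the SequentialBucketSampler implementation: it assumes the batch size scales inversely with pixel count relative to base_resolution, and the function name bucketed_batches and the base_batch_size parameter are hypothetical.

```python
from collections import defaultdict


def bucketed_batches(resolutions, base_resolution, base_batch_size,
                     dynamic_batch_size=True):
    """Group sample indices by resolution and cut each group into batches.

    `resolutions` holds one (height, width) pair per sample. When
    `dynamic_batch_size` is true, larger resolutions get proportionally
    smaller batches so pixels per batch stay roughly constant (assumed rule).
    """
    base_pixels = base_resolution[0] * base_resolution[1]
    buckets = defaultdict(list)
    for index, resolution in enumerate(resolutions):
        buckets[tuple(resolution)].append(index)

    batches = []
    for resolution in sorted(buckets):
        indices = buckets[resolution]
        if dynamic_batch_size:
            pixels = resolution[0] * resolution[1]
            size = max(1, base_batch_size * base_pixels // pixels)
        else:
            size = base_batch_size
        # Every batch contains samples from a single resolution bucket.
        for start in range(0, len(indices), size):
            batches.append(indices[start:start + size])
    return batches
```

Because each batch is drawn from one bucket, all samples in it share the same latent shape and can be stacked without padding.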
YAML Configuration
Video Dataloader (Wan 2.1 / HunyuanVideo)
Used for text-to-video models. Set model_type to match your model (wan or hunyuan):
Image Dataloader (FLUX)
Used for text-to-image models:
Supported image resolutions for FLUX include [256, 256], [512, 512], and [1024, 1024]. While a 1:1 aspect ratio is currently used as a proxy for the closest image size, the implementation is designed to support multiple aspect ratios.