# Diffusion Dataset Preparation

## Introduction

Diffusion model training in NeMo AutoModel requires pre-encoded `.meta` files rather than raw images or videos. During preprocessing, a VAE encodes visual data into latent representations and a text encoder produces text embeddings. These are saved as `.meta` files so that training operates entirely in latent space, avoiding the need to load heavy encoder models during training.
## Input Data Format

### Images

Place your images in a directory. Supported formats: `jpg`, `jpeg`, `png`, `webp`, `bmp`.
Captions can be provided in several formats:
- **Sidecar JSON** (default for images): a `.json` file alongside each image with a caption field
- **JSONL**: a `.jsonl` file with `internvl` or `usr` caption fields
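For illustration, a sidecar caption pair can be written and read back with a few lines of Python. This is only a sketch: the field name `caption` and the helper names are assumptions, not part of the tool.

```python
import json
from pathlib import Path


def write_sidecar_caption(image_path, caption):
    """Write a sidecar .json caption next to an image file (illustrative)."""
    sidecar = Path(image_path).with_suffix(".json")
    # Field name "caption" is an assumption; check your processor's expectations.
    sidecar.write_text(json.dumps({"caption": caption}))
    return sidecar


def read_sidecar_caption(image_path):
    """Return the caption from the sidecar .json, or None if absent."""
    sidecar = Path(image_path).with_suffix(".json")
    if sidecar.exists():
        return json.loads(sidecar.read_text()).get("caption")
    return None
```
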
### Videos

Place your videos in a directory. Supported formats: `mp4`, `avi`, `mov`, `mkv`, `webm`.
Captions can be provided in several formats:
- **Sidecar JSON** (`--caption_format sidecar`): a `.json` file alongside each video
- **meta.json** (`--caption_format meta_json`): a single `meta.json` manifest in the video directory
- **JSONL** (`--caption_format jsonl`): a `.jsonl` file with captions
If no caption is found for a sample, the filename (with underscores replaced by spaces) is used as a fallback.
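The fallback rule is simple enough to sketch directly:

```python
from pathlib import Path


def fallback_caption(media_path):
    """Derive a caption from the filename: drop the extension and
    replace underscores with spaces."""
    return Path(media_path).stem.replace("_", " ")
```
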
## Preprocessing

NeMo AutoModel includes a unified preprocessing tool at `tools/diffusion/preprocessing_multiprocess.py` that encodes raw images and videos into cache files compatible with the multiresolution dataloader. It uses model-specific processors from `tools/diffusion/processors/` to handle VAE encoding, text embedding, and cache data formatting for each supported model.
The tool automatically distributes work across all available GPUs using multiprocessing, with one worker per GPU.
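Conceptually, the per-GPU split is a round-robin assignment of input files to workers. This is a simplified sketch of the idea, not the tool's actual scheduling code:

```python
def shard_across_gpus(files, num_gpus):
    """Round-robin assignment of input files, one shard per GPU worker."""
    shards = [[] for _ in range(num_gpus)]
    for i, f in enumerate(files):
        shards[i % num_gpus].append(f)
    return shards
```
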
### Available Processors

| Processor | Media Type | Model |
|---|---|---|
| `flux` | Image | FLUX.1-dev |
| `wan` | Video | Wan 2.1 |
| `hunyuan` | Video | HunyuanVideo 1.5 |
You can list all registered processors with:
```shell
python -m tools.diffusion.preprocessing_multiprocess --list_processors
```
### Image Preprocessing (FLUX)

```shell
python -m tools.diffusion.preprocessing_multiprocess image \
    --image_dir /path/to/images \
    --output_dir /path/to/cache \
    --processor flux \
    --resolution_preset 512p
```
### Video Preprocessing (Wan 2.1)
Video mode (encodes the full video as a single sample, recommended for training):
```shell
python -m tools.diffusion.preprocessing_multiprocess video \
    --video_dir /path/to/videos \
    --output_dir /path/to/cache \
    --processor wan \
    --resolution_preset 512p \
    --caption_format sidecar
```
Frames mode (extracts evenly-spaced frames, each becomes a separate sample):
```shell
python -m tools.diffusion.preprocessing_multiprocess video \
    --video_dir /path/to/videos \
    --output_dir /path/to/cache \
    --processor wan \
    --mode frames \
    --num_frames 40 \
    --resolution_preset 512p
```
### Video Preprocessing (HunyuanVideo)

```shell
python -m tools.diffusion.preprocessing_multiprocess video \
    --video_dir /path/to/videos \
    --output_dir /path/to/cache \
    --processor hunyuan \
    --target_frames 121 \
    --caption_format meta_json
```
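The `--target_frames 121` value satisfies HunyuanVideo's 4n+1 frame-count constraint (121 = 4 x 30 + 1). A helper that snaps an arbitrary count down to the nearest valid value might look like this (illustrative, not part of the tool):

```python
def snap_to_4n_plus_1(num_frames):
    """Largest frame count <= num_frames satisfying the 4n+1 constraint."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    return 4 * ((num_frames - 1) // 4) + 1
```
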
### Key Arguments

Common arguments:

| Argument | Description |
|---|---|
| `--processor` | Processor name (`flux`, `wan`, or `hunyuan`) |
| | HuggingFace model name (uses processor default if omitted) |
| `--output_dir` | Output directory for cached data |
| | Number of samples per metadata shard (default: 10000) |
Image-specific arguments:

| Argument | Description |
|---|---|
| `--image_dir` | Input image directory |
| `--resolution_preset` | Resolution preset: `256p`, `512p`, `768p`, `1024p`, `1536p` |
| `--max_pixels` | Custom pixel budget (alternative to preset) |
| | Caption field in JSONL files (`internvl` or `usr`) |
| | Verify latents can be decoded back |
Video-specific arguments:

| Argument | Description |
|---|---|
| `--video_dir` | Input video directory |
| `--mode` | Processing mode: `video` or `frames` |
| `--num_frames` | Number of frames to extract in `frames` mode |
| `--target_frames` | Target frame count (for example, 121 for the HunyuanVideo 4n+1 constraint) |
| `--resolution_preset` | Resolution preset for bucketing |
| `--height`/`--width` | Explicit target size (disables bucketing) |
| | Interpolation method |
| | Enable/disable center cropping (default: enabled) |
| `--caption_format` | Caption source: `sidecar`, `meta_json`, or `jsonl` |
| | Field name for captions |
| | Output format (`.meta` or `.pt`) |
## Output Format
The preprocessing tool produces a cache directory organized by resolution bucket:
```
/path/to/cache/
├── 512x512/
│   ├── <hash1>.meta
│   ├── <hash2>.meta
│   └── ...
├── 832x480/
│   └── ...
├── metadata.json             # Global config (processor, model, total items)
└── metadata_shard_0000.json  # Per-sample metadata (paths, resolutions, captions)
```
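A minimal sketch of walking this layout, grouping cache files by their resolution-bucket subdirectory (a hypothetical helper, not part of the tool):

```python
from pathlib import Path


def scan_cache(cache_dir):
    """Group cached .meta files by their resolution-bucket subdirectory."""
    buckets = {}
    for bucket_dir in sorted(Path(cache_dir).iterdir()):
        # Top-level files such as metadata.json are skipped here.
        if bucket_dir.is_dir():
            buckets[bucket_dir.name] = sorted(bucket_dir.glob("*.meta"))
    return buckets
```
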
Each cache file (`.meta` or `.pt`) contains:

- **Encoded latents**: VAE latent representations of the image or video
- **Text embeddings**: pre-computed from the model's text encoder
- **First frame**: reference image for image-to-video conditioning (video mode only)
- **Image embeddings**: for models that support i2v conditioning (video mode only)
- **Metadata**: original and bucket resolutions, caption, source path
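The exact on-disk serialization is processor-specific, but the record shape can be illustrated with a pickle stand-in. Toy lists replace real tensors, and the key names here are assumptions:

```python
import pickle
from pathlib import Path


def save_cache_record(path, latents, text_emb, caption, src):
    """Serialize one preprocessed sample (toy stand-in for a real .meta file)."""
    record = {
        "latents": latents,           # VAE latents (toy list here; tensors in practice)
        "text_embeddings": text_emb,  # pre-computed text-encoder output
        "metadata": {"caption": caption, "source_path": src},
    }
    Path(path).write_bytes(pickle.dumps(record))


def load_cache_record(path):
    """Deserialize a sample saved by save_cache_record."""
    return pickle.loads(Path(path).read_bytes())
```
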
## Multiresolution Bucketing
NeMo AutoModel supports multiresolution training through bucketed sampling. This groups samples by their spatial resolution so that each batch contains samples of the same size, avoiding padding waste.
During preprocessing, the `--resolution_preset` argument controls the pixel budget used for bucketing. Available presets: `256p`, `512p`, `768p`, `1024p`, `1536p`. Alternatively, use `--max_pixels` for a custom pixel budget, or `--height`/`--width` to disable bucketing and use a fixed resolution.
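One common way a pixel budget maps an arbitrary input to a bucket is to preserve the aspect ratio while staying under the budget and rounding to a VAE-friendly multiple. The actual bucket tables are defined by the processors; this is only a sketch:

```python
import math


def bucket_resolution(orig_h, orig_w, max_pixels, multiple=16):
    """Pick a bucket (h, w) that keeps the aspect ratio, stays under the
    pixel budget, and is divisible by `multiple` (a common VAE requirement)."""
    aspect = orig_w / orig_h
    w = int(math.sqrt(max_pixels * aspect)) // multiple * multiple
    h = int(math.sqrt(max_pixels / aspect)) // multiple * multiple
    return h, w
```
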
During training, the dataloader uses these key configuration parameters:
- `base_resolution`: the target resolution used for bucketing (for example, `[512, 512]`)
- The `SequentialBucketSampler` groups samples by resolution bucket
- `dynamic_batch_size`: when `true`, adjusts the batch size per resolution bucket to maintain constant memory usage
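A plausible scaling rule for `dynamic_batch_size` is to shrink the batch in proportion to per-sample pixel count, so memory use stays roughly constant across buckets. This is a sketch of the idea, not the dataloader's exact formula:

```python
def dynamic_batch(base_batch, base_hw, bucket_hw):
    """Scale batch size inversely with per-sample pixel count."""
    base_pixels = base_hw[0] * base_hw[1]
    bucket_pixels = bucket_hw[0] * bucket_hw[1]
    return max(1, int(base_batch * base_pixels / bucket_pixels))
```
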
## YAML Configuration

### Video Dataloader (Wan 2.1 / HunyuanVideo)

Used for text-to-video models. Set `model_type` to match your model (`wan` or `hunyuan`):
```yaml
data:
  dataloader:
    _target_: nemo_automodel.components.datasets.diffusion.build_video_multiresolution_dataloader
    cache_dir: /path/to/processed_meta
    model_type: wan  # or "hunyuan"
    base_resolution: [512, 512]
    dynamic_batch_size: false
    shuffle: true
    drop_last: false
    num_workers: 0
```
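The `_target_` key names the builder function by its dotted import path. Config frameworks typically resolve such paths roughly like this (a generic sketch, not NeMo AutoModel's actual config machinery; `posixpath.join` below is just a stand-in target):

```python
import importlib


def resolve_target(dotted_path):
    """Resolve a `_target_`-style dotted path to the callable it names."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)
```
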
### Image Dataloader (FLUX)
Used for text-to-image models:
```yaml
data:
  dataloader:
    _target_: nemo_automodel.components.datasets.diffusion.build_text_to_image_multiresolution_dataloader
    cache_dir: /path/to/processed_meta
    train_text_encoder: false
    num_workers: 0
    base_resolution: [512, 512]
    dynamic_batch_size: false
    shuffle: true
    drop_last: false
```
> **Tip**
>
> Supported image resolutions for FLUX include `[256, 256]`, `[512, 512]`, and `[1024, 1024]`. While a 1:1 aspect ratio is currently used as a proxy for the closest image size, the implementation is designed to support multiple aspect ratios.
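The 1:1 proxy described above amounts to picking the supported square size closest in pixel count to the input. An illustrative sketch, not the actual implementation:

```python
def closest_square_resolution(h, w, sizes=(256, 512, 1024)):
    """Pick the supported square resolution whose pixel count is closest
    to the input image's (the 1:1 proxy described above)."""
    best = min(sizes, key=lambda s: abs(h * w - s * s))
    return best, best
```
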