`nemo_automodel.components.datasets.diffusion.collate_fns`

Collate functions and dataloader builders for multiresolution diffusion training.

Supports both image and video pipelines by producing batches in the format that FlowMatchingPipeline expects.

Module Contents

Functions

`collate_fn_production`	Production collate function with verification.
`collate_fn_text_to_image`	Text-to-image collate function that transforms multiresolution batch output to match the format expected by FlowMatchingPipeline.
`_build_multiresolution_dataloader_core`	Internal helper: create sampler + DataLoader from dataset and collate fn.
`build_text_to_image_multiresolution_dataloader`	Build a text-to-image multiresolution dataloader for TrainDiffusionRecipe.
`collate_fn_video`	Video-compatible collate function for multiresolution video training.
`build_video_multiresolution_dataloader`	Build a multiresolution video dataloader for TrainDiffusionRecipe.

Data

logger

API

nemo_automodel.components.datasets.diffusion.collate_fns.logger: ‘getLogger(…)’

nemo_automodel.components.datasets.diffusion.collate_fns.collate_fn_production(batch: List[Dict]) → Dict

Production collate function with verification.

nemo_automodel.components.datasets.diffusion.collate_fns.collate_fn_text_to_image(batch: List[Dict]) → Dict

Text-to-image collate function that transforms multiresolution batch output to match the format expected by FlowMatchingPipeline.

Parameters:

batch – List of samples from TextToImageDataset

Returns:

Dict compatible with FlowMatchingPipeline.step()

nemo_automodel.components.datasets.diffusion.collate_fns._build_multiresolution_dataloader_core(*, dataset, collate_fn: Callable, batch_size: int, dp_rank: int, dp_world_size: int, base_resolution: Tuple[int, int] = (512, 512), drop_last: bool = True, shuffle: bool = True, dynamic_batch_size: bool = False, num_workers: int = 4, pin_memory: bool = True, prefetch_factor: int = 2) → Tuple[torchdata.stateful_dataloader.StatefulDataLoader, nemo_automodel.components.datasets.diffusion.sampler.SequentialBucketSampler]

Internal helper: create sampler + DataLoader from dataset and collate fn.

nemo_automodel.components.datasets.diffusion.collate_fns.build_text_to_image_multiresolution_dataloader(*, cache_dir: str, train_text_encoder: bool = False, batch_size: int = 1, dp_rank: int = 0, dp_world_size: int = 1, base_resolution: Tuple[int, int] = (256, 256), drop_last: bool = True, shuffle: bool = True, dynamic_batch_size: bool = False, num_workers: int = 4, pin_memory: bool = True, prefetch_factor: int = 2) → Tuple[torchdata.stateful_dataloader.StatefulDataLoader, nemo_automodel.components.datasets.diffusion.sampler.SequentialBucketSampler]

Build a text-to-image multiresolution dataloader for TrainDiffusionRecipe.

This wraps the existing TextToImageDataset and SequentialBucketSampler with a text-to-image collate function.

Parameters:

cache_dir – Directory containing preprocessed cache (metadata.json, shards, and resolution subdirs)
train_text_encoder – If True, returns tokens instead of embeddings
batch_size – Batch size per GPU
dp_rank – Data parallel rank
dp_world_size – Data parallel world size
base_resolution – Base resolution for dynamic batch sizing
drop_last – Drop incomplete batches
shuffle – Shuffle data
dynamic_batch_size – Scale batch size by resolution
num_workers – DataLoader workers
pin_memory – Pin memory for GPU transfer
prefetch_factor – Prefetch batches per worker

Returns:

Tuple of (DataLoader, SequentialBucketSampler)
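
The parameter list says only that dynamic_batch_size scales the batch size by resolution relative to base_resolution; it does not give the scaling rule. The sketch below shows one plausible rule, assuming the goal is to keep the number of pixels per batch roughly constant. The function name `scaled_batch_size` and the inverse-area formula are illustrative assumptions, not the library's implementation.

```python
from typing import Tuple


def scaled_batch_size(
    batch_size: int,
    resolution: Tuple[int, int],
    base_resolution: Tuple[int, int] = (256, 256),
) -> int:
    """Hypothetical rule: scale batch size inversely with pixel area."""
    base_pixels = base_resolution[0] * base_resolution[1]
    pixels = resolution[0] * resolution[1]
    # Integer scaling, never dropping below one sample per batch.
    return max(1, (batch_size * base_pixels) // pixels)


print(scaled_batch_size(8, (256, 256)))    # same area as the base resolution
print(scaled_batch_size(8, (512, 512)))    # 4x the base area
print(scaled_batch_size(8, (1024, 1024)))  # 16x the base area
```

Under this assumed rule, a bucket at the base resolution keeps the configured batch_size, and larger buckets get proportionally fewer samples per batch, which would keep per-step memory roughly level across buckets.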

nemo_automodel.components.datasets.diffusion.collate_fns.collate_fn_video(batch: List[Dict], model_type: str = 'wan') → Dict

Video-compatible collate function for multiresolution video training.

Concatenates video_latents (5D) and text_embeddings (3D) along the batch dim, matching the format expected by FlowMatchingPipeline with SimpleAdapter.

Parameters:

batch – List of samples from TextToVideoDataset
model_type – Model type for model-specific field handling

Returns:

Dict compatible with FlowMatchingPipeline.step()
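
To make the concatenation step concrete, the sketch below computes only the output shapes, using plain tuples instead of tensors so that it runs without any dependencies. It assumes each sample already carries a leading batch dimension of 1, which is what concatenating (rather than stacking) 5D latents implies. The per-sample shapes and the helper names `concat_shape` and `collated_shapes` are hypothetical.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def concat_shape(shapes: List[Shape], dim: int = 0) -> Shape:
    """Shape produced by concatenating arrays of the given shapes along dim."""
    first = shapes[0]
    for shape in shapes[1:]:
        # Every dimension except the concatenated one must agree.
        if len(shape) != len(first) or any(
            a != b for i, (a, b) in enumerate(zip(shape, first)) if i != dim
        ):
            raise ValueError(f"incompatible shapes: {first} vs {shape}")
    out = list(first)
    out[dim] = sum(shape[dim] for shape in shapes)
    return tuple(out)


def collated_shapes(batch: List[Dict[str, Shape]]) -> Dict[str, Shape]:
    """Shapes of the two concatenated fields named in the docstring."""
    return {
        key: concat_shape([sample[key] for sample in batch], dim=0)
        for key in ("video_latents", "text_embeddings")
    }


# Hypothetical per-sample shapes: (1, C, T, H, W) latents, (1, L, D) embeddings.
sample = {"video_latents": (1, 16, 21, 60, 104), "text_embeddings": (1, 512, 4096)}
batch_shapes = collated_shapes([sample, sample, sample])
print(batch_shapes)
```

The shape check also shows why samples need to be grouped by resolution before collation: latents with different spatial sizes cannot be concatenated along the batch dimension.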

nemo_automodel.components.datasets.diffusion.collate_fns.build_video_multiresolution_dataloader(*, cache_dir: str, model_type: str = 'wan', device: str = 'cpu', batch_size: int = 1, dp_rank: int = 0, dp_world_size: int = 1, base_resolution: Tuple[int, int] = (512, 512), drop_last: bool = True, shuffle: bool = True, dynamic_batch_size: bool = False, num_workers: int = 2, pin_memory: bool = True, prefetch_factor: int = 2) → Tuple[torchdata.stateful_dataloader.StatefulDataLoader, nemo_automodel.components.datasets.diffusion.sampler.SequentialBucketSampler]

Build a multiresolution video dataloader for TrainDiffusionRecipe.

Uses TextToVideoDataset with SequentialBucketSampler for bucket-based multiresolution video training (e.g. Wan, Hunyuan).

Parameters:

cache_dir – Directory containing preprocessed cache (metadata.json + shards + WxH/*.meta)
model_type – Model type (“wan”, “hunyuan”, etc.)
device – Device to load tensors to
batch_size – Batch size per GPU
dp_rank – Data parallel rank
dp_world_size – Data parallel world size
base_resolution – Base resolution for dynamic batch sizing
drop_last – Drop incomplete batches
shuffle – Shuffle data
dynamic_batch_size – Scale batch size by resolution
num_workers – DataLoader workers
pin_memory – Pin memory for GPU transfer
prefetch_factor – Prefetch batches per worker

Returns:

Tuple of (DataLoader, SequentialBucketSampler)
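
As a rough illustration of how batch_size, dp_rank, dp_world_size, drop_last and shuffle might interact in bucket-based sampling, the sketch below groups sample indices by resolution, shards each bucket across data-parallel ranks, and cuts the shard into batches. It is a simplified stand-in written for this page, not the SequentialBucketSampler implementation; the function `bucket_batches`, the strided sharding, and the `seed` argument are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Resolution = Tuple[int, int]


def bucket_batches(
    resolutions: List[Resolution],
    batch_size: int,
    dp_rank: int,
    dp_world_size: int,
    drop_last: bool = True,
    shuffle: bool = True,
    seed: int = 0,
) -> List[List[int]]:
    """Return this rank's batches; each batch holds indices of one resolution."""
    buckets: Dict[Resolution, List[int]] = defaultdict(list)
    for idx, res in enumerate(resolutions):
        buckets[res].append(idx)

    # Same seed on every rank, so all ranks agree on the shuffled order.
    rng = random.Random(seed)
    batches: List[List[int]] = []
    for res in sorted(buckets):  # visit buckets in a fixed order
        indices = buckets[res]
        if shuffle:
            rng.shuffle(indices)
        # Strided sharding: rank r takes every dp_world_size-th index.
        shard = indices[dp_rank::dp_world_size]
        for start in range(0, len(shard), batch_size):
            chunk = shard[start:start + batch_size]
            if len(chunk) < batch_size and drop_last:
                continue  # skip the incomplete batch
            batches.append(chunk)
    return batches


resolutions = [(512, 512)] * 8 + [(768, 512)] * 5
rank0 = bucket_batches(resolutions, 2, dp_rank=0, dp_world_size=2, shuffle=False)
rank1 = bucket_batches(resolutions, 2, dp_rank=1, dp_world_size=2, shuffle=False)
print(rank0)
print(rank1)
```

In this toy run both ranks end up with three batches because drop_last discards the leftover sample on rank 0. Strided sharding does not guarantee equal batch counts in general, so a real sampler would need to balance or pad per-rank work to keep ranks in step.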
