nemo_automodel.components.datasets.diffusion.collate_fns

Collate functions and dataloader builders for multiresolution diffusion training.

Supports both image and video pipelines by producing batches in the format that FlowMatchingPipeline expects.

Module Contents

Functions

collate_fn_production

Production collate function with verification.

collate_fn_text_to_image

Text-to-image collate function that transforms multiresolution batch output into the batch format expected by FlowMatchingPipeline.

_build_multiresolution_dataloader_core

Internal helper: create sampler + DataLoader from dataset and collate fn.

build_text_to_image_multiresolution_dataloader

Build a text-to-image multiresolution dataloader for TrainDiffusionRecipe.

collate_fn_video

Video-compatible collate function for multiresolution video training.

build_video_multiresolution_dataloader

Build a multiresolution video dataloader for TrainDiffusionRecipe.

Data

logger

API

nemo_automodel.components.datasets.diffusion.collate_fns.logger

'getLogger(…)'

nemo_automodel.components.datasets.diffusion.collate_fns.collate_fn_production(batch: List[Dict]) -> Dict

Production collate function with verification.

nemo_automodel.components.datasets.diffusion.collate_fns.collate_fn_text_to_image(
batch: List[Dict],
) -> Dict

Text-to-image collate function that transforms multiresolution batch output into the batch format expected by FlowMatchingPipeline.

Parameters:

batch – List of samples from TextToImageDataset

Returns:

Dict compatible with FlowMatchingPipeline.step()

nemo_automodel.components.datasets.diffusion.collate_fns._build_multiresolution_dataloader_core(
*,
dataset,
collate_fn: Callable,
batch_size: int,
dp_rank: int,
dp_world_size: int,
base_resolution: Tuple[int, int] = (512, 512),
drop_last: bool = True,
shuffle: bool = True,
dynamic_batch_size: bool = False,
num_workers: int = 4,
pin_memory: bool = True,
prefetch_factor: int = 2,
) -> Tuple[torchdata.stateful_dataloader.StatefulDataLoader, nemo_automodel.components.datasets.diffusion.sampler.SequentialBucketSampler]

Internal helper: create sampler + DataLoader from dataset and collate fn.
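The `dp_rank` and `dp_world_size` arguments indicate that the sampler is aware of data parallelism, so each rank presumably iterates over its own share of the data. The sketch below shows one common way to split indices across ranks (strided sharding). `shard_for_rank` is a hypothetical helper written for illustration; it is not part of this module, and the real sampler may partition differently.

```python
from typing import List, Sequence


def shard_for_rank(
    indices: Sequence[int],
    dp_rank: int,
    dp_world_size: int,
    drop_last: bool = True,
) -> List[int]:
    """Return the subset of `indices` owned by one data-parallel rank.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if not 0 <= dp_rank < dp_world_size:
        raise ValueError("dp_rank must be in [0, dp_world_size)")
    indices = list(indices)
    if drop_last:
        # Trim the tail so every rank receives the same number of samples.
        usable = len(indices) - len(indices) % dp_world_size
        indices = indices[:usable]
    # Strided selection: rank r takes positions r, r + W, r + 2W, ...
    return indices[dp_rank::dp_world_size]


if __name__ == "__main__":
    for rank in range(3):
        print(rank, shard_for_rank(range(10), rank, 3))
```

With `drop_last=True` every rank ends up with the same number of samples, which keeps ranks in step during synchronized training.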

nemo_automodel.components.datasets.diffusion.collate_fns.build_text_to_image_multiresolution_dataloader(
*,
cache_dir: str,
train_text_encoder: bool = False,
batch_size: int = 1,
dp_rank: int = 0,
dp_world_size: int = 1,
base_resolution: Tuple[int, int] = (256, 256),
drop_last: bool = True,
shuffle: bool = True,
dynamic_batch_size: bool = False,
num_workers: int = 4,
pin_memory: bool = True,
prefetch_factor: int = 2,
) -> Tuple[torchdata.stateful_dataloader.StatefulDataLoader, nemo_automodel.components.datasets.diffusion.sampler.SequentialBucketSampler]

Build a text-to-image multiresolution dataloader for TrainDiffusionRecipe.

This wraps the existing TextToImageDataset and SequentialBucketSampler with a text-to-image collate function.

Parameters:
  • cache_dir – Directory containing preprocessed cache (metadata.json, shards, and resolution subdirs)

  • train_text_encoder – If True, returns tokens instead of embeddings

  • batch_size – Batch size per GPU

  • dp_rank – Data parallel rank

  • dp_world_size – Data parallel world size

  • base_resolution – Base resolution for dynamic batch sizing

  • drop_last – Drop incomplete batches

  • shuffle – Shuffle data

  • dynamic_batch_size – Scale batch size by resolution

  • num_workers – DataLoader workers

  • pin_memory – Pin memory for GPU transfer

  • prefetch_factor – Number of batches prefetched per worker

Returns:

Tuple of (DataLoader, SequentialBucketSampler)
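A possible way to call this builder is sketched below. Only parameter names from the signature above are used; the helper names `t2i_loader_kwargs` and `build_loader`, the chosen values, and the cache path are illustrative assumptions. The library import is deferred into `build_loader` so that the argument-building part can run without the package installed.

```python
from typing import Any, Dict


def t2i_loader_kwargs(cache_dir: str, dp_rank: int, dp_world_size: int) -> Dict[str, Any]:
    """Collect keyword arguments for the text-to-image builder.

    Hypothetical helper; the values are examples, not recommended settings.
    """
    return {
        "cache_dir": cache_dir,
        "train_text_encoder": False,  # keep precomputed embeddings rather than tokens
        "batch_size": 4,
        "dp_rank": dp_rank,
        "dp_world_size": dp_world_size,
        "base_resolution": (256, 256),
        "dynamic_batch_size": True,
        "num_workers": 4,
    }


def build_loader(kwargs: Dict[str, Any]):
    """Call the builder; imported lazily so this file loads without the library."""
    from nemo_automodel.components.datasets.diffusion.collate_fns import (
        build_text_to_image_multiresolution_dataloader,
    )

    # The builder is keyword-only and returns a (dataloader, sampler) pair.
    dataloader, sampler = build_text_to_image_multiresolution_dataloader(**kwargs)
    return dataloader, sampler


if __name__ == "__main__":
    # "/data/t2i_cache" is a placeholder path, not a real dataset location.
    kwargs = t2i_loader_kwargs("/data/t2i_cache", dp_rank=0, dp_world_size=1)
    print(sorted(kwargs))
```

Because the builder returns the sampler alongside the dataloader, calling code can keep a handle on both.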

nemo_automodel.components.datasets.diffusion.collate_fns.collate_fn_video(
batch: List[Dict],
model_type: str = 'wan',
) -> Dict

Video-compatible collate function for multiresolution video training.

Concatenates video_latents (5D) and text_embeddings (3D) along the batch dim, matching the format expected by FlowMatchingPipeline with SimpleAdapter.

Parameters:
  • batch – List of samples from TextToVideoDataset

  • model_type – Model type for model-specific field handling

Returns:

Dict compatible with FlowMatchingPipeline.step()
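To make the shape bookkeeping concrete, the sketch below computes the shape that results from concatenating tensors along the batch dimension. It assumes that each sample already carries a leading batch dimension of size 1, which the word "concatenates" suggests but the description does not state. `concat_shape_along_batch` is a hypothetical helper, and the sizes in the example are illustrative, not taken from any model.

```python
from typing import Sequence, Tuple


def concat_shape_along_batch(shapes: Sequence[Tuple[int, ...]]) -> Tuple[int, ...]:
    """Shape obtained by concatenating tensors with `shapes` along dim 0.

    All non-batch dimensions must agree, which is why samples in one batch
    need to come from the same resolution bucket.
    """
    if not shapes:
        raise ValueError("empty batch")
    trailing = tuple(shapes[0][1:])
    for shape in shapes:
        if tuple(shape[1:]) != trailing:
            raise ValueError(f"cannot concatenate {shape} with trailing dims {trailing}")
    return (sum(shape[0] for shape in shapes),) + trailing


if __name__ == "__main__":
    # Illustrative sizes only: 5D video latents and 3D text embeddings.
    video_latents = [(1, 16, 21, 60, 104)] * 4
    text_embeddings = [(1, 512, 4096)] * 4
    print(concat_shape_along_batch(video_latents))
    print(concat_shape_along_batch(text_embeddings))
```

A mismatch in any trailing dimension raises an error here, mirroring the failure a tensor concatenation would produce if samples of different resolutions were mixed in one batch.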

nemo_automodel.components.datasets.diffusion.collate_fns.build_video_multiresolution_dataloader(
*,
cache_dir: str,
model_type: str = 'wan',
device: str = 'cpu',
batch_size: int = 1,
dp_rank: int = 0,
dp_world_size: int = 1,
base_resolution: Tuple[int, int] = (512, 512),
drop_last: bool = True,
shuffle: bool = True,
dynamic_batch_size: bool = False,
num_workers: int = 2,
pin_memory: bool = True,
prefetch_factor: int = 2,
) -> Tuple[torchdata.stateful_dataloader.StatefulDataLoader, nemo_automodel.components.datasets.diffusion.sampler.SequentialBucketSampler]

Build a multiresolution video dataloader for TrainDiffusionRecipe.

Uses TextToVideoDataset with SequentialBucketSampler for bucket-based multiresolution video training (e.g. Wan, Hunyuan).
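Bucket-based multiresolution training generally means grouping samples by resolution so that every batch has a single spatial size. The sketch below illustrates that idea in plain Python. `bucket_batches` is a hypothetical function and does not reproduce SequentialBucketSampler, whose ordering and shuffling behavior are not described here.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

Resolution = Tuple[int, int]


def bucket_batches(
    resolutions: Sequence[Resolution],
    batch_size: int,
    drop_last: bool = True,
) -> Iterator[List[int]]:
    """Yield batches of sample indices, one resolution bucket at a time."""
    buckets: Dict[Resolution, List[int]] = defaultdict(list)
    for index, resolution in enumerate(resolutions):
        buckets[resolution].append(index)
    # Visit buckets in first-seen order; a batch never mixes resolutions.
    for indices in buckets.values():
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            if len(batch) < batch_size and drop_last:
                continue  # skip the incomplete tail of this bucket
            yield batch


if __name__ == "__main__":
    sample_resolutions = [(512, 512), (768, 512), (512, 512), (512, 512), (768, 512)]
    for batch in bucket_batches(sample_resolutions, batch_size=2):
        print(batch)
```

Note that `drop_last` applies per bucket in this sketch, so each bucket can lose up to `batch_size - 1` samples.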

Parameters:
  • cache_dir – Directory containing preprocessed cache (metadata.json + shards + WxH/*.meta)

  • model_type – Model type ("wan", "hunyuan", etc.)

  • device – Device to load tensors to

  • batch_size – Batch size per GPU

  • dp_rank – Data parallel rank

  • dp_world_size – Data parallel world size

  • base_resolution – Base resolution for dynamic batch sizing

  • drop_last – Drop incomplete batches

  • shuffle – Shuffle data

  • dynamic_batch_size – Scale batch size by resolution

  • num_workers – DataLoader workers

  • pin_memory – Pin memory for GPU transfer

  • prefetch_factor – Number of batches prefetched per worker

Returns:

Tuple of (DataLoader, SequentialBucketSampler)
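The parameter descriptions say that `dynamic_batch_size` scales the batch size by resolution, with `base_resolution` as the reference, but they do not give the formula. One plausible rule, shown below purely as an assumption, keeps the number of pixels per batch roughly constant. `scaled_batch_size` is a hypothetical function, not the sampler's actual logic.

```python
from typing import Tuple


def scaled_batch_size(
    batch_size: int,
    resolution: Tuple[int, int],
    base_resolution: Tuple[int, int] = (512, 512),
) -> int:
    """Scale `batch_size` inversely with pixel count relative to `base_resolution`.

    Assumed rule for illustration; the sampler's real formula may differ.
    """
    base_pixels = base_resolution[0] * base_resolution[1]
    pixels = resolution[0] * resolution[1]
    # Integer division keeps the per-batch pixel count at or below the base
    # budget; the floor of 1 guarantees that very large samples still train.
    return max(1, batch_size * base_pixels // pixels)


if __name__ == "__main__":
    for res in [(256, 256), (512, 512), (1024, 1024), (2048, 2048)]:
        print(res, scaled_batch_size(8, res))
```

Under this rule, buckets below the base resolution get larger batches and buckets above it get smaller ones, which keeps memory use per step more even across buckets.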