nemo_automodel.components.datasets.diffusion.text_to_video_dataset

Module Contents

Classes

TextToVideoDataset

Text-to-Video dataset with multiresolution bucket organization.

Functions

load_optional_video_fields

Extract optional model-specific fields, moving to device.

collate_optional_video_fields

Concatenate optional video fields present in batch into result dict.

Data

VIDEO_OPTIONAL_FIELDS

API

nemo_automodel.components.datasets.diffusion.text_to_video_dataset.VIDEO_OPTIONAL_FIELDS

('text_mask', 'text_embeddings_2', 'text_mask_2', 'image_embeds')

nemo_automodel.components.datasets.diffusion.text_to_video_dataset.load_optional_video_fields(data: dict, device: str = 'cpu') -> dict

Extract optional model-specific fields, moving to device.
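As a rough illustration of the behavior this docstring describes, the sketch below picks out whichever optional fields exist in a loaded sample and moves them to the target device. It is not the library's implementation: `load_optional_fields_sketch` is a hypothetical name, and the sketch assumes that each field value is a `torch.Tensor`-like object exposing `.to(device)` and that missing fields are skipped rather than raising.

```python
from typing import Any, Dict

# Mirrors the VIDEO_OPTIONAL_FIELDS tuple documented on this page.
VIDEO_OPTIONAL_FIELDS = ("text_mask", "text_embeddings_2", "text_mask_2", "image_embeds")


def load_optional_fields_sketch(data: Dict[str, Any], device: str = "cpu") -> Dict[str, Any]:
    """Hypothetical helper: return the optional fields found in `data`, moved to `device`."""
    out: Dict[str, Any] = {}
    for name in VIDEO_OPTIONAL_FIELDS:
        value = data.get(name)
        if value is None:
            # Assumption: an absent (or None) field is skipped, not treated as an error.
            continue
        # Assumption: values behave like torch.Tensor and expose .to(device).
        out[name] = value.to(device)
    return out
```

Keys outside `VIDEO_OPTIONAL_FIELDS` are ignored by this sketch, so the returned dict only ever contains the optional fields that the sample actually carries.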

nemo_automodel.components.datasets.diffusion.text_to_video_dataset.collate_optional_video_fields(
batch: List[Dict],
result: dict,
) -> None

Concatenate optional video fields present in batch into result dict.
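The in-place collation described here might look roughly like the following sketch. It is an illustration under stated assumptions, not the library's code: `collate_optional_fields_sketch` and its `concat` parameter are hypothetical, a field is treated as "present" only when every sample in the batch carries it, and each per-sample tensor is assumed to have a leading batch dimension so that concatenation along dimension 0 forms the batch.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

# Mirrors the VIDEO_OPTIONAL_FIELDS tuple documented on this page.
VIDEO_OPTIONAL_FIELDS = ("text_mask", "text_embeddings_2", "text_mask_2", "image_embeds")


def collate_optional_fields_sketch(
    batch: List[Dict[str, Any]],
    result: Dict[str, Any],
    concat: Optional[Callable[[Sequence[Any]], Any]] = None,
) -> None:
    """Hypothetical helper: add concatenated optional fields to `result` in place."""
    if concat is None:
        # Default to torch.cat along dim 0; imported lazily so the sketch
        # can also be exercised with a custom `concat` and no PyTorch installed.
        import torch

        def concat(tensors: Sequence[Any]) -> Any:
            return torch.cat(list(tensors), dim=0)

    for name in VIDEO_OPTIONAL_FIELDS:
        # Assumption: a field counts as present only if every sample has it.
        if batch and all(name in sample for sample in batch):
            result[name] = concat([sample[name] for sample in batch])
```

Like the documented function, the sketch returns `None` and mutates `result`, so a caller would build the required entries of the batch dict first and then pass that dict in to receive the optional ones.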

class nemo_automodel.components.datasets.diffusion.text_to_video_dataset.TextToVideoDataset(
cache_dir: str,
model_type: str = 'wan',
device: str = 'cpu',
)

Bases: nemo_automodel.components.datasets.diffusion.base_dataset.BaseMultiresolutionDataset

Text-to-Video dataset with multiresolution bucket organization.

Loads preprocessed .meta files organized by resolution bucket. Compatible with SequentialBucketSampler for multiresolution training.

Initialization

Parameters:
  • cache_dir – Directory containing preprocessed cache (metadata.json + shards + WxH/*.meta)

  • model_type – Model type for model-specific fields ("wan", "hunyuan", etc.)

  • device – Device to load tensors to
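To make the `cache_dir` layout more concrete, here is a minimal sketch that scans such a directory and groups `.meta` files by resolution bucket. It is only an illustration: `group_meta_files_by_bucket` is a hypothetical helper, and it assumes that `WxH` stands for subdirectories named as width and height in pixels joined by a lowercase `x` (for example `832x480`). The class may well discover samples differently, for instance through `metadata.json`.

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple

# Assumption: bucket directories are named "<width>x<height>", e.g. "832x480".
_BUCKET_DIR = re.compile(r"^(\d+)x(\d+)$")


def group_meta_files_by_bucket(cache_dir: str) -> Dict[Tuple[int, int], List[Path]]:
    """Hypothetical helper: map (width, height) to the sorted .meta files in that bucket."""
    buckets: Dict[Tuple[int, int], List[Path]] = {}
    for sub in sorted(Path(cache_dir).iterdir()):
        match = _BUCKET_DIR.match(sub.name)
        if match is None or not sub.is_dir():
            # Skips metadata.json, shard storage, and anything else not named WxH.
            continue
        files = sorted(sub.glob("*.meta"))
        if files:
            buckets[(int(match.group(1)), int(match.group(2)))] = files
    return buckets
```

Grouping by bucket in this way is what lets a bucket-aware sampler draw each batch from a single resolution, so the tensors in a batch share a shape.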

__getitem__(idx: int) -> Dict[str, torch.Tensor]

Load a single video sample from its .meta file.