Image Data Loading
Load image data for curation using NeMo Curator. The primary supported format is WebDataset, which enables efficient distributed processing and annotation of large-scale image-text datasets.
How it Works
NeMo Curator’s image data loading is optimized for large-scale, distributed curation workflows:
Sharded WebDataset Format: Image, caption, and metadata files are grouped into sharded .tar archives, with corresponding .parquet files for fast metadata access.
Unified Metadata: Each record is uniquely identified and linked across image, caption, and metadata files, enabling efficient distributed processing.
High-Performance Loading: Optional .idx index files enable NVIDIA DALI to accelerate data loading, shuffling, and batching on GPU.
Cloud and Local Storage: Datasets can be loaded from local disk or cloud storage (S3, GCS, Azure) using the same API.
Standardized Loader: The ImageTextPairDataset.from_webdataset method loads the entire dataset structure in one step, with no need for separate downloaders, iterators, or extractors.
The result is a standardized ImageTextPairDataset ready for embedding, classification, and filtering in downstream curation pipelines.
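To make the shard layout concrete, the following is a minimal sketch that uses only the Python standard library. It is not NeMo Curator's implementation. The helper names write_shard and read_shard are hypothetical, the .jpg/.txt/.json extensions are an assumed naming convention, and the image payloads are placeholder bytes rather than real images. The sketch shows how files that share a key inside a .tar archive can be grouped back into a single record; the companion .parquet and .idx files are not produced here.

```python
import io
import json
import tarfile
from collections import defaultdict


def write_shard(path, records):
    """Write (key, image_bytes, caption, metadata) tuples into one .tar shard.

    Hypothetical helper: each record becomes three members that share a key,
    e.g. 000000001.jpg, 000000001.txt, 000000001.json (assumed extensions).
    """
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, caption, metadata in records:
            members = {
                f"{key}.jpg": image_bytes,
                f"{key}.txt": caption.encode("utf-8"),
                f"{key}.json": json.dumps(metadata).encode("utf-8"),
            }
            for name, payload in members.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Group the members of a shard by key (the name before the first dot).

    Simplified: assumes flat member names with no directory components.
    """
    samples = defaultdict(dict)
    with tarfile.open(path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)
```

Calling write_shard("00000.tar", records) and then read_shard("00000.tar") returns a dictionary keyed by record ID, where each value maps an extension to the raw bytes of that member. Linking files by a shared key in this way is what allows a loader to treat the image, caption, and metadata as one record.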
Options
Load and process sharded image-text datasets in the WebDataset format for scalable distributed curation.