Image Data Loading

Load image data for curation using NeMo Curator. The primary supported format is WebDataset, which enables efficient distributed processing and annotation of large-scale image-text datasets.

How it Works

NeMo Curator’s image data loading is optimized for large-scale, distributed curation workflows:

Sharded WebDataset Format: Image, caption, and metadata files are grouped into sharded .tar archives, with corresponding .parquet files for fast metadata access.
Unified Metadata: Each record is uniquely identified and linked across image, caption, and metadata files, enabling efficient distributed processing.
High-Performance Loading: Optional .idx index files enable NVIDIA DALI to accelerate data loading, shuffling, and batching on GPU.
Cloud and Local Storage: Datasets can be loaded from local disk or cloud storage (S3, GCS, Azure) using the same API.
Standardized Loader: The ImageTextPairDataset.from_webdataset method loads the entire dataset structure in one step, with no need for separate downloaders, iterators, or extractors.

The result is a standardized ImageTextPairDataset ready for embedding, classification, and filtering in downstream curation pipelines.
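To make the sharded layout described above concrete, here is a minimal sketch that writes and reads one WebDataset-style shard using only the Python standard library. It does not call NeMo Curator. The file extensions (.jpg, .txt, .json), the record keys, the shard name 00000.tar, and the helper names write_shard, read_shard, and demo are assumptions chosen for illustration. The companion .parquet and .idx files are not produced here.

```python
import io
import json
import os
import tarfile
import tempfile
from collections import defaultdict


def write_shard(tar_path, records):
    """Write (key, image_bytes, caption, metadata) records into one .tar shard.

    Hypothetical helper: each record becomes three members that share a key,
    e.g. 000000000.jpg, 000000000.txt, 000000000.json.
    """
    with tarfile.open(tar_path, "w") as tar:
        for key, image_bytes, caption, metadata in records:
            members = {
                f"{key}.jpg": image_bytes,
                f"{key}.txt": caption.encode("utf-8"),
                f"{key}.json": json.dumps(metadata).encode("utf-8"),
            }
            for name, payload in members.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(tar_path):
    """Group shard members by record key (the file name without extension)."""
    grouped = defaultdict(dict)
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            key, ext = os.path.splitext(member.name)
            grouped[key][ext.lstrip(".")] = tar.extractfile(member).read()
    return dict(grouped)


def demo():
    # Placeholder bytes stand in for real JPEG data (assumption for brevity).
    records = [
        ("000000000", b"fake-jpeg-bytes-0", "a red bicycle", {"id": "000000000", "width": 640}),
        ("000000001", b"fake-jpeg-bytes-1", "two cats on a sofa", {"id": "000000001", "width": 512}),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        shard = os.path.join(tmp, "00000.tar")
        write_shard(shard, records)
        return read_shard(shard)


if __name__ == "__main__":
    for key, parts in demo().items():
        print(key, sorted(parts), parts["txt"].decode("utf-8"))
```

Because the image, caption, and metadata for a record share one key, a reader can reassemble each record from a single sequential pass over the archive, which is what makes this layout convenient to split across workers.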

Options

WebDataset

Load and process sharded image-text datasets in the WebDataset format for scalable distributed curation.
