Image Data Loading
Load image data for curation using NeMo Curator. The primary supported format is WebDataset, which enables efficient distributed processing and annotation of large-scale image-text datasets.
How it Works
NeMo Curator’s image data loading is optimized for large-scale, distributed curation workflows:
Sharded WebDataset Format: Image, caption, and metadata files are grouped into sharded .tar archives, with corresponding .parquet files for fast metadata access.
Unified Metadata: Each record is uniquely identified and linked across image, caption, and metadata files, enabling efficient distributed processing.
High-Performance Loading: Optional .idx index files enable NVIDIA DALI to accelerate data loading, shuffling, and batching on GPU.
Cloud and Local Storage: Datasets can be loaded from local disk or cloud storage (S3, GCS, Azure) using the same API.
Standardized Loader: The ImageTextPairDataset.from_webdataset method loads the entire dataset structure in one step, with no need for separate downloaders, iterators, or extractors.
The result is a standardized ImageTextPairDataset ready for embedding, classification, and filtering in downstream curation pipelines.
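To make the shard layout described above concrete, the following is a minimal sketch that uses only the Python standard library, not NeMo Curator itself. It assumes the common WebDataset convention in which the files belonging to one record share a basename and differ only in extension (.jpg, .txt, .json). The helper names write_shard and read_shard, the sample keys, and the placeholder image bytes are all hypothetical and exist only for illustration; the .parquet and .idx companion files are not shown.

```python
import io
import json
import tarfile
from collections import defaultdict


def write_shard(samples):
    """Pack (key, image_bytes, caption, metadata) samples into an in-memory .tar shard."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, image_bytes, caption, metadata in samples:
            # Assumed convention: one file per modality, all sharing the record key.
            members = {
                f"{key}.jpg": image_bytes,
                f"{key}.txt": caption.encode("utf-8"),
                f"{key}.json": json.dumps(metadata).encode("utf-8"),
            }
            for name, payload in members.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_shard(shard_bytes):
    """Group shard members back into records keyed by their shared basename."""
    records = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar.getmembers():
            key, _, extension = member.name.rpartition(".")
            records[key][extension] = tar.extractfile(member).read()
    return dict(records)


# Hypothetical samples; the image payloads are placeholders, not real JPEG data.
shard = write_shard([
    ("000000000", b"fake-jpeg-bytes", "a red bicycle", {"width": 640, "height": 480}),
    ("000000001", b"more-fake-bytes", "two dogs on a beach", {"width": 512, "height": 512}),
])
records = read_shard(shard)
print(sorted(records))
print(records["000000000"]["txt"].decode("utf-8"))
```

Because the key is the only link between an image, its caption, and its metadata, any worker can reassemble complete records from a single shard without consulting the others, which is what makes this layout convenient for distributed processing.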
Options
Load and process sharded image-text datasets in the WebDataset format for scalable distributed curation.