Data Loading Concepts (Image)
This page covers the core concepts for loading and managing image datasets in NeMo Curator.
NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The ImageReaderStage reads only JPEG images from input .tar files, ignoring other content.
Example input directory structure:
What gets loaded:
.tar files: Tar archives containing JPEG images (.jpg)
WebDataset Format Support: If your tar archives follow the WebDataset format and contain additional files (captions as .txt, metadata as .json), the ImageReaderStage will only extract JPEG images. Other file types (.txt, .json, etc.) are automatically ignored during loading.
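To preview which members of an archive would count as JPEG images under the rule above, a small standard-library sketch can help. This is only an illustration of the filtering rule, not how the ImageReaderStage is implemented; the helper name `list_jpeg_members` and the case-insensitive `.jpg` match are assumptions made for this example.

```python
import tarfile


def list_jpeg_members(tar_path):
    """Return the sorted names of regular .jpg files inside a tar archive.

    Hypothetical helper for inspection only; it does not call NeMo Curator.
    """
    with tarfile.open(tar_path, "r") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            # Assumption: match the .jpg extension case-insensitively.
            if member.isfile() and member.name.lower().endswith(".jpg")
        )
```

Running this helper on a shard before curation is a quick way to confirm that the archive contains the images you expect and that captions or metadata files will simply be skipped.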
Each record is identified by a unique ID (e.g., 000000031), used as the prefix for all files belonging to that record.
Example:
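As a rough sketch of the record ID convention, the snippet below groups file names by the part of the base name before the first dot. The function name `group_by_record_id` and the sample file names are hypothetical, chosen to match the ID and extensions mentioned on this page; treating everything before the first dot as the ID is an assumption based on the usual WebDataset layout.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_record_id(file_names):
    """Group archive member names by record ID (hypothetical helper).

    Assumption: the record ID is the base name up to the first dot.
    """
    groups = defaultdict(list)
    for name in file_names:
        record_id = PurePosixPath(name).name.split(".", 1)[0]
        groups[record_id].append(name)
    return dict(groups)


# Illustrative member names for a single record.
members = ["000000031.jpg", "000000031.txt", "000000031.json"]
records = group_by_record_id(members)
```

Here all three files fall under the single key `000000031`. Of these, only `000000031.jpg` would be read as an image.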
The ImageReaderStage uses NVIDIA DALI for efficient, GPU-accelerated loading and preprocessing of JPEG images from tar files. DALI enables: