Image Data Loading

Load image data for curation using NeMo Curator. The primary supported format is tar archives containing JPEG images, which enables efficient distributed processing of large-scale image datasets.

How it Works

NeMo Curator’s image data loading uses a pipeline-based approach optimized for large-scale, distributed curation workflows:

File Partitioning: FilePartitioningStage distributes .tar files across workers for parallel processing.
High-Performance Reading: ImageReaderStage uses NVIDIA DALI to accelerate image loading, decoding, and batching on the GPU, with a CPU fallback when no GPU is available.
Tar Archive Format: Processes sharded .tar archives containing JPEG images (other file types are ignored during loading).
Batch Processing: Images are processed in ImageBatch objects containing decoded images, metadata, and processing results.

The result is a stream of ImageBatch objects ready for embedding, classification, and filtering in downstream pipeline stages.
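The partitioning, JPEG filtering, and batching steps described above can be illustrated with a minimal standard-library sketch. It does not use NeMo Curator or DALI: `partition_tar_files`, `iter_jpeg_batches`, and `RawImageBatch` are hypothetical names introduced here, the images are left undecoded, and identifying JPEGs by file extension is an assumption rather than confirmed ImageReaderStage behavior.

```python
import tarfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator

# Assumption: JPEG members are recognized by extension (case-insensitive).
JPEG_SUFFIXES = {".jpg", ".jpeg"}


@dataclass
class RawImageBatch:
    """Hypothetical stand-in for ImageBatch: raw JPEG bytes plus source metadata."""
    tar_path: str
    names: list[str] = field(default_factory=list)
    payloads: list[bytes] = field(default_factory=list)


def partition_tar_files(root: str, files_per_partition: int = 1) -> list[list[str]]:
    """Group the .tar shards under `root` into partitions, one per unit of worker input."""
    shards = sorted(str(p) for p in Path(root).rglob("*.tar"))
    return [
        shards[i:i + files_per_partition]
        for i in range(0, len(shards), files_per_partition)
    ]


def iter_jpeg_batches(tar_path: str, batch_size: int = 100) -> Iterator[RawImageBatch]:
    """Yield batches of JPEG members from one shard, skipping every other file type."""
    batch = RawImageBatch(tar_path=tar_path)
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            is_jpeg = Path(member.name).suffix.lower() in JPEG_SUFFIXES
            if not member.isfile() or not is_jpeg:
                continue  # captions, JSON sidecars, directories, and so on
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            batch.names.append(member.name)
            batch.payloads.append(extracted.read())
            if len(batch.names) == batch_size:
                yield batch
                batch = RawImageBatch(tar_path=tar_path)
    if batch.names:
        yield batch  # final, possibly smaller batch
```

Each inner list returned by `partition_tar_files` corresponds to the work handed to one worker, and each worker would iterate `iter_jpeg_batches` over its shards. In the actual pipeline, decoding and batching are handled by DALI inside ImageReaderStage rather than by Python code like this.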

Options

Tar Archive Pipeline

Load and process JPEG images from tar archives using FilePartitioningStage and ImageReaderStage for scalable distributed curation.

Tags: FilePartitioningStage, ImageReaderStage, DALI-accelerated

Loading Images from Tar Archives