Overview | NeMo Curator

Load image data for curation using NeMo Curator. The primary supported format is tar archives containing JPEG images, which enables efficient distributed processing of large-scale image datasets.

How it Works

NeMo Curator’s image data loading uses a pipeline-based approach optimized for large-scale, distributed curation workflows:

File Partitioning: FilePartitioningStage distributes .tar files across workers for parallel processing.
High-Performance Reading: ImageReaderStage uses NVIDIA DALI to accelerate image loading, decoding, and batching on the GPU, with a CPU fallback.
Tar Archive Format: Processes sharded .tar archives containing JPEG images (other file types are ignored during loading).
Batch Processing: Images are processed in ImageBatch objects containing decoded images, metadata, and processing results.

The result is a stream of ImageBatch objects ready for embedding, classification, and filtering in downstream pipeline stages.
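
The flow described above can be approximated with the Python standard library alone. The following is a minimal sketch, not NeMo Curator's implementation: partition_files, read_jpeg_batches, and SimpleImageBatch are hypothetical names introduced for illustration, and the sketch keeps raw JPEG bytes rather than decoding them with DALI. It only shows the shape of the work: group .tar shards into partitions, read the JPEG members of each shard, skip everything else, and emit fixed-size batches with metadata.

```python
import tarfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator

# Assumption: JPEG members are recognized by file extension.
JPEG_SUFFIXES = {".jpg", ".jpeg"}


@dataclass
class SimpleImageBatch:
    """Hypothetical stand-in for ImageBatch: raw JPEG bytes plus per-image metadata."""
    images: list[bytes] = field(default_factory=list)
    metadata: list[dict] = field(default_factory=list)


def partition_files(paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Group tar paths into fixed-size partitions, each one a unit of work for a worker."""
    ordered = sorted(paths)
    return [
        ordered[i:i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    ]


def read_jpeg_batches(tar_path: str, batch_size: int) -> Iterator[SimpleImageBatch]:
    """Yield batches of JPEG members from one tar shard, ignoring other file types."""
    batch = SimpleImageBatch()
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            is_jpeg = Path(member.name).suffix.lower() in JPEG_SUFFIXES
            if not member.isfile() or not is_jpeg:
                continue  # non-JPEG entries are skipped during loading
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            batch.images.append(extracted.read())
            batch.metadata.append({"source_tar": tar_path, "member": member.name})
            if len(batch.images) == batch_size:
                yield batch
                batch = SimpleImageBatch()
    if batch.images:
        yield batch  # final, possibly smaller, batch
```

In this sketch each partition returned by partition_files would be handed to one worker, which calls read_jpeg_batches on every shard in the partition. In NeMo Curator, FilePartitioningStage and ImageReaderStage play these two roles, with decoding and batching handled by DALI.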

Options

Tar Archive Pipeline

Load and process JPEG images from tar archives using FilePartitioningStage and ImageReaderStage for scalable distributed curation.