Image Data Loading
Load image data for curation using NeMo Curator. The primary supported format is tar archives containing JPEG images, which enables efficient distributed processing of large-scale image datasets.
How it Works
NeMo Curator’s image data loading uses a pipeline-based approach optimized for large-scale, distributed curation workflows:
- File Partitioning: FilePartitioningStage distributes .tar files across workers for parallel processing.
- High-Performance Reading: ImageReaderStage uses NVIDIA DALI to accelerate image loading, decoding, and batching on GPU with CPU fallback.
- Tar Archive Format: Processes sharded .tar archives containing JPEG images (other file types are ignored during loading).
- Batch Processing: Images are processed in ImageBatch objects containing decoded images, metadata, and processing results.
The result is a stream of ImageBatch objects ready for embedding, classification, and filtering in downstream pipeline stages.
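The flow described above can be illustrated with a minimal, standard-library-only sketch. It is not NeMo Curator's implementation: partition_files, read_jpeg_batches, SketchImageBatch, and write_demo_shard are hypothetical names introduced here for illustration, and the sketch keeps raw JPEG bytes instead of decoding them with DALI. It only shows the two ideas of grouping .tar shards into per-worker partitions and batching the JPEG members of a shard while skipping every other file type.

```python
import io
import os
import tarfile
import tempfile
from dataclasses import dataclass, field

# Assumption: JPEG members are recognized by file extension only.
JPEG_SUFFIXES = (".jpg", ".jpeg")


def partition_files(paths, files_per_partition=1):
    """Group tar paths into fixed-size partitions, one partition per worker task."""
    paths = sorted(paths)
    return [
        paths[i:i + files_per_partition]
        for i in range(0, len(paths), files_per_partition)
    ]


@dataclass
class SketchImageBatch:
    """Hypothetical stand-in for ImageBatch: raw image bytes plus metadata."""
    images: list = field(default_factory=list)
    metadata: list = field(default_factory=list)


def read_jpeg_batches(tar_path, batch_size=2):
    """Yield batches of JPEG members from one tar shard, skipping other files."""
    batch = SketchImageBatch()
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            is_jpeg = member.name.lower().endswith(JPEG_SUFFIXES)
            if not member.isfile() or not is_jpeg:
                continue  # non-JPEG members are ignored during loading
            batch.images.append(tar.extractfile(member).read())
            batch.metadata.append({"tar": tar_path, "name": member.name})
            if len(batch.images) == batch_size:
                yield batch
                batch = SketchImageBatch()
    if batch.images:
        yield batch  # final, possibly smaller batch


def write_demo_shard(tar_path, member_names):
    """Write a tiny tar with placeholder payloads (not real JPEG data)."""
    with tarfile.open(tar_path, "w") as tar:
        for name in member_names:
            payload = name.encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shard = os.path.join(tmp, "shard-000.tar")
        write_demo_shard(shard, ["a.jpg", "a.json", "b.jpeg", "c.txt"])
        for partition in partition_files([shard]):
            for path in partition:
                for batch in read_jpeg_batches(path):
                    print([m["name"] for m in batch.metadata])
```

In the real pipeline each partition would be handed to a separate worker and the images would be decoded on the GPU; here the partitions are simply iterated in a single process.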
Options
Load and process JPEG images from tar archives using FilePartitioningStage and ImageReaderStage for scalable distributed curation.