Data Loading Concepts (Image)
This page covers the core concepts for loading and managing image datasets in NeMo Curator.
Input Data Format and Directory Structure
NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The ImageReaderStage reads only JPEG images from input .tar files, ignoring other content.
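Example input directory structure: a dataset directory containing one or more .tar archives, each holding the files for its records (for example, 000000031.jpg alongside an optional 000000031.txt and 000000031.json).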
What gets loaded:
- .tar files: Tar archives containing JPEG images (.jpg)
- Only JPEG images are extracted and processed
WebDataset Format Support: If your tar archives follow the WebDataset format and contain additional files (captions as .txt, metadata as .json), the ImageReaderStage will only extract JPEG images. Other file types (.txt, .json, etc.) are automatically ignored during loading.
Each record is identified by a unique ID (e.g., 000000031), used as the prefix for all files belonging to that record.
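The filtering rule described above can be sketched with the Python standard library alone. This is only an illustration of the behavior, not the ImageReaderStage implementation: `build_sample_shard` and `jpeg_records` are hypothetical helpers, and the JPEG payload is a placeholder byte string rather than a real image.

```python
import io
import tarfile
from pathlib import PurePosixPath

# Placeholder payload (JPEG start/end markers only), an assumption for the demo.
FAKE_JPEG = b"\xff\xd8\xff\xd9"


def build_sample_shard(record_ids):
    """Build an in-memory tar holding .jpg, .txt and .json files per record."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for record_id in record_ids:
            files = ((".jpg", FAKE_JPEG), (".txt", b"a caption"), (".json", b"{}"))
            for ext, payload in files:
                info = tarfile.TarInfo(name=f"{record_id}{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def jpeg_records(tar_fileobj):
    """Map record ID -> JPEG bytes, skipping every member that is not a .jpg."""
    records = {}
    with tarfile.open(fileobj=tar_fileobj, mode="r") as tar:
        for member in tar:
            path = PurePosixPath(member.name)
            if member.isfile() and path.suffix.lower() == ".jpg":
                # The file name without its extension is the record ID.
                records[path.stem] = tar.extractfile(member).read()
    return records


shard = build_sample_shard(["000000031", "000000032"])
records = jpeg_records(shard)
print(sorted(records))
```

The caption and metadata files are present in the archive but never appear in `records`, mirroring how non-JPEG members are ignored during loading.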
Loading from Local Disk
Example:
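As a rough illustration of the local-disk case, the sketch below discovers the .tar files in a directory and groups them into partitions that could be handed to separate workers. It uses only the Python standard library and does not call the NeMo Curator API; `list_tar_shards`, `partition`, and the `files_per_partition` parameter are hypothetical names chosen for this example.

```python
import tempfile
from pathlib import Path


def list_tar_shards(dataset_dir):
    """Return the .tar files directly under dataset_dir, in sorted order."""
    return sorted(
        p for p in Path(dataset_dir).iterdir() if p.is_file() and p.suffix == ".tar"
    )


def partition(shards, files_per_partition):
    """Group shard paths into consecutive partitions of at most files_per_partition."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    return [
        shards[i:i + files_per_partition]
        for i in range(0, len(shards), files_per_partition)
    ]


# Demo on a temporary directory with three empty shards and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("00000.tar", "00001.tar", "00002.tar", "notes.txt"):
        (Path(tmp) / name).touch()
    shards = list_tar_shards(tmp)
    partitions = partition(shards, files_per_partition=2)
    partition_names = [[p.name for p in group] for group in partitions]

print(partition_names)
# [['00000.tar', '00001.tar'], ['00002.tar']]
```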
DALI Integration for High-Performance Loading
The ImageReaderStage uses NVIDIA DALI for efficient, GPU-accelerated loading and preprocessing of JPEG images from tar files. DALI enables:
- GPU Acceleration: Fast image decoding on GPU with automatic CPU fallback
- Batch Processing: Efficient batching and streaming of image data
- Tar Archive Processing: Built-in support for tar archive format
- Memory Efficiency: Streams images without loading entire datasets into memory
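The batching and streaming ideas in the list above can be illustrated without DALI. The sketch below is a CPU-only stand-in written with the Python standard library: it reads a tar archive as a forward-only stream and yields fixed-size batches of raw JPEG bytes, so only one batch is held in memory at a time. It performs no decoding and no GPU work, and `iter_jpeg_bytes` and `batched` are hypothetical helpers, not DALI or NeMo Curator functions.

```python
import io
import tarfile
from itertools import islice


def iter_jpeg_bytes(tar_fileobj):
    """Yield (name, bytes) for each .jpg member, reading the archive sequentially."""
    # mode="r|" treats the input as a non-seekable stream of tar blocks.
    with tarfile.open(fileobj=tar_fileobj, mode="r|") as tar:
        for member in tar:
            if member.isfile() and member.name.lower().endswith(".jpg"):
                yield member.name, tar.extractfile(member).read()


def batched(items, batch_size):
    """Yield lists of up to batch_size items without materializing the input."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Demo archive: five placeholder JPEGs plus one caption file that is skipped.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    names = [f"{i:09d}.jpg" for i in range(5)] + ["000000000.txt"]
    for name in names:
        payload = b"\xff\xd8\xff\xd9" if name.endswith(".jpg") else b"caption"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

batch_sizes = [len(batch) for batch in batched(iter_jpeg_bytes(buffer), batch_size=2)]
print(batch_sizes)
# [2, 2, 1]
```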
Best Practices and Troubleshooting
- Shard the dataset across multiple .tar files so that shards can be processed in parallel across workers.
- Monitor GPU memory usage and reduce the batch size if you run out of memory.
- If you encounter loading errors, check for missing or mismatched files in your dataset structure.
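One way to check for missing files before loading is to scan each archive and report record IDs that have no .jpg member. The sketch below assumes the record ID is the file name without its extension, as described earlier on this page; `find_records_without_jpeg` is a hypothetical helper written with the Python standard library, not part of NeMo Curator.

```python
import io
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def find_records_without_jpeg(tar_fileobj):
    """Return sorted record IDs that appear in the archive but have no .jpg file."""
    extensions = defaultdict(set)
    with tarfile.open(fileobj=tar_fileobj, mode="r") as tar:
        for member in tar:
            if member.isfile():
                path = PurePosixPath(member.name)
                extensions[path.stem].add(path.suffix.lower())
    return sorted(rid for rid, exts in extensions.items() if ".jpg" not in exts)


# Demo archive: record 000000032 has a caption but no image.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name in ("000000031.jpg", "000000031.txt", "000000032.txt"):
        payload = b"placeholder"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

missing = find_records_without_jpeg(buffer)
print(missing)
# ['000000032']
```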