Loading | NeMo Curator

This page covers the core concepts for loading and managing image datasets in NeMo Curator.

Input Data Format and Directory Structure

NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The ImageReaderStage reads only JPEG images from input .tar files, ignoring other content.

Example input directory structure:

input_dataset/
├── 00000.tar          # Tar archive containing JPEG images
│   ├── 000000000.jpg
│   ├── 000000001.jpg
│   ├── 000000002.jpg
│   ├── ...
├── 00001.tar
│   ├── 000001000.jpg
│   ├── 000001001.jpg
│   ├── ...

What gets loaded:

.tar files: Tar archives containing JPEG images (.jpg)
Only JPEG images are extracted and processed

WebDataset Format Support: If your tar archives follow the WebDataset format and contain additional files (captions as .txt, metadata as .json), the ImageReaderStage will only extract JPEG images. Other file types (.txt, .json, etc.) are automatically ignored during loading.

Each record is identified by a unique ID (e.g., 000000031), used as the prefix for all files belonging to that record.
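The record ID convention can be inspected with nothing but the Python standard library. The following is a minimal sketch, not part of NeMo Curator: the helper names `group_members_by_record` and `jpeg_record_ids` are hypothetical, and the sketch assumes the record ID is the member file name without its extension.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath

JPEG_EXTENSION = ".jpg"  # the only image extension named in this page


def group_members_by_record(tar_path):
    """Map each record ID (file name without extension) to its member extensions."""
    records = defaultdict(set)
    with tarfile.open(tar_path, "r") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            name = PurePosixPath(member.name)
            records[name.stem].add(name.suffix.lower())
    return dict(records)


def jpeg_record_ids(tar_path):
    """Return the sorted record IDs that have a .jpg member; other files are skipped."""
    records = group_members_by_record(tar_path)
    return sorted(rid for rid, exts in records.items() if JPEG_EXTENSION in exts)
```

Running `jpeg_record_ids` on a shard gives a quick preview of which records would contribute an image, while captions and metadata that share the same prefix are left out.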

Loading from Local Disk

Example:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

# Create pipeline for loading
pipeline = Pipeline(name="image_loading")

# Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,         # Process one tar file per partition
    file_extensions=[".tar"],       # Only include .tar files
))

# Load JPEG images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
    dali_batch_size=100,            # Number of images per batch
    verbose=True,
    num_threads=8,                  # Number of threads for I/O operations
    num_gpus_per_worker=0.25,       # Allocate 1/4 GPU per worker
))

# Execute the pipeline
results = pipeline.run()
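To make the effect of files_per_partition and file_extensions concrete, here is a conceptual sketch in plain Python. It is not the FilePartitioningStage implementation: the function name `partition_tar_files` is hypothetical, and the non-recursive directory listing and alphabetical ordering are assumptions made for illustration.

```python
from pathlib import Path


def partition_tar_files(root, files_per_partition=1, file_extensions=(".tar",)):
    """Group matching files in `root` into lists of at most `files_per_partition`."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    # Assumption: only the top level of `root` is listed, in sorted order.
    files = sorted(
        str(path)
        for path in Path(root).iterdir()
        if path.is_file() and path.suffix in file_extensions
    )
    return [
        files[start:start + files_per_partition]
        for start in range(0, len(files), files_per_partition)
    ]
```

With files_per_partition=1, each tar file becomes its own unit of work, so the number of partitions equals the number of shards and each can be handed to a different worker.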

DALI Integration for High-Performance Loading

The ImageReaderStage uses NVIDIA DALI for efficient, GPU-accelerated loading and preprocessing of JPEG images from tar files. DALI enables:

GPU Acceleration: Fast image decoding on GPU with automatic CPU fallback
Batch Processing: Efficient batching and streaming of image data
Tar Archive Processing: Built-in support for tar archive format
Memory Efficiency: Streams images without loading entire datasets into memory
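The batching and streaming ideas can be illustrated without DALI. The sketch below is a CPU-only stand-in built on the standard library `tarfile` module, not DALI's reader and not the ImageReaderStage internals. The function name `iter_jpeg_batches` is hypothetical. It reads a shard as a forward-only stream and yields raw (undecoded) JPEG bytes in fixed-size batches, so only one batch is held in memory at a time.

```python
import tarfile


def iter_jpeg_batches(tar_path, batch_size=100):
    """Yield lists of (member_name, raw_bytes) with at most `batch_size` items each."""
    batch = []
    # Mode "r|" reads the archive sequentially as a stream, without seeking.
    with tarfile.open(tar_path, "r|") as archive:
        for member in archive:
            if not (member.isfile() and member.name.lower().endswith(".jpg")):
                continue  # non-JPEG members are skipped
            batch.append((member.name, archive.extractfile(member).read()))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly smaller batch
```

A consumer can loop over `iter_jpeg_batches(path, batch_size=100)` and process each batch before the next one is read, which mirrors the role of dali_batch_size in the loading example.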

Best Practices and Troubleshooting

Split the dataset across multiple tar shards so that partitions can be processed in parallel across workers.
Monitor GPU memory usage and reduce dali_batch_size if workers run out of memory.
If you encounter loading errors, check for missing or mismatched files in your dataset structure.
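A quick pre-flight check of the dataset directory can narrow down loading errors before a pipeline run. The following sketch is an assumption-laden helper, not a NeMo Curator utility: the name `check_shards` and the convention of reporting -1 for unreadable archives are invented for illustration.

```python
import tarfile
from pathlib import Path


def check_shards(dataset_dir):
    """Return {shard_name: jpeg_count}, using -1 for archives that cannot be read."""
    report = {}
    for shard in sorted(Path(dataset_dir).glob("*.tar")):
        try:
            with tarfile.open(shard, "r") as archive:
                report[shard.name] = sum(
                    1
                    for member in archive.getmembers()
                    if member.isfile() and member.name.lower().endswith(".jpg")
                )
        except (tarfile.TarError, OSError):
            report[shard.name] = -1  # corrupt, truncated, or unreadable archive
    return report
```

Shards reported with a count of 0 contain no .jpg members, and shards reported as -1 could not be opened as tar archives. Both are worth inspecting first.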