Loading Images from Tar Archives
Load and process JPEG images from tar archives using NeMo Curator’s DALI-powered ImageReaderStage.
The ImageReaderStage uses NVIDIA DALI for high-performance image decoding with GPU acceleration and automatic CPU fallback, designed for processing large collections of images stored in tar files.
How it Works
The ImageReaderStage processes directories containing .tar files with JPEG images. While tar files may contain other file types (text, JSON, etc.), the stage extracts only JPEG images for processing.
Directory Structure Example
What gets processed:
- JPEG images: All .jpg files within tar archives
What gets ignored:
- Text files (.txt), JSON files (.json), and other non-JPEG content within tar archives
- Any files outside the tar archives (like standalone Parquet files)
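The filtering rule above can be illustrated with Python's standard tarfile module. This is only a sketch of the selection logic, not how ImageReaderStage is implemented (the stage relies on DALI); the helper names list_jpeg_members and build_demo_tar and the member names are hypothetical, and matching on an exact lowercase .jpg suffix is an assumption.

```python
import io
import tarfile


def list_jpeg_members(tar_bytes: bytes) -> list[str]:
    """Return the names of regular .jpg members in an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        return [
            member.name
            for member in archive.getmembers()
            # Assumption: only an exact ".jpg" suffix counts as a JPEG image.
            if member.isfile() and member.name.endswith(".jpg")
        ]


def build_demo_tar(files: dict[str, bytes]) -> bytes:
    """Build a small in-memory tar archive from a name -> payload mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Hypothetical archive mixing an image with caption and metadata files.
demo = build_demo_tar({
    "000000000.jpg": b"\xff\xd8\xff\xd9",
    "000000000.txt": b"a caption",
    "000000000.json": b"{}",
})
print(list_jpeg_members(demo))  # only the .jpg member is listed
```

The text and JSON members stay in the archive untouched; they are simply never selected.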
Usage
Parameters:
- file_paths: Path to directory containing tar files
- files_per_partition: Number of tar files to process per partition (controls parallelism)
- batch_size: Number of images per ImageBatch for processing
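To make the effect of files_per_partition and batch_size concrete, here is a minimal sketch of the grouping they describe, written in plain Python. The chunked helper and the file names are hypothetical, and the assumption that a trailing partition or batch may be smaller than the requested size is ours rather than something the stage documents.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Five hypothetical tar files grouped with files_per_partition=2.
tar_files = [f"{i:05d}.tar" for i in range(5)]
partitions = list(chunked(tar_files, 2))
print(len(partitions), "partitions")

# Seven hypothetical images grouped with batch_size=3.
images = [f"image_{i}" for i in range(7)]
batches = list(chunked(images, 3))
print([len(batch) for batch in batches])
```

Smaller partitions give more units of work to spread across workers, while batch_size bounds how many decoded images are held together in one batch.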
ImageReaderStage Details
The ImageReaderStage is the core component that handles tar archive loading with the following capabilities:
DALI Integration
- Automatic Device Selection: Uses GPU decoding when CUDA is available, CPU decoding otherwise
- Tar Archive Reader: Leverages DALI’s tar archive reader to process tar files
- Batch Processing: Processes images in configurable batch sizes for memory efficiency
- JPEG-Only Processing: Extracts only JPEG files (ext=["jpg"]) from tar archives
Image Processing
- Format Support: Reads only JPEG images (.jpg) from tar files
- Size Preservation: Maintains original image dimensions (no automatic resizing)
- RGB Output: Converts images to RGB format for consistent downstream processing
- Metadata Extraction: Creates ImageObject instances with image paths and generated IDs
Error Handling
- Missing Components: Skips missing or corrupted images with missing_component_behavior="skip"
- Graceful Fallback: Automatically falls back to CPU processing if GPU is unavailable
- Validation: Validates tar file paths and provides clear error messages
- Non-JPEG Filtering: Silently ignores non-JPEG files within tar archives
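As an illustration of the path validation described above, the following sketch checks a list of tar paths before any reading starts. It is not the stage's own validation code; validate_tar_paths, the specific checks, and the error messages are assumptions chosen for the example.

```python
from pathlib import Path


def validate_tar_paths(paths: list[str]) -> list[Path]:
    """Return the paths as Path objects, raising a descriptive error on the first bad one."""
    validated = []
    for raw in paths:
        path = Path(raw)
        # Assumption: inputs must carry a .tar suffix and exist on disk.
        if path.suffix != ".tar":
            raise ValueError(f"Expected a .tar file, got: {path}")
        if not path.is_file():
            raise FileNotFoundError(f"Tar file not found: {path}")
        validated.append(path)
    return validated


# A hypothetical missing archive produces a clear message instead of a late failure.
try:
    validate_tar_paths(["/nonexistent/shard.tar"])
except FileNotFoundError as err:
    print(err)
```

Failing early on a bad path keeps the error close to its cause, rather than surfacing it partway through decoding.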
Parameters
ImageReaderStage Parameters
Output Format
The pipeline produces ImageBatch objects containing ImageObject instances for downstream curation tasks. Each ImageObject contains:
- image_data: Raw image pixel data as numpy array (H, W, C) in RGB format
- image_path: Path to the original image file in the tar
- image_id: Unique identifier extracted from the filename
- metadata: Additional metadata dictionary
Example ImageObject structure:
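The sketch below mirrors the four fields listed above with a hypothetical stand-in dataclass. It is not the NeMo Curator ImageObject class: ImageObjectSketch, make_image_object, the example path, and the choice to derive image_id from the filename stem are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Any


@dataclass
class ImageObjectSketch:
    """Hypothetical stand-in for ImageObject; not the NeMo Curator class."""

    image_path: str
    image_id: str
    # In the real stage this would hold an (H, W, C) RGB numpy array.
    image_data: Any = None
    metadata: dict[str, Any] = field(default_factory=dict)


def make_image_object(image_path: str, image_data: Any = None) -> ImageObjectSketch:
    """Build the stand-in object, taking image_id from the filename stem (an assumption)."""
    image_id = PurePosixPath(image_path).stem
    return ImageObjectSketch(image_path=image_path, image_id=image_id, image_data=image_data)


# Hypothetical path of an image inside a tar archive.
obj = make_image_object("dataset/00000.tar/000000031.jpg")
print(obj.image_id, obj.metadata)
```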
Note: Only JPEG images are extracted from tar files. Other content (text files, JSON metadata, etc.) within the tar archives is ignored during processing.