Load and process JPEG images from tar archives using NeMo Curator’s DALI-powered ImageReaderStage.
The ImageReaderStage uses NVIDIA DALI for high-performance image decoding with GPU acceleration and automatic CPU fallback, designed for processing large collections of images stored in tar files.
The ImageReaderStage processes directories containing .tar files with JPEG images. While tar files may contain other file types (text, JSON, etc.), the stage extracts only JPEG images for processing.
Directory Structure Example
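As a rough illustration, the sketch below builds a small dataset directory of tar files and lists which members count as JPEG images. The file names (00000.tar, 000000000.jpg) and both helper functions are hypothetical, chosen only for this example. It uses Python's standard tarfile module rather than DALI, so it shows the layout and the .jpg filter, not the stage's actual decoding path.

```python
import io
import tarfile
import tempfile
from pathlib import Path
from typing import List


def write_sample_tar(tar_path: Path, stems: List[str]) -> None:
    """Write a tar holding a .jpg, .txt, and .json member per stem (dummy bytes)."""
    with tarfile.open(tar_path, "w") as tar:
        for stem in stems:
            for ext in (".jpg", ".txt", ".json"):
                payload = b"placeholder"  # not real image data
                info = tarfile.TarInfo(name=stem + ext)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def jpeg_members(tar_path: Path) -> List[str]:
    """Return the names of regular-file members ending in .jpg; skip the rest."""
    with tarfile.open(tar_path, "r") as tar:
        return [m.name for m in tar.getmembers()
                if m.isfile() and m.name.endswith(".jpg")]


# Hypothetical dataset directory containing two tar archives.
dataset_dir = Path(tempfile.mkdtemp(prefix="tar_dataset_"))
write_sample_tar(dataset_dir / "00000.tar", ["000000000", "000000001"])
write_sample_tar(dataset_dir / "00001.tar", ["000000002"])

# Map each tar file to the members that would be treated as images.
layout = {p.name: jpeg_members(p) for p in sorted(dataset_dir.glob("*.tar"))}
for tar_name, names in layout.items():
    print(tar_name, "->", names)
```

Each archive here also contains .txt and .json members, which the filter leaves out.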
What gets processed:
.jpg files within tar archives
What gets ignored:
Text files (.txt), JSON files (.json), and other non-JPEG content within tar archives
ImageReaderStage is compatible with both XennaExecutor and RayDataExecutor. When using RayDataExecutor, the stage automatically signals that it fans out (one tar file can produce multiple ImageBatch objects), which enables Ray Data to repartition batches across downstream workers for parallel processing.
Parameters:
file_paths: Path to directory containing tar files
files_per_partition: Number of tar files to process per partition (controls parallelism)
dali_batch_size: Number of images per ImageBatch for processing
The ImageReaderStage is the core component that handles tar archive loading with the following capabilities:
Reads JPEG entries (ext=["jpg"]) from tar archives
Extracts only JPEG files (.jpg) from tar files
Skips samples with missing components (missing_component_behavior="skip")
The pipeline produces ImageBatch objects containing ImageObject instances for downstream curation tasks. Each ImageObject contains:
image_data: Raw image pixel data as a NumPy array (H, W, C) in RGB format
image_path: Path to the original image file in the tar
image_id: Unique identifier extracted from the filename
metadata: Additional metadata dictionary
Example ImageObject structure:
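As a minimal sketch of that structure, the stand-in dataclass below mirrors the four fields listed above. The class name ImageObjectSketch, the example path, and the use of the filename stem as image_id are assumptions made for illustration. Nested lists stand in for the NumPy array so the example runs with the standard library alone.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Any, Dict


@dataclass
class ImageObjectSketch:
    """Hypothetical stand-in mirroring the four documented ImageObject fields."""
    image_data: Any   # real pipeline: NumPy array of shape (H, W, C), RGB
    image_path: str   # path of the image inside the tar
    image_id: str     # identifier derived from the filename
    metadata: Dict[str, Any] = field(default_factory=dict)


# 2x2 RGB placeholder (H=2, W=2, C=3) in place of decoded pixel data.
pixels = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

# Hypothetical path; taking the filename stem as the ID is an assumption.
path = "dataset/00000.tar/000000031.jpg"
example = ImageObjectSketch(
    image_data=pixels,
    image_path=path,
    image_id=PurePosixPath(path).stem,
)
print(example.image_id, example.image_path, example.metadata)
```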
Note: Only JPEG images are extracted from tar files. Other content (text files, JSON metadata, etc.) within the tar archives is ignored during processing.
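The interaction between files_per_partition and dali_batch_size, and the fan-out from one tar file to several ImageBatch objects, can be pictured with a small pure-Python model. This is a conceptual sketch, not NeMo Curator code: the helper functions and file names are hypothetical. It assumes batches are formed per tar file, which the text above does not state explicitly.

```python
from typing import Dict, List


def partition_files(tar_files: List[str], files_per_partition: int) -> List[List[str]]:
    """Group tar files into partitions of at most files_per_partition files."""
    return [tar_files[i:i + files_per_partition]
            for i in range(0, len(tar_files), files_per_partition)]


def batch_images(image_names: List[str], dali_batch_size: int) -> List[List[str]]:
    """Split one tar file's images into batches of at most dali_batch_size."""
    return [image_names[i:i + dali_batch_size]
            for i in range(0, len(image_names), dali_batch_size)]


# Hypothetical dataset: 3 tar files with 5 JPEG images each.
tar_contents: Dict[str, List[str]] = {
    f"{t:05d}.tar": [f"{t:05d}_{i:03d}.jpg" for i in range(5)]
    for t in range(3)
}

partitions = partition_files(sorted(tar_contents), files_per_partition=2)

# For each partition, batch each tar file separately (assumption noted above).
batches_per_partition: List[List[List[str]]] = []
for part in partitions:
    batches: List[List[str]] = []
    for tar_name in part:
        batches.extend(batch_images(tar_contents[tar_name], dali_batch_size=4))
    batches_per_partition.append(batches)

for part, batches in zip(partitions, batches_per_partition):
    print(part, "->", [len(b) for b in batches], "images per batch")
```

In this model, a smaller files_per_partition yields more partitions that can be processed in parallel, while dali_batch_size sets how many batches each tar file fans out into.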