---
description: >-
  Load and process JPEG images from tar archives using DALI-powered GPU
  acceleration with distributed processing
categories:
  - how-to-guides
tags:
  - tar-archives
  - data-loading
  - dali
  - gpu-acceleration
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: image-only
---

# Loading Images from Tar Archives

Load and process JPEG images from tar archives using NeMo Curator's DALI-powered `ImageReaderStage`.

The `ImageReaderStage` uses NVIDIA DALI for high-performance image decoding with GPU acceleration and automatic CPU fallback, designed for processing large collections of images stored in tar files.

## How it Works

The `ImageReaderStage` processes directories containing `.tar` files with JPEG images. While tar files may contain other file types (text, JSON, etc.), the stage extracts only JPEG images for processing.

**Directory Structure Example**

```text
dataset/
├── 00000.tar
│   ├── 000000000.jpg
│   ├── 000000001.jpg
│   ├── 000000002.jpg
│   ├── ...
├── 00001.tar
│   ├── 000001000.jpg
│   ├── 000001001.jpg
│   ├── ...
```

**What gets processed:**

* **JPEG images**: All `.jpg` files within tar archives

**What gets ignored:**

* Text files (`.txt`), JSON files (`.json`), and other non-JPEG content within tar archives
* Any files outside the tar archives (like standalone Parquet files)

***

## Usage

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

# Create pipeline
pipeline = Pipeline(name="image_loading", description="Load images from tar archives")

# Stage 1: Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read JPEG images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
    batch_size=100,
    verbose=True,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()
```

**Parameters:**

* `file_paths`: Path to directory containing tar files
* `files_per_partition`: Number of tar files to process per partition (controls parallelism)
* `file_extensions`: File extensions to include when partitioning (here, `.tar`)
* `batch_size`: Number of images per `ImageBatch` for processing

***

## ImageReaderStage Details

The `ImageReaderStage` is the core component that handles tar archive loading with the following capabilities:

### DALI Integration

* **Automatic Device Selection**: Uses GPU decoding when CUDA is available, CPU decoding otherwise
* **Tar Archive Reader**: Leverages DALI's tar archive reader to process tar files
* **Batch Processing**: Processes images in configurable batch sizes for memory efficiency
* **JPEG-Only Processing**: Extracts only JPEG files (`ext=["jpg"]`) from tar archives

### Image Processing

* **Format Support**: Reads only JPEG images (`.jpg`) from tar files
* **Size Preservation**: Maintains original image dimensions (no automatic resizing)
* **RGB Output**: Converts images to RGB format for
  consistent downstream processing
* **Metadata Extraction**: Creates ImageObject instances with image paths and generated IDs

### Error Handling

* **Missing Components**: Skips missing or corrupted images with `missing_component_behavior="skip"`
* **Graceful Fallback**: Automatically falls back to CPU processing if GPU is unavailable
* **Validation**: Validates tar file paths and provides clear error messages
* **Non-JPEG Filtering**: Silently ignores non-JPEG files within tar archives

***

## Parameters

### ImageReaderStage Parameters

| Parameter             | Type  | Default | Description                                    |
| --------------------- | ----- | ------- | ---------------------------------------------- |
| `batch_size`          | int   | 100     | Number of images per ImageBatch for processing |
| `verbose`             | bool  | True    | Enable verbose logging for debugging           |
| `num_threads`         | int   | 8       | Number of threads for DALI operations          |
| `num_gpus_per_worker` | float | 0.25    | GPU allocation per worker (0.25 = 1/4 GPU)     |

***

## Output Format

The pipeline produces `ImageBatch` objects containing `ImageObject` instances for downstream curation tasks. Each `ImageObject` contains:

* `image_data`: Raw image pixel data as numpy array (H, W, C) in RGB format
* `image_path`: Path to the original image file in the tar
* `image_id`: Unique identifier extracted from the filename
* `metadata`: Additional metadata dictionary

**Example ImageObject structure:**

```python
ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Shape: (H, W, 3)
    metadata={}
)
```

**Note**: Only JPEG images are extracted from tar files. Other content (text files, JSON metadata, etc.) within the tar archives is ignored during processing.
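
To preview which members of a tar archive a JPEG-only reader would pick up, the filtering rule described above can be approximated with Python's standard `tarfile` module. This is a minimal sketch, not NeMo Curator's or DALI's implementation: the helper names `list_jpeg_members` and `build_demo_tar` are hypothetical, the payloads are placeholders rather than real JPEG data, and both the exact match on a lowercase `.jpg` suffix and the filename-stem ID are assumptions made for illustration.

```python
import io
import tarfile
from pathlib import PurePosixPath


def list_jpeg_members(tar_file: tarfile.TarFile) -> list[str]:
    """Return names of regular-file members whose suffix is exactly '.jpg'."""
    return [
        member.name
        for member in tar_file.getmembers()
        if member.isfile() and PurePosixPath(member.name).suffix == ".jpg"
    ]


def build_demo_tar() -> bytes:
    """Build a small in-memory tar holding placeholder payloads (not real JPEGs)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in ("000000000.jpg", "000000000.txt", "000000000.json", "000000001.jpg"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


demo_bytes = build_demo_tar()
with tarfile.open(fileobj=io.BytesIO(demo_bytes), mode="r") as tar:
    jpeg_names = list_jpeg_members(tar)

# Assumed ID convention for illustration: the filename without its extension.
image_ids = [PurePosixPath(name).stem for name in jpeg_names]

print(jpeg_names)
print(image_ids)
```

Running a check like this against a real archive (by passing `tarfile.open("/path/to/00000.tar")` to the helper) can help confirm that the expected `.jpg` members are present before launching a full pipeline.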