TAR Archives | NeMo Curator

Load and process JPEG images from tar archives using NeMo Curator’s DALI-powered ImageReaderStage.

The ImageReaderStage uses NVIDIA DALI to decode images on the GPU, with automatic fallback to the CPU when no GPU is available. It is designed for processing large collections of images stored in tar files.

How It Works

The ImageReaderStage processes directories containing .tar files with JPEG images. While tar files may contain other file types (text, JSON, etc.), the stage extracts only JPEG images for processing.

Directory Structure Example

dataset/
├── 00000.tar
│   ├── 000000000.jpg
│   ├── 000000001.jpg
│   ├── 000000002.jpg
│   ├── ...
├── 00001.tar
│   ├── 000001000.jpg
│   ├── 000001001.jpg
│   ├── ...

What gets processed:

JPEG images: All .jpg files within tar archives

What gets ignored:

Text files (.txt), JSON files (.json), and other non-JPEG content within tar archives
Any files outside the tar archives (like standalone Parquet files)
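The selection rule above can be illustrated with the standard library alone. This is a minimal sketch of the filtering behavior, not the stage's implementation (which delegates to DALI). The helper names `split_tar_members` and `make_demo_tar` are hypothetical, and matching on an exact, case-sensitive `.jpg` suffix is an assumption, since this page does not say how `.JPG` or `.jpeg` names are treated.

```python
import io
import tarfile


def split_tar_members(tar_bytes: bytes) -> tuple[list[str], list[str]]:
    """Return (jpeg_names, ignored_names) for an in-memory tar archive."""
    jpegs: list[str] = []
    ignored: list[str] = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # directories and links are neither images nor sidecar files
            # Assumption: only names ending in exactly ".jpg" count as JPEG images.
            if member.name.endswith(".jpg"):
                jpegs.append(member.name)
            else:
                ignored.append(member.name)
    return jpegs, ignored


def make_demo_tar(names: list[str]) -> bytes:
    """Build a small tar archive in memory with placeholder file contents."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


demo = make_demo_tar(
    ["000000000.jpg", "000000000.txt", "000000000.json", "000000001.jpg"]
)
jpegs, ignored = split_tar_members(demo)
print("processed:", jpegs)
print("ignored:  ", ignored)
```

Running a check like this against a real shard before launching a pipeline is a cheap way to confirm that the archive contains the files you expect the stage to pick up.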

Usage

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

# Create pipeline
pipeline = Pipeline(name="image_loading", description="Load images from tar archives")

# Stage 1: Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read JPEG images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
    dali_batch_size=100,
    verbose=True,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Run the pipeline
results = pipeline.run()

ImageReaderStage is compatible with both XennaExecutor and RayDataExecutor. When using RayDataExecutor, the stage automatically signals that it fans out (one tar file can produce multiple ImageBatch objects), which enables Ray Data to repartition batches across downstream workers for parallel processing.

Parameters:

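file_paths (FilePartitioningStage): Path to the directory containing tar files
files_per_partition (FilePartitioningStage): Number of tar files per partition; fewer files per partition means more partitions that can be processed in parallel
dali_batch_size (ImageReaderStage): Number of images per ImageBatch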
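To see how these settings interact, the arithmetic below estimates how many partitions and ImageBatch objects a dataset yields. It is a back-of-the-envelope sketch: the function names are hypothetical, and it assumes that a final, partially filled batch is still emitted rather than dropped, which this page does not state. The figure of 1000 images per tar file is inferred from the file numbering in the directory example.

```python
import math


def count_partitions(num_tar_files: int, files_per_partition: int) -> int:
    """Number of partitions produced when tar files are grouped for parallel work."""
    return math.ceil(num_tar_files / files_per_partition)


def count_batches(num_images: int, dali_batch_size: int) -> int:
    """Number of batches for one tar file.

    Assumption: a trailing partial batch is emitted, not dropped.
    """
    return math.ceil(num_images / dali_batch_size)


# Two tar files of 1000 images each, with the settings from the usage example.
partitions = count_partitions(num_tar_files=2, files_per_partition=1)
batches_per_tar = count_batches(num_images=1000, dali_batch_size=100)
print(partitions, batches_per_tar)

# A tar file whose image count is not a multiple of the batch size.
uneven_batches = count_batches(num_images=1050, dali_batch_size=100)
print(uneven_batches)
```

With files_per_partition=1, each tar file becomes its own unit of work, so one tar file fanning out into several batches is exactly the case the RayDataExecutor note describes.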

ImageReaderStage Details

The ImageReaderStage is the core component that handles tar archive loading with the following capabilities:

DALI Integration

Automatic Device Selection: Uses GPU decoding when CUDA is available, CPU decoding otherwise
Tar Archive Reader: Leverages DALI’s tar archive reader to process tar files
Batch Processing: Processes images in configurable batch sizes for memory efficiency
JPEG-Only Processing: Extracts only JPEG files (ext=["jpg"]) from tar archives

Image Processing

Format Support: Reads only JPEG images (.jpg) from tar files
Size Preservation: Maintains original image dimensions (no automatic resizing)
RGB Output: Converts images to RGB format for consistent downstream processing
Metadata Extraction: Creates ImageObject instances with image paths and generated IDs

Error Handling

Missing Components: Skips missing or corrupted images with missing_component_behavior="skip"
Graceful Fallback: Automatically falls back to CPU processing if GPU is unavailable
Validation: Validates tar file paths and provides clear error messages
Non-JPEG Filtering: Silently ignores non-JPEG files within tar archives
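For readers who want a picture of what such a DALI pipeline might look like, here is a rough sketch. It is not the ImageReaderStage source. It assumes the tar reader is DALI's `fn.readers.webdataset`, which is the DALI reader that accepts the `ext` and `missing_component_behavior` arguments quoted above, and it assumes GPU 0 is used when CUDA is available. The helper names `choose_decode_device` and `build_tar_pipeline` are hypothetical. DALI is imported lazily so that the device-selection logic can be read and run without it installed.

```python
from typing import Optional


def choose_decode_device(cuda_available: bool) -> tuple[str, Optional[int]]:
    """Return (decoder device, DALI device_id).

    "mixed" decodes on the GPU from CPU-resident input; "cpu" with
    device_id=None runs the pipeline without a GPU.
    """
    # Assumption: GPU 0 is used whenever CUDA is available.
    return ("mixed", 0) if cuda_available else ("cpu", None)


def build_tar_pipeline(tar_paths, cuda_available, batch_size=100, num_threads=8):
    """Sketch of a DALI pipeline that reads and decodes JPEGs from tar files."""
    # Imported here so this module loads even where DALI is not installed.
    from nvidia.dali import fn, pipeline_def, types

    decode_device, device_id = choose_decode_device(cuda_available)

    @pipeline_def(batch_size=batch_size, num_threads=num_threads, device_id=device_id)
    def tar_pipeline():
        # Request only the "jpg" component; samples without one are skipped.
        raw = fn.readers.webdataset(
            paths=tar_paths,
            ext=["jpg"],
            missing_component_behavior="skip",
        )
        # Decode to RGB at the original resolution (no resize step).
        return fn.decoders.image(raw, device=decode_device, output_type=types.RGB)

    return tar_pipeline()


print(choose_decode_device(True))
print(choose_decode_device(False))
```

Keeping device selection in a small pure function makes the fallback rule easy to verify in isolation, separate from the pipeline construction that needs DALI and, for the GPU path, a CUDA device.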

Parameters

ImageReaderStage Parameters

Parameter	Type	Default	Description
`dali_batch_size`	int	100	Number of images per ImageBatch for processing
`verbose`	bool	True	Enable verbose logging for debugging
`num_threads`	int	8	Number of threads for DALI operations
`num_gpus_per_worker`	float	0.25	GPU allocation per worker (0.25 = 1/4 GPU)

Output Format

The pipeline produces ImageBatch objects containing ImageObject instances for downstream curation tasks. Each ImageObject contains:

image_data: Raw image pixel data as numpy array (H, W, C) in RGB format
image_path: Path to the original image file in the tar
image_id: Unique identifier extracted from the filename
metadata: Additional metadata dictionary

Example ImageObject structure:

ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Shape: (H, W, 3)
    metadata={}
)
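The relationship between image_path and image_id in the example above can be sketched with a stand-in class. `ImageRecord` below is a hypothetical dataclass that mirrors the documented fields; it is not the actual ImageObject class. The sketch also assumes that image_id is the filename without its extension, which matches the example but is not spelled out on this page.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Any


@dataclass
class ImageRecord:
    """Hypothetical stand-in mirroring the documented ImageObject fields."""

    image_path: str
    image_id: str
    image_data: Any = None  # would hold an (H, W, 3) RGB array in practice
    metadata: dict = field(default_factory=dict)


def record_from_path(image_path: str) -> ImageRecord:
    """Derive the ID from the filename stem (assumed convention)."""
    return ImageRecord(
        image_path=image_path,
        image_id=PurePosixPath(image_path).stem,
    )


rec = record_from_path("00000.tar/000000031.jpg")
shard = PurePosixPath(rec.image_path).parent.name  # tar file the image came from
print(rec.image_id, shard)
```

Because the path keeps the tar file name as its first component, downstream code can trace any image back to its source shard without extra bookkeeping.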

Note: Only JPEG images are extracted from tar files. Other content (text files, JSON metadata, etc.) within the tar archives is ignored during processing.