---
description: >-
  Core concepts for loading and managing image datasets from tar archives with
  cloud storage support
categories:
  - concepts-architecture
tags:
  - data-loading
  - tar-archives
  - dali
  - cloud-storage
  - gpu-accelerated
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: concept
modality: image-only
---

# Data Loading Concepts (Image)

This page covers the core concepts for loading and managing image datasets in NeMo Curator.

## Input Data Format and Directory Structure

NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The `ImageReaderStage` reads only JPEG images from input `.tar` files, ignoring other content.

**Example input directory structure:**

```bash
input_dataset/
├── 00000.tar          # Tar archive containing JPEG images
│   ├── 000000000.jpg
│   ├── 000000001.jpg
│   ├── 000000002.jpg
│   ├── ...
├── 00001.tar
│   ├── 000001000.jpg
│   ├── 000001001.jpg
│   ├── ...
```
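To experiment with this layout before pointing a pipeline at real data, you can assemble a small tar dataset with Python's standard `tarfile` module. This is an illustrative sketch: the `.jpg` payloads below are placeholder bytes, not decodable images, and the `build_sample_shard` helper is not part of NeMo Curator.

```python
import io
import tarfile
from pathlib import Path

def build_sample_shard(tar_path: Path, record_ids: list[str]) -> None:
    """Pack one placeholder .jpg member per record ID into a tar archive."""
    with tarfile.open(tar_path, "w") as tar:
        for record_id in record_ids:
            payload = b"\xff\xd8\xff"  # JPEG magic bytes as a stand-in
            info = tarfile.TarInfo(name=f"{record_id}.jpg")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

out_dir = Path("input_dataset")
out_dir.mkdir(exist_ok=True)
build_sample_shard(out_dir / "00000.tar", [f"{i:09d}" for i in range(3)])

with tarfile.open(out_dir / "00000.tar") as tar:
    print(tar.getnames())  # ['000000000.jpg', '000000001.jpg', '000000002.jpg']
```

The resulting `input_dataset/` directory mirrors the structure above, with one tar shard containing sequentially named JPEG members.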

**What gets loaded:**

* `.tar` files: Tar archives containing JPEG images (`.jpg`)
* Only JPEG images are extracted and processed

<Note>
  **WebDataset Format Support**: If your tar archives follow the [WebDataset format](https://github.com/webdataset/webdataset) and contain additional files (captions as `.txt`, metadata as `.json`), the `ImageReaderStage` will **only extract JPEG images**. Other file types (`.txt`, `.json`, etc.) are automatically ignored during loading.
</Note>

Each record is identified by a unique ID (e.g., `000000031`), used as the prefix for all files belonging to that record.
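Under the WebDataset convention, every file sharing a prefix belongs to the same record. A minimal sketch of that grouping logic (the `group_by_record` helper is illustrative, not a NeMo Curator API):

```python
import os
from collections import defaultdict

def group_by_record(member_names: list[str]) -> dict[str, list[str]]:
    """Group tar member names by their record-ID prefix (the filename stem)."""
    records: dict[str, list[str]] = defaultdict(list)
    for name in member_names:
        record_id, _ext = os.path.splitext(name)
        records[record_id].append(name)
    return dict(records)

members = ["000000031.jpg", "000000031.txt", "000000031.json", "000000032.jpg"]
grouped = group_by_record(members)
print(grouped["000000031"])  # ['000000031.jpg', '000000031.txt', '000000031.json']
```

Of the three files in record `000000031`, only the `.jpg` member is loaded by `ImageReaderStage`; the `.txt` caption and `.json` metadata are ignored.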

## Loading from Local Disk

**Example:**

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

# Create pipeline for loading
pipeline = Pipeline(name="image_loading")

# Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,         # Process one tar file per partition
    file_extensions=[".tar"],       # Only include .tar files
))

# Load JPEG images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
    batch_size=100,                 # Number of images per batch
    verbose=True,
    num_threads=8,                  # Number of threads for I/O operations
    num_gpus_per_worker=0.25,       # Allocate 1/4 GPU per worker
))

# Execute the pipeline
results = pipeline.run()
```

## DALI Integration for High-Performance Loading

The `ImageReaderStage` uses [NVIDIA DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/) for efficient, GPU-accelerated loading and preprocessing of JPEG images from tar files. DALI enables:

* **GPU Acceleration:** Fast image decoding on GPU with automatic CPU fallback
* **Batch Processing:** Efficient batching and streaming of image data
* **Tar Archive Processing:** Built-in support for tar archive format
* **Memory Efficiency:** Streams images without loading entire datasets into memory
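DALI's internals are out of scope here, but the memory-efficiency point can be illustrated with a pure-Python sketch (not the actual DALI code path): reading a tar archive as a sequential stream lets you emit fixed-size batches of JPEG bytes without ever holding the whole shard in memory.

```python
import tarfile
from typing import Iterator

def iter_jpeg_batches(tar_path: str, batch_size: int) -> Iterator[list[bytes]]:
    """Stream JPEG members from a tar archive in fixed-size batches."""
    batch: list[bytes] = []
    # "r|" opens the archive as a non-seekable stream, so members are
    # visited sequentially without building a full index in memory.
    with tarfile.open(tar_path, "r|") as tar:
        for member in tar:
            if not member.name.lower().endswith((".jpg", ".jpeg")):
                continue  # skip .txt / .json sidecar files
            fileobj = tar.extractfile(member)
            if fileobj is not None:
                batch.append(fileobj.read())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch
```

In the real stage, DALI additionally decodes each batch on the GPU (with CPU fallback), which this sketch does not attempt.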

## Best Practices and Troubleshooting

* Shard your dataset into many tar files so `FilePartitioningStage` can distribute them across workers for parallel processing.
* Monitor GPU memory usage and lower `batch_size` (or `num_gpus_per_worker`) if you run out of memory.
* If you encounter loading errors, check for missing, corrupt, or mismatched files in your dataset structure.
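A quick pre-flight check can catch structural surprises before a full run. The `find_non_jpeg_members` helper below is a hypothetical sketch (not a NeMo Curator API) that reports tar members `ImageReaderStage` would silently skip:

```python
import tarfile
from pathlib import Path

def find_non_jpeg_members(dataset_dir: str) -> dict[str, list[str]]:
    """Map each tar shard to the member names that are not JPEG images."""
    report: dict[str, list[str]] = {}
    for shard in sorted(Path(dataset_dir).glob("*.tar")):
        with tarfile.open(shard) as tar:
            ignored = [
                name for name in tar.getnames()
                if not name.lower().endswith((".jpg", ".jpeg"))
            ]
        if ignored:
            report[shard.name] = ignored
    return report
```

An empty report means every member in every shard is a JPEG; non-empty entries are expected for WebDataset-style archives that carry `.txt` or `.json` sidecars.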
