Copyright © 2026, NVIDIA Corporation.


Data Loading Concepts (Image)

This page covers the core concepts for loading and managing image datasets in NeMo Curator.

Input Data Format and Directory Structure

NeMo Curator loads image datasets from tar archives for scalable, distributed image curation. The ImageReaderStage reads only JPEG images from input .tar files, ignoring other content.

Example input directory structure:

input_dataset/
├── 00000.tar              # Tar archive containing JPEG images
│   ├── 000000000.jpg
│   ├── 000000001.jpg
│   ├── 000000002.jpg
│   ├── ...
├── 00001.tar
│   ├── 000001000.jpg
│   ├── 000001001.jpg
│   ├── ...

What gets loaded:

  • .tar files: Tar archives containing JPEG images (.jpg)
  • Only JPEG images are extracted and processed

WebDataset Format Support: If your tar archives follow the WebDataset format and contain additional files (captions as .txt, metadata as .json), the ImageReaderStage extracts only the JPEG images. Other file types (.txt, .json, etc.) are ignored during loading.

Each record is identified by a unique ID (e.g., 000000031), used as the prefix for all files belonging to that record.
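To preview what an archive would contribute before running a pipeline, the filtering rule above can be approximated with Python's standard tarfile module. This is a minimal sketch and not the ImageReaderStage implementation: the helpers list_jpeg_records and add_bytes are hypothetical names, and matching on a lowercase .jpg suffix is an assumption based on the extensions shown above.

```python
import io
import os
import tarfile
import tempfile
from pathlib import PurePosixPath


def add_bytes(tar, name, payload):
    """Add an in-memory file to an open tar archive (demo helper)."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def list_jpeg_records(tar_path):
    """Map record ID (file name without extension) to its .jpg member name."""
    records = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            path = PurePosixPath(member.name)
            # Assumption: only members ending in .jpg count as images.
            if member.isfile() and path.suffix.lower() == ".jpg":
                records[path.stem] = member.name
    return records


# Build a tiny WebDataset-style archive; payloads are placeholders, not real JPEGs.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = os.path.join(tmp, "00000.tar")
    with tarfile.open(demo_tar, "w") as tar:
        add_bytes(tar, "000000000.jpg", b"fake-jpeg-bytes")
        add_bytes(tar, "000000000.txt", b"a caption")
        add_bytes(tar, "000000000.json", b"{}")
        add_bytes(tar, "000000001.jpg", b"fake-jpeg-bytes")
    records = list_jpeg_records(demo_tar)

print(records)
```

The caption and metadata members share the record ID prefix but are skipped, so only the two .jpg entries appear in the result.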

Loading from Local Disk

Example:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

# Create pipeline for loading
pipeline = Pipeline(name="image_loading")

# Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,        # Process one tar file per partition
    file_extensions=[".tar"],     # Only include .tar files
))

# Load JPEG images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
    dali_batch_size=100,          # Number of images per batch
    verbose=True,
    num_threads=8,                # Number of threads for I/O operations
    num_gpus_per_worker=0.25,     # Allocate 1/4 GPU per worker
))

# Execute the pipeline
results = pipeline.run()
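To build intuition for how files_per_partition and file_extensions shape the work units, the grouping can be imitated in plain Python. This is an illustrative sketch only and does not reproduce FilePartitioningStage internals: partition_files is a hypothetical helper, and scanning a single directory level in sorted order is an assumption.

```python
import tempfile
from pathlib import Path


def partition_files(root, files_per_partition=1, file_extensions=(".tar",)):
    """Group matching files under `root` into fixed-size partitions."""
    # Assumption: only the top level of `root` is scanned, in sorted order.
    matches = sorted(
        p for p in Path(root).iterdir()
        if p.is_file() and p.suffix in file_extensions
    )
    return [
        matches[i:i + files_per_partition]
        for i in range(0, len(matches), files_per_partition)
    ]


# Demo: three empty shard files plus one unrelated file that gets filtered out.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("00000.tar", "00001.tar", "00002.tar", "notes.txt"):
        (Path(tmp) / name).touch()
    partitions = [
        [p.name for p in group]
        for group in partition_files(tmp, files_per_partition=2)
    ]

print(partitions)
```

With files_per_partition=1, as in the pipeline above, each tar file becomes its own partition, which gives the finest granularity for spreading work across workers.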

DALI Integration for High-Performance Loading

The ImageReaderStage uses NVIDIA DALI for efficient, GPU-accelerated loading and preprocessing of JPEG images from tar files. DALI enables:

  • GPU Acceleration: Fast image decoding on GPU with automatic CPU fallback
  • Batch Processing: Efficient batching and streaming of image data
  • Tar Archive Processing: Built-in support for tar archive format
  • Memory Efficiency: Streams images without loading entire datasets into memory
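The batching and streaming ideas in the list above can be illustrated without DALI. The sketch below is a CPU-only stand-in built on Python's tarfile module, not DALI code and not what ImageReaderStage runs: iter_jpeg_batches and add_bytes are hypothetical helpers, and the payloads are placeholder bytes rather than decoded images.

```python
import io
import os
import tarfile
import tempfile


def add_bytes(tar, name, payload):
    """Add an in-memory file to an open tar archive (demo helper)."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def iter_jpeg_batches(tar_path, batch_size):
    """Yield lists of (member_name, raw_bytes) with at most batch_size items."""
    batch = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:  # members are visited one at a time
            if member.isfile() and member.name.lower().endswith(".jpg"):
                batch.append((member.name, tar.extractfile(member).read()))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch  # final, possibly smaller batch


# Demo: five placeholder images and one caption, read in batches of two.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = os.path.join(tmp, "00000.tar")
    with tarfile.open(demo_tar, "w") as tar:
        for i in range(5):
            add_bytes(tar, f"{i:09d}.jpg", b"fake-jpeg-bytes")
        add_bytes(tar, "000000000.txt", b"a caption")
    batch_sizes = [len(b) for b in iter_jpeg_batches(demo_tar, batch_size=2)]

print(batch_sizes)
```

Because the generator holds only one batch at a time, peak memory scales with the batch size rather than with the archive size, which is the same trade-off the dali_batch_size parameter controls.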

Best Practices and Troubleshooting

  • Shard the dataset into multiple tar files so that partitions can be processed in parallel across workers.
  • Monitor GPU memory and lower dali_batch_size if workers run out of memory.
  • If you encounter loading errors, check for missing or mismatched files in your dataset structure.
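One way to check for missing or mismatched files is to scan each archive for record IDs that have sidecar files but no image. The sketch below uses only the standard library and assumes the record ID is the file name without its extension; find_records_without_jpeg and add_bytes are hypothetical helpers, not part of NeMo Curator.

```python
import io
import os
import tarfile
import tempfile
from pathlib import PurePosixPath


def add_bytes(tar, name, payload):
    """Add an in-memory file to an open tar archive (demo helper)."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def find_records_without_jpeg(tar_path):
    """Return sorted record IDs that appear in the archive without a .jpg file."""
    all_ids, jpeg_ids = set(), set()
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            all_ids.add(path.stem)
            if path.suffix.lower() == ".jpg":
                jpeg_ids.add(path.stem)
    return sorted(all_ids - jpeg_ids)


# Demo: record 000000001 has a caption but no image.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = os.path.join(tmp, "00000.tar")
    with tarfile.open(demo_tar, "w") as tar:
        add_bytes(tar, "000000000.jpg", b"fake-jpeg-bytes")
        add_bytes(tar, "000000000.txt", b"a caption")
        add_bytes(tar, "000000001.txt", b"orphan caption")
    missing = find_records_without_jpeg(demo_tar)

print(missing)
```

Records flagged this way would contribute no image during loading, so a non-empty result is a useful signal that a shard was built incorrectly.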