Data Loading Concepts (Image)#
This page covers the core concepts for loading and managing image datasets in NeMo Curator.
WebDataset Format and Directory Structure#
NeMo Curator uses the WebDataset format for scalable, distributed image curation. A WebDataset directory contains sharded .tar
files, each holding image-text pairs and metadata, along with corresponding .parquet
files for tabular metadata. Optionally, .idx
index files can be provided for fast DALI-based loading.
Example directory structure:
dataset/
├── 00000.tar
│ ├── 000000000.jpg
│ ├── 000000000.txt
│ ├── 000000000.json
│ ├── ...
├── 00001.tar
│ ├── ...
├── 00000.parquet
├── 00001.parquet
├── 00000.idx # optional
├── 00001.idx # optional
.tar
files: Contain images (.jpg
), captions (.txt
), and metadata (.json
).parquet
files: Tabular metadata for each record.idx
files: (Optional) Index files for fast DALI-based loading
Each record is identified by a unique ID (e.g., 000000031
), used as the prefix for all files belonging to that record.
Loading from Local Disk and Cloud Storage#
NeMo Curator supports loading datasets from both local disk and cloud storage (S3, GCS, Azure) using the fsspec library. This allows you to use the same API regardless of where your data is stored.
Example:
from nemo_curator.datasets import ImageTextPairDataset
dataset = ImageTextPairDataset.from_webdataset(
path="/path/to/webdataset", # or "s3://bucket/webdataset"
id_col="key"
)
DALI Integration for High-Performance Loading#
NVIDIA DALI is used for efficient, GPU-accelerated loading and preprocessing of images from WebDataset tar files. DALI enables:
Fast image decoding and augmentation on GPU
Efficient shuffling and batching
Support for large-scale, distributed workflows
Index Files#
For large datasets, DALI can use .idx
index files for each .tar
to enable even faster loading. These index files are generated using DALI’s wds2idx
tool and must be placed alongside the corresponding .tar
files.
How to generate: See DALI documentation
Naming: Each index file must match its
.tar
file (e.g.,00000.tar
→00000.idx
)Usage: Set
use_index_files=True
in your embedder or loader.
Best Practices and Troubleshooting#
Use sharding to enable distributed and parallel processing.
Always include
.parquet
metadata for fast access and filtering.For cloud storage, ensure your environment is configured with the appropriate credentials.
Use
.idx
files for large datasets to maximize DALI performance.Monitor GPU memory and adjust batch size as needed.
If you encounter loading errors, check for missing or mismatched files in your dataset structure.