Data Loading Concepts (Image)#
This page covers the core concepts for loading and managing image datasets in NeMo Curator.
WebDataset Format and Directory Structure#
NeMo Curator uses the WebDataset format for scalable, distributed image curation. A WebDataset directory contains sharded .tar files, each holding image-text pairs and metadata, along with corresponding .parquet files for tabular metadata. Optionally, .idx index files can be provided for fast DALI-based loading.
Example directory structure:
dataset/
├── 00000.tar
│   ├── 000000000.jpg
│   ├── 000000000.txt
│   ├── 000000000.json
│   ├── ...
├── 00001.tar
│   ├── ...
├── 00000.parquet
├── 00001.parquet
├── 00000.idx  # optional
└── 00001.idx  # optional
.tar files: Contain images (.jpg), captions (.txt), and metadata (.json)
.parquet files: Tabular metadata for each record
.idx files: (Optional) Index files for fast DALI-based loading
Each record is identified by a unique ID (e.g., 000000031), used as the prefix for all files belonging to that record.
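To make the record layout concrete, the following minimal sketch groups the members of one shard by record ID using only Python's standard tarfile module. It assumes flat member names of the form <id>.<ext>, as in the example structure above. The helper names group_records and incomplete_records are illustrative and are not part of NeMo Curator.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_records(tar_path):
    """Map each record ID in a shard to the set of file extensions stored for it."""
    records = defaultdict(set)
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            name = PurePosixPath(member.name)
            # The stem (e.g. "000000000") is the record ID; the suffix is the component type.
            records[name.stem].add(name.suffix)
    return dict(records)


def incomplete_records(records, required=(".jpg", ".txt", ".json")):
    """Return the sorted IDs of records that lack any of the required components."""
    return sorted(rid for rid, exts in records.items() if not set(required) <= exts)
```

Calling incomplete_records(group_records("dataset/00000.tar")) would list any record in that shard that lacks an image, caption, or JSON file.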
Loading from Local Disk and Cloud Storage#
NeMo Curator supports loading datasets from both local disk and cloud storage (S3, GCS, Azure) using the fsspec library. This allows you to use the same API regardless of where your data is stored.
Example:
from nemo_curator.datasets import ImageTextPairDataset

dataset = ImageTextPairDataset.from_webdataset(
    path="/path/to/webdataset",  # or "s3://bucket/webdataset"
    id_col="key",
)
DALI Integration for High-Performance Loading#
NVIDIA DALI is used for efficient, GPU-accelerated loading and preprocessing of images from WebDataset tar files. DALI enables:
Fast image decoding and augmentation on GPU
Efficient shuffling and batching
Support for large-scale, distributed workflows
Index Files#
For large datasets, DALI can use .idx index files for each .tar to enable even faster loading. These index files are generated using DALI’s wds2idx tool and must be placed alongside the corresponding .tar files.
How to generate: See DALI documentation
Naming: Each index file must match its .tar file (e.g., 00000.tar → 00000.idx)
Usage: Set use_index_files=True in your embedder or loader.
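The naming rule can be checked before enabling index files. The sketch below assumes the dataset sits in a local directory. It derives the expected index path for each shard and reports the shards that have none. The function names index_path_for and shards_missing_index are hypothetical helpers, not part of DALI or NeMo Curator.

```python
from pathlib import Path


def index_path_for(tar_path):
    """Return the index path expected next to a shard (00000.tar -> 00000.idx)."""
    return Path(tar_path).with_suffix(".idx")


def shards_missing_index(dataset_dir):
    """List the .tar shards in a local directory that have no matching .idx file."""
    return sorted(
        tar for tar in Path(dataset_dir).glob("*.tar")
        if not index_path_for(tar).exists()
    )
```

An empty result suggests every shard has an index file with the expected name. Any shard that is returned still needs an index generated for it.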
Best Practices and Troubleshooting#
Use sharding to enable distributed and parallel processing.
Always include .parquet metadata for fast access and filtering.
For cloud storage, ensure your environment is configured with the appropriate credentials.
Use .idx files for large datasets to maximize DALI performance.
Monitor GPU memory and adjust batch size as needed.
If you encounter loading errors, check for missing or mismatched files in your dataset structure.
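One way to check for missing or mismatched files is to compare shard names across file types. The following is a minimal sketch for a dataset stored on local disk, assuming each .tar shard is expected to have a .parquet file with the same stem. The function name find_mismatched_shards is illustrative only.

```python
from pathlib import Path


def find_mismatched_shards(dataset_dir):
    """Compare .tar and .parquet shard names in a local dataset directory."""
    root = Path(dataset_dir)
    tars = {p.stem for p in root.glob("*.tar")}
    parquets = {p.stem for p in root.glob("*.parquet")}
    return {
        # Shards whose tabular metadata file is absent.
        "tar_without_parquet": sorted(tars - parquets),
        # Metadata files whose shard is absent.
        "parquet_without_tar": sorted(parquets - tars),
    }
```

Both lists being empty indicates that the shard and metadata file names line up. It does not verify the contents of the files themselves.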