WebDataset#

Load and process image-text pair datasets in WebDataset format using NeMo Curator.

WebDataset is a sharded, metadata-rich file format that enables scalable, distributed image curation. It is the primary and currently only supported format for image data loading in NeMo Curator.

How it Works#

A WebDataset directory contains sharded .tar files, each holding image-text pairs and metadata, along with corresponding .parquet files for tabular metadata. Optionally, .idx index files can be provided for fast DALI-based loading. Each record is identified by a unique ID, which is used as the prefix for all files belonging to that record.

Directory Structure Example

dataset/
├── 00000.tar
│   ├── 000000000.jpg
│   ├── 000000000.txt
│   ├── 000000000.json
│   ├── ...
├── 00001.tar
│   ├── ...
├── 00000.parquet
├── 00001.parquet
├── 00000.idx  # optional
├── 00001.idx  # optional

.tar files: Contain images (.jpg), captions (.txt), and metadata (.json)
.parquet files: Tabular metadata for each record
.idx files: (Optional) Index files for fast DALI-based loading

Each record is identified by a unique ID (for example, 000000031), which is used as the prefix for all files belonging to that record.

Usage#

from nemo_curator.datasets import ImageTextPairDataset

dataset = ImageTextPairDataset.from_webdataset(
    path="/path/to/webdataset",
    id_col="key"  # or the name of your unique ID column
)

path: Path to the root of the WebDataset directory (local or cloud storage)
id_col: Name of the unique identifier column in the metadata (commonly key)

Parameters#

Table 13 WebDataset Loading Parameters#
Parameter	Type	Description	Default
`path`	str	Path to the WebDataset directory (local or cloud storage)	Required
`id_col`	str	Name of the unique identifier column in the metadata (for example, `key`)	Required

Output Format#

The loaded ImageTextPairDataset object provides access to metadata, images, and captions for downstream curation tasks. The directory contains:

Sharded .tar files with images, captions, and metadata
.parquet files with tabular metadata
(Optional) .idx files for DALI-based loading

Example record structure:

000000031.jpg: Image file
000000031.txt: Caption file
000000031.json: Metadata file

The ImageTextPairDataset.metadata attribute is a Dask-cuDF DataFrame containing all metadata fields, including the unique ID column.

Customization Options & Performance Tips#

Cloud Storage Support: You can use local paths or cloud storage URLs (for example, S3, GCS, Azure) thanks to fsspec integration. Make sure your environment is configured with the appropriate credentials.
DALI Index Files: For large datasets, provide .idx files for each .tar to enable fast DALI-based loading (see NVIDIA DALI documentation).
GPU Acceleration: Use a GPU-enabled environment for best performance.
Saving Metadata: Use ImageTextPairDataset.save_metadata() to export metadata as Parquet files.
Resharding/Filtering: Use ImageTextPairDataset.to_webdataset() to reshard or filter the dataset and write a new WebDataset directory.