WebDataset
Load and process image-text pair datasets in WebDataset format using NeMo Curator.
WebDataset is a sharded, metadata-rich file format that enables scalable, distributed image curation. It is currently the only supported format for image data loading in NeMo Curator.
How it Works
A WebDataset directory contains sharded .tar files, each holding image-text pairs and per-record metadata, along with a corresponding .parquet file per shard for tabular metadata. Optionally, .idx index files can be provided for fast DALI-based loading.
Directory Structure Example
dataset/
├── 00000.tar
│ ├── 000000000.jpg
│ ├── 000000000.txt
│ ├── 000000000.json
│ ├── ...
├── 00001.tar
│ ├── ...
├── 00000.parquet
├── 00001.parquet
├── 00000.idx # optional
├── 00001.idx # optional
- .tar files: Contain images (.jpg), captions (.txt), and metadata (.json)
- .parquet files: Tabular metadata for each record
- .idx files: (Optional) Index files for fast DALI-based loading
Each record is identified by a unique ID (for example, 000000031), which is used as the prefix for all files belonging to that record.
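The prefix convention above can be checked with the Python standard library alone. The following is a minimal sketch, not part of NeMo Curator: the helper names `group_members_by_record` and `build_demo_shard` are hypothetical, the shard is built in memory for illustration, and the record ID is assumed to be the file name without its final extension.

```python
import io
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath

# Extensions every record is expected to carry, per the layout described above.
EXPECTED_EXTENSIONS = {".jpg", ".txt", ".json"}


def group_members_by_record(tar: tarfile.TarFile) -> dict[str, set[str]]:
    """Map each record ID (file name stem) to the set of extensions found for it."""
    records: dict[str, set[str]] = defaultdict(set)
    for member in tar.getmembers():
        if not member.isfile():
            continue
        name = PurePosixPath(member.name)
        records[name.stem].add(name.suffix)
    return dict(records)


def build_demo_shard() -> bytes:
    """Create a tiny in-memory shard: one complete record, one with only an image."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for file_name in ("000000000.jpg", "000000000.txt", "000000000.json", "000000001.jpg"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name=file_name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


with tarfile.open(fileobj=io.BytesIO(build_demo_shard()), mode="r") as shard:
    records = group_members_by_record(shard)

# Records missing a caption or metadata file show up here.
incomplete = sorted(rid for rid, exts in records.items() if exts != EXPECTED_EXTENSIONS)
print(incomplete)
```

To inspect a real shard, open it with `tarfile.open("/path/to/webdataset/00000.tar")` instead of the in-memory demo.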
Usage
from nemo_curator.datasets import ImageTextPairDataset

dataset = ImageTextPairDataset.from_webdataset(
    path="/path/to/webdataset",
    id_col="key",  # or the name of your unique ID column
)
- path: Path to the root of the WebDataset directory (local or cloud storage)
- id_col: Name of the unique identifier column in the metadata (commonly key)
Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the WebDataset directory (local or cloud storage) | Required |
| id_col | str | Name of the unique identifier column in the metadata (for example, key) | Required |
Output Format
The loaded ImageTextPairDataset object provides access to metadata, images, and captions for downstream curation tasks. The directory contains:
- Sharded .tar files with images, captions, and metadata
- .parquet files with tabular metadata
- (Optional) .idx files for DALI-based loading
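Before loading, it can help to confirm that every shard has its companion files. The following is a minimal sketch for a local directory, using only the standard library. `check_layout` is a hypothetical helper, not a NeMo Curator function, and the demo directory is created with empty placeholder files.

```python
import tempfile
from pathlib import Path


def check_layout(root: Path) -> dict[str, list[str]]:
    """Report which shards lack a .parquet or .idx file with the same stem."""
    shard_stems = sorted(p.stem for p in root.glob("*.tar"))
    return {
        "missing_parquet": [s for s in shard_stems if not (root / f"{s}.parquet").exists()],
        # .idx files are optional, so entries here are informational only.
        "missing_idx": [s for s in shard_stems if not (root / f"{s}.idx").exists()],
    }


# Demo layout: shard 00001 has no .idx file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for file_name in ("00000.tar", "00000.parquet", "00000.idx", "00001.tar", "00001.parquet"):
        (root / file_name).touch()
    report = check_layout(root)

print(report)
```

This sketch only looks at file names on a local file system; it does not open the shards or validate their contents.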
Example record structure:
- 000000031.jpg: Image file
- 000000031.txt: Caption file
- 000000031.json: Metadata file
The ImageTextPairDataset.metadata attribute is a Dask-cuDF DataFrame containing all metadata fields, including the unique ID column.
Customization Options & Performance Tips
- Cloud Storage Support: You can use local paths or cloud storage URLs (for example, S3, GCS, Azure) thanks to fsspec integration. Make sure your environment is configured with the appropriate credentials.
- DALI Index Files: For large datasets, provide .idx files for each .tar to enable fast DALI-based loading (see NVIDIA DALI documentation).
- GPU Acceleration: Use a GPU-enabled environment for best performance.
- Saving Metadata: Use ImageTextPairDataset.save_metadata() to export metadata as Parquet files.
- Resharding/Filtering: Use ImageTextPairDataset.to_webdataset() to reshard or filter the dataset and write a new WebDataset directory.
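To make the idea of resharding after filtering concrete, here is a plain-Python sketch of how surviving records might be assigned to new shards. It illustrates the concept only and does not reflect how `to_webdataset()` works internally. The `plan_shards` helper, the `samples_per_shard` argument name, and the keep/drop flags are all assumptions made for this example; the five-digit shard names follow the directory structure shown earlier.

```python
def plan_shards(record_ids: list[str], samples_per_shard: int) -> dict[str, list[str]]:
    """Assign record IDs to consecutively numbered shards of a fixed maximum size."""
    if samples_per_shard <= 0:
        raise ValueError("samples_per_shard must be positive")
    plan: dict[str, list[str]] = {}
    for start in range(0, len(record_ids), samples_per_shard):
        # Shard names are zero-padded to five digits, for example 00000.tar.
        shard_name = f"{start // samples_per_shard:05d}.tar"
        plan[shard_name] = record_ids[start:start + samples_per_shard]
    return plan


# Hypothetical filter result: (record ID, keep flag) pairs.
filter_results = [
    ("000000000", True),
    ("000000001", False),
    ("000000002", True),
    ("000000003", True),
]
kept = [record_id for record_id, keep in filter_results if keep]
plan = plan_shards(kept, samples_per_shard=2)
print(plan)
```

Because filtered-out records leave gaps, the new shards are repacked densely, which is why the output directory generally has fewer shards than the input.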