WebDataset#
Load and process image-text pair datasets in WebDataset format using NeMo Curator.
WebDataset is a sharded, metadata-rich file format that enables scalable, distributed image curation. It is currently the only format that NeMo Curator supports for loading image data.
How it Works#
A WebDataset directory contains sharded .tar files, each holding image-text pairs and metadata, along with corresponding .parquet files for tabular metadata. Optionally, .idx index files can be provided for fast DALI-based loading. Each record is identified by a unique ID, which is used as the prefix for all files belonging to that record.
Directory Structure Example
dataset/
├── 00000.tar
│ ├── 000000000.jpg
│ ├── 000000000.txt
│ ├── 000000000.json
│ ├── ...
├── 00001.tar
│ ├── ...
├── 00000.parquet
├── 00001.parquet
├── 00000.idx # optional
├── 00001.idx # optional
.tar files: Contain images (.jpg), captions (.txt), and metadata (.json)
.parquet files: Tabular metadata for each record
.idx files: (Optional) Index files for fast DALI-based loading
Each record is identified by a unique ID (for example, 000000031), which is used as the prefix for all files belonging to that record.
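The prefix convention can be illustrated with the Python standard library alone. The following is a minimal sketch, not NeMo Curator code: it builds a tiny in-memory shard with placeholder bytes instead of real images, then groups the archive members by record ID. The helper names `add_member`, `build_example_shard`, and `group_by_record` are hypothetical and exist only for this illustration.

```python
import io
import json
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def add_member(tar, name, payload):
    """Add an in-memory bytes payload to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def build_example_shard():
    """Build a tiny in-memory shard with two records (placeholder bytes, not real images)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, caption in [("000000000", "a red bicycle"), ("000000001", "two cats")]:
            add_member(tar, f"{key}.jpg", b"placeholder image bytes")
            add_member(tar, f"{key}.txt", caption.encode("utf-8"))
            add_member(tar, f"{key}.json", json.dumps({"key": key}).encode("utf-8"))
    buffer.seek(0)
    return buffer


def group_by_record(fileobj):
    """Map each record ID (the file-name prefix) to the payloads stored per extension."""
    records = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            # path.stem is the record ID; path.suffix is ".jpg", ".txt", or ".json".
            records[path.stem][path.suffix] = tar.extractfile(member).read()
    return dict(records)


records = group_by_record(build_example_shard())
for key, parts in records.items():
    print(key, sorted(parts))
```

Grouping on the file-name stem works here because every file of a record shares the same prefix and differs only in its extension.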
Usage#
from nemo_curator.datasets import ImageTextPairDataset

dataset = ImageTextPairDataset.from_webdataset(
    path="/path/to/webdataset",
    id_col="key",  # or the name of your unique ID column
)
path: Path to the root of the WebDataset directory (local or cloud storage)
id_col: Name of the unique identifier column in the metadata (commonly key)
Parameters#
Parameter | Type | Description | Default
---|---|---|---
path | str | Path to the WebDataset directory (local or cloud storage) | Required
id_col | str | Name of the unique identifier column in the metadata (for example, key) | Required
Output Format#
The loaded ImageTextPairDataset object provides access to metadata, images, and captions for downstream curation tasks. The directory contains:
Sharded .tar files with images, captions, and metadata
.parquet files with tabular metadata
(Optional) .idx files for DALI-based loading
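Before loading, it can help to confirm that every shard has its companion files. The sketch below is an illustration under assumptions rather than part of NeMo Curator: `check_shards` is a hypothetical helper, it only handles a local directory (not cloud storage URLs), and it assumes the companion files share the shard's base name as in the directory structure example.

```python
import tempfile
from pathlib import Path


def check_shards(root):
    """Return {shard_stem: {"parquet": bool, "idx": bool}} for every .tar under root."""
    report = {}
    for tar_path in sorted(Path(root).glob("*.tar")):
        report[tar_path.stem] = {
            "parquet": tar_path.with_suffix(".parquet").exists(),
            "idx": tar_path.with_suffix(".idx").exists(),  # optional file
        }
    return report


# Demonstrate on a throwaway directory filled with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["00000.tar", "00000.parquet", "00000.idx", "00001.tar"]:
        (Path(tmp) / name).touch()
    report = check_shards(tmp)

# Shards without a .parquet file are missing their tabular metadata.
missing_parquet = [stem for stem, found in report.items() if not found["parquet"]]
print(report)
print("shards missing .parquet:", missing_parquet)
```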
Example record structure:
000000031.jpg: Image file
000000031.txt: Caption file
000000031.json: Metadata file
The ImageTextPairDataset.metadata attribute is a Dask-cuDF DataFrame containing all metadata fields, including the unique ID column.
Customization Options & Performance Tips#
Cloud Storage Support: You can use local paths or cloud storage URLs (for example, S3, GCS, Azure) thanks to fsspec integration. Make sure your environment is configured with the appropriate credentials.
DALI Index Files: For large datasets, provide .idx files for each .tar to enable fast DALI-based loading (see NVIDIA DALI documentation).
GPU Acceleration: Use a GPU-enabled environment for best performance.
Saving Metadata: Use ImageTextPairDataset.save_metadata() to export metadata as Parquet files.
Resharding/Filtering: Use ImageTextPairDataset.to_webdataset() to reshard or filter the dataset and write a new WebDataset directory.
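To give a rough sense of why index files speed up loading: an index lets a reader seek straight to a sample instead of scanning the archive from the start. The sketch below illustrates that idea with the standard-library tarfile module on a tiny in-memory shard. It does not produce or parse the actual DALI .idx format, and `build_offset_index` and `read_member` are hypothetical helpers written for this example.

```python
import io
import tarfile


def build_offset_index(fileobj):
    """Return {member_name: (data_offset, size)} for regular files in an uncompressed tar."""
    index = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is the byte position where the member's payload starts.
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(fileobj, index, name):
    """Read one member's payload by seeking directly to its recorded offset."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)


# Build a tiny in-memory shard containing two caption files.
shard = io.BytesIO()
with tarfile.open(fileobj=shard, mode="w") as tar:
    for name, payload in [("000000000.txt", b"a red bicycle"), ("000000001.txt", b"two cats")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

shard.seek(0)
index = build_offset_index(shard)

# Random access: no scan of the archive is needed once the index exists.
caption = read_member(shard, index, "000000001.txt")
print(caption.decode("utf-8"))
```

Direct seeking like this only works for uncompressed archives, which is why the sketch opens the shard with mode "r:".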