Datasets#

DocumentDataset#

class nemo_curator.datasets.DocumentDataset(dataset_df: dask.dataframe.DataFrame)#

A collection of documents and document metadata. Internally, it may be distributed across multiple nodes and may reside on GPUs.

classmethod from_pandas(
data,
npartitions: int | None = 1,
chunksize: int | None = None,
sort: bool | None = True,
name: str | None = None,
)#

Creates a document dataset from a pandas DataFrame. For more information on the arguments, see Dask’s from_pandas documentation: https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html

Parameters:
  • data – A pandas DataFrame.

  • npartitions (Optional[int]) – The number of partitions for the underlying Dask DataFrame.

  • chunksize (Optional[int]) – The number of rows per partition; mutually exclusive with npartitions.

  • sort (Optional[bool]) – Whether to sort the input by index first so the resulting partitions have known, sorted divisions.

  • name (Optional[str]) – An optional key name for the underlying Dask DataFrame; by default, one is derived by hashing the input.

Returns:

A document dataset with a pandas backend (on the CPU).
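
A minimal usage sketch, assuming an in-memory pandas DataFrame with a "text" column (the column names here are illustrative):

    import pandas as pd

    from nemo_curator.datasets import DocumentDataset

    # Toy data; a real corpus would have many more rows.
    df = pd.DataFrame(
        {"id": ["doc-0", "doc-1"], "text": ["First document.", "Second document."]}
    )

    # One partition suffices for a small frame; raise npartitions for larger data.
    dataset = DocumentDataset.from_pandas(df, npartitions=1)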

to_json(
output_file_dir,
write_to_filename=False,
keep_filename_column=False,
)#

Writes the dataset to the specified directory as JSON files. See the nemo_curator.utils.distributed_utils.write_to_disk docstring for descriptions of the parameters.
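
A short, hedged example (the output directory is illustrative; write_to_filename=True assumes the dataset tracks each document’s source file name in a dedicated column):

    # Write each partition under the given directory in JSON lines format.
    dataset.to_json("curated_jsonl/", write_to_filename=False)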

to_pandas()#

Creates a pandas DataFrame from the DocumentDataset.

Returns:

A pandas DataFrame (on the CPU).
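
Converting collects every partition into a single in-memory DataFrame, so this is best reserved for datasets that fit comfortably on one machine:

    # Materialize the distributed dataset as one pandas DataFrame.
    pdf = dataset.to_pandas()
    print(pdf.head())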

to_parquet(
output_file_dir,
write_to_filename=False,
keep_filename_column=False,
)#

Writes the dataset to the specified directory as Parquet files. See the nemo_curator.utils.distributed_utils.write_to_disk docstring for descriptions of the parameters.
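
Usage mirrors to_json; Parquet preserves column types and is typically faster to reload (the output directory is again illustrative):

    dataset.to_parquet("curated_parquet/", write_to_filename=False)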

ImageTextPairDataset#

class nemo_curator.datasets.ImageTextPairDataset(
path: str,
metadata: dask.dataframe.DataFrame,
tar_files: List[str],
id_col: str,
)#

A collection of image text pairs stored in WebDataset-like format on disk or in cloud storage.

The exact format assumes a single directory with sharded .tar, .parquet, and (optionally) .idx files. Each tar file should have a unique integer ID as its name (00000.tar, 00001.tar, 00002.tar, etc.). The tar files should contain images in .jpg files, text captions in .txt files, and metadata in .json files. Each record of the dataset is identified by a unique ID that is a mix of the shard ID along with the offset of the record within a shard. For example, the 32nd record of the 43rd shard would be in 00042.tar and have image 000420031.jpg, caption 000420031.txt, and metadata 000420031.json (assuming zero indexing).

In addition to the collection of tar files, ImageTextPairDataset expects there to be .parquet files in the root directory that follow the same naming convention as the shards (00042.tar -> 00042.parquet). Each Parquet file should contain an aggregated tabular form of the metadata for each record, with each row in the Parquet file corresponding to a record in that shard. The metadata, both in the Parquet files and the JSON files, must contain a unique ID column that is the same as its record ID (000420031 in our examples).

Index files may also be in the directory to speed up dataloading with DALI. The index files must be generated by DALI’s wds2idx tool. See https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/data_loading/dataloading_webdataset.html#Creating-an-index for more information. Each index file must follow the same naming convention as the tar files (00042.tar -> 00042.idx).
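
For concreteness, a two-shard dataset might be laid out as follows (the .idx files are optional):

    dataset/
    ├── 00000.tar      # .jpg images, .txt captions, .json metadata per record
    ├── 00000.parquet  # aggregated tabular metadata for shard 00000
    ├── 00000.idx      # optional DALI index for shard 00000
    ├── 00001.tar
    ├── 00001.parquet
    └── 00001.idx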

classmethod from_webdataset(path: str, id_col: str)#

Loads an ImageTextPairDataset from a WebDataset.

Parameters:
  • path (str) – The path to the WebDataset-like format on disk or cloud storage.

  • id_col (str) – The column storing the unique identifier for each record.
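
A minimal loading sketch (the path is illustrative, and "key" is assumed to be the unique ID column in the Parquet metadata):

    from nemo_curator.datasets import ImageTextPairDataset

    dataset = ImageTextPairDataset.from_webdataset(
        path="/data/image_text_pairs", id_col="key"
    )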

save_metadata(
path: str | None = None,
columns: List[str] | None = None,
) -> None#

Saves the metadata of the dataset to the specified path as a collection of Parquet files.

Parameters:
  • path (Optional[str]) – The path to save the metadata to. If None, writes to the original path.

  • columns (Optional[List[str]]) – If specified, only saves a subset of columns.
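
A hedged sketch of saving a column subset (the "caption" column is hypothetical; only columns present in the metadata can be selected):

    # Write Parquet metadata for just the ID and caption columns,
    # alongside the original shards since no path is given.
    dataset.save_metadata(columns=["key", "caption"])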

to_webdataset(
path: str,
filter_column: str,
samples_per_shard: int = 10000,
max_shards: int = 5,
old_id_col: str | None = None,
) -> None#

Saves the dataset in WebDataset format along with Parquet metadata files. The tar files are resharded to the specified number of samples per shard, and the ID value in ImageTextPairDataset.id_col is overwritten with a new ID.

Parameters:
  • path (str) – The output path where the dataset should be written.

  • filter_column (str) – A column of booleans. All samples with a value of True in this column will be included in the output. Otherwise, the sample will be omitted.

  • samples_per_shard (int) – The number of samples to include in each tar file.

  • max_shards (int) – The order of magnitude of the maximum number of shards that will be created from the dataset. Will be used to determine the number of leading zeros in the shard/sample IDs.

  • old_id_col (Optional[str]) – If specified, will preserve the previous ID value in the given column.
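
A hedged end-to-end sketch (the "passes_filter" boolean column is hypothetical, e.g. produced by an upstream quality classifier; the other values echo the defaults above):

    # Keep only records whose filter column is True and reshard
    # the surviving samples into tar files of 10,000 records each.
    dataset.to_webdataset(
        path="/data/filtered_pairs",
        filter_column="passes_filter",
        samples_per_shard=10_000,
        max_shards=5,
        old_id_col="original_id",  # preserve the previous IDs in this column
    )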