Datasets

DocumentDataset

class nemo_curator.datasets.DocumentDataset(dataset_df: dask.dataframe.DataFrame)

A collection of documents and document metadata. Internally, the data may be distributed across multiple nodes and may reside on GPUs.

classmethod from_pandas(data, npartitions: Optional[int] = 1, chunksize: Optional[int] = None, sort: Optional[bool] = True, name: Optional[str] = None)

Creates a document dataset from a pandas DataFrame. For more information on the arguments, see Dask’s from_pandas documentation: https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html

Parameters

data – A pandas DataFrame.

Returns

A document dataset with a pandas backend (on the CPU).
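As a rough illustration of the call described above, the following sketch wraps a few in-memory records in a DocumentDataset. The `RECORDS` sample, the `id` and `text` column names, and the `build_dataset` helper are assumptions made for this example and are not part of the API. The imports are deferred into the helper so that the sample data can be inspected without pandas or NeMo Curator installed.

```python
# Hypothetical sample documents; the "id" and "text" column names are assumptions.
RECORDS = [
    {"id": "doc-0", "text": "The quick brown fox."},
    {"id": "doc-1", "text": "Jumps over the lazy dog."},
    {"id": "doc-2", "text": "A third short document."},
]


def build_dataset(records, npartitions=1):
    """Wrap a list of record dicts in a DocumentDataset (illustrative helper)."""
    # Deferred imports: both packages are only needed when the helper is called.
    import pandas as pd
    from nemo_curator.datasets import DocumentDataset

    frame = pd.DataFrame(records)
    # npartitions is passed to from_pandas, whose arguments follow Dask's from_pandas.
    return DocumentDataset.from_pandas(frame, npartitions=npartitions)
```

With both packages installed, `build_dataset(RECORDS, npartitions=2)` would ask Dask to spread the three rows over up to two partitions.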

to_json(output_file_dir, write_to_filename=False)

Writes the dataset to output_file_dir in JSON format. See the nemo_curator.utils.distributed_utils.write_to_disk docstring for the other parameters.

to_pandas()

Creates a pandas DataFrame from a DocumentDataset.

Returns

A pandas DataFrame (on the CPU).
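One possible use of to_pandas() is a quick inspection of a small dataset. The sketch below assumes only that the object passed in exposes to_pandas() as documented here; the `summarize` helper and its return shape are hypothetical. Because the result is a single pandas DataFrame on the CPU, this approach is presumably best kept to data that fits in memory on one machine.

```python
def summarize(dataset):
    """Return the row count and column names of a dataset (illustrative helper)."""
    # to_pandas() returns one pandas DataFrame held on the CPU.
    frame = dataset.to_pandas()
    return {"rows": len(frame), "columns": list(frame.columns)}


# Usage sketch (requires pandas and NeMo Curator to be installed):
# import pandas as pd
# from nemo_curator.datasets import DocumentDataset
# dataset = DocumentDataset.from_pandas(pd.DataFrame({"text": ["a", "b"]}))
# summary = summarize(dataset)
```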

to_parquet(output_file_dir, write_to_filename=False)

Writes the dataset to output_file_dir in Parquet format. See the nemo_curator.utils.distributed_utils.write_to_disk docstring for the other parameters.
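The two writer methods share the same signature, so they can be driven by one small helper. This is a minimal sketch, assuming `dataset` is a DocumentDataset; the `write_both_formats` name and the `json` and `parquet` subdirectory names are hypothetical choices for this example. Only the `output_file_dir` argument is passed, leaving `write_to_filename` at its default.

```python
from pathlib import Path


def write_both_formats(dataset, root):
    """Write a dataset as JSON and as Parquet under root (illustrative helper)."""
    json_dir = Path(root) / "json"
    parquet_dir = Path(root) / "parquet"
    # Create the output directories up front; this is harmless if the
    # library would also create them.
    json_dir.mkdir(parents=True, exist_ok=True)
    parquet_dir.mkdir(parents=True, exist_ok=True)

    # Each writer takes the output directory as its first argument.
    dataset.to_json(str(json_dir))
    dataset.to_parquet(str(parquet_dir))
    return json_dir, parquet_dir
```

Keeping each format in its own directory avoids mixing file types in a single output location.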