Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Datasets

DocumentDataset

class nemo_curator.datasets.DocumentDataset(dataset_df: dask.dataframe.DataFrame)

A collection of documents and document metadata. Internally, the data may be distributed across multiple nodes and may reside on GPUs.

classmethod from_pandas(
data,
npartitions: int | None = 1,
chunksize: int | None = None,
sort: bool | None = True,
name: str | None = None,
)

Creates a document dataset from a pandas DataFrame. For more information on the arguments, see Dask's from_pandas documentation: https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html

Parameters:

data – A pandas DataFrame.

Returns:

A document dataset with a pandas backend (on the CPU).

to_json(output_file_dir, write_to_filename=False)

Writes the dataset as JSON files to the directory output_file_dir. See the nemo_curator.utils.distributed_utils.write_to_disk docstring for the other parameters.

to_pandas()

Creates a pandas DataFrame from a DocumentDataset.

Returns:

A pandas DataFrame (on the CPU).

to_parquet(output_file_dir, write_to_filename=False)

Writes the dataset as Parquet files to the directory output_file_dir. See the nemo_curator.utils.distributed_utils.write_to_disk docstring for the other parameters.