Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Datasets
DocumentDataset
- class nemo_curator.datasets.DocumentDataset(dataset_df: dask.dataframe.DataFrame)
A collection of documents and document metadata. Internally, the data may be distributed across multiple nodes and may reside on GPUs.
- classmethod from_pandas(data, npartitions: Optional[int] = 1, chunksize: Optional[int] = None, sort: Optional[bool] = True, name: Optional[str] = None)
Creates a document dataset from a pandas dataframe. For more information on the arguments, see Dask’s from_pandas documentation: https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html
- Parameters
data – A pandas dataframe
- Returns
A document dataset with a pandas backend (on the CPU).
- to_json(output_file_dir, write_to_filename=False)
Writes the dataset to output_file_dir in JSON format. See the nemo_curator.utils.distributed_utils.write_to_disk docstring for other parameters.
- to_pandas()
Creates a pandas dataframe from a DocumentDataset.
- Returns
A pandas dataframe (on the CPU).
- to_parquet(output_file_dir, write_to_filename=False)
Writes the dataset to output_file_dir in Parquet format. See the nemo_curator.utils.distributed_utils.write_to_disk docstring for other parameters.
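The methods above can be combined into a small round trip, sketched here under stated assumptions. The helper names export_dataset and build_and_export, the sample columns (id, text), and the directory arguments are hypothetical and not part of NeMo Curator; only from_pandas, to_pandas, to_json, and to_parquet come from the signatures documented above. Creating the output directories up front is a precaution, since this reference does not say whether the writers create them.

```python
import os

import pandas as pd


def export_dataset(dataset, json_dir: str, parquet_dir: str) -> None:
    """Write `dataset` once as JSON and once as Parquet (hypothetical helper)."""
    # Precaution: make sure both output directories exist before writing.
    os.makedirs(json_dir, exist_ok=True)
    os.makedirs(parquet_dir, exist_ok=True)
    # write_to_filename=False is the documented default, passed explicitly here.
    dataset.to_json(json_dir, write_to_filename=False)
    dataset.to_parquet(parquet_dir, write_to_filename=False)


def build_and_export(json_dir: str, parquet_dir: str) -> pd.DataFrame:
    """Build a small dataset, write it to disk, and return it as pandas."""
    # Imported here so that the helper above works without NeMo Curator installed.
    from nemo_curator.datasets import DocumentDataset

    # Hypothetical sample documents; the column names are assumptions.
    frame = pd.DataFrame(
        {"id": ["doc-0", "doc-1"], "text": ["hello world", "another document"]}
    )
    dataset = DocumentDataset.from_pandas(frame, npartitions=1)
    export_dataset(dataset, json_dir, parquet_dir)
    # to_pandas() returns a CPU pandas dataframe.
    return dataset.to_pandas()
```

Calling build_and_export("out/json", "out/parquet") requires NeMo Curator to be installed. Keeping export_dataset separate means the writing step can be reused for any DocumentDataset, regardless of how it was created.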