datasets.doc_dataset
Module Contents

Classes

DocumentDataset: A collection of documents and document metadata. Internally it may be distributed across multiple nodes, and may be on GPUs.

API

- class datasets.doc_dataset.DocumentDataset(dataset_df: dask.dataframe.DataFrame)

A collection of documents and document metadata. Internally it may be distributed across multiple nodes, and may be on GPUs.
Initialization
- classmethod from_pandas(
- data: pandas.DataFrame,
- npartitions: int | None = 1,
- chunksize: int | None = None,
- sort: bool | None = True,
- name: str | None = None,
Creates a document dataset from a pandas DataFrame. For more information on the arguments, see Dask's from_pandas documentation: https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html

Args:
  data: A pandas DataFrame.

Returns:
  A document dataset with a pandas backend (on the CPU).
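A minimal usage sketch (assuming DocumentDataset is importable as nemo_curator.datasets.DocumentDataset; adjust the import to your install):

```python
# Hedged sketch: build a small CPU-backed DocumentDataset from an in-memory
# pandas DataFrame, then peek at the first rows.
import pandas as pd
from nemo_curator.datasets import DocumentDataset  # assumed import path

df = pd.DataFrame({"id": [0, 1], "text": ["first document", "second document"]})
dataset = DocumentDataset.from_pandas(df, npartitions=2)
print(dataset.head(2))
```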
- head(n: int = 5) -> cudf.DataFrame | pandas.DataFrame
- persist() -> datasets.doc_dataset.DocumentDataset
- classmethod read_custom(
- input_files: str | list[str],
- file_type: str,
- read_func_single_partition: collections.abc.Callable[[list[str], str, bool, str | dict, dict], cudf.DataFrame | pandas.DataFrame],
- files_per_partition: int | None = None,
- backend: Literal["pandas", "cudf"] | None = None,
- add_filename: bool | str = False,
- columns: list[str] | None = None,
- input_meta: str | dict | None = None,
- **kwargs,
Read custom data from a file or directory based on a custom read function.
Args:
  input_files: The path of the input file(s). If input_files is a string that ends with the file_type, it is treated as a single file. If it is a string that does not end with the file_type, it is treated as a directory and all files under that directory are read. If it is a list of strings, each string is treated as a file path.
  file_type: The type of the file to read.
  read_func_single_partition: A function that reads a single file or a list of files into a single Dask partition. The function should take the following arguments:
    - files: A list of file paths.
    - file_type: The type of the file to read (in case you want to handle different file types differently).
    - backend: See below.
    - add_filename: See below.
    - columns: See below.
    - input_meta: See below.
  backend: The backend to use for reading the data, in case you want to handle pd.DataFrame or cudf.DataFrame.
  files_per_partition: The number of files to read per partition.
  add_filename: Whether to add a filename column to the DataFrame. If True, a new column named file_name is added. If a string, that string is used as the column name. Default is False.
  columns: If not None, only these columns will be returned from the output of the read_func_single_partition function.
  input_meta: A dictionary, or a string formatted as a dictionary, that outlines the field names and their respective data types within the JSONL input file.
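A hedged sketch of a custom reader (the .tsv directory, the import path, and the exact set of keyword arguments passed to the reader are assumptions based on the description above):

```python
import pandas as pd
from nemo_curator.datasets import DocumentDataset  # assumed import path


def read_tsv_partition(files, file_type, backend, add_filename, columns, input_meta, **kwargs):
    # Read every file assigned to this Dask partition into one pandas DataFrame.
    frames = []
    for path in files:
        df = pd.read_csv(path, sep="\t")
        if add_filename:
            # Mirror the documented add_filename behavior: True -> "file_name",
            # a string -> use that string as the column name.
            col = add_filename if isinstance(add_filename, str) else "file_name"
            df[col] = path
        frames.append(df)
    out = pd.concat(frames, ignore_index=True)
    return out[columns] if columns else out


dataset = DocumentDataset.read_custom(
    input_files="/data/my_corpus",  # hypothetical directory of .tsv files
    file_type="tsv",
    read_func_single_partition=read_tsv_partition,
    files_per_partition=4,
    backend="pandas",
)
```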
- classmethod read_json(
- input_files: str | list[str],
- backend: Literal["pandas", "cudf"] = 'pandas',
- files_per_partition: int | None = None,
- blocksize: str | None = '1gb',
- add_filename: bool | str = False,
- input_meta: str | dict | None = None,
- columns: list[str] | None = None,
- **kwargs,
Read JSON or JSONL file(s).
Args:
  input_files: The path of the input file(s).
  backend: The backend to use for reading the data.
  files_per_partition: The number of files to read per partition.
  add_filename: Whether to add a filename column to the DataFrame. If True, a new column named file_name is added. If a string, that string is used as the column name. Default is False.
  input_meta: A dictionary, or a string formatted as a dictionary, that outlines the field names and their respective data types within the JSONL input file.
  columns: If not None, only these columns will be read from the file.
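A minimal sketch of reading a directory of JSONL files (the import path and directory are assumptions):

```python
from nemo_curator.datasets import DocumentDataset  # assumed import path

dataset = DocumentDataset.read_json(
    "/data/jsonl_corpus",        # hypothetical directory of .jsonl files
    backend="pandas",
    blocksize="256mb",
    add_filename=True,           # adds a "file_name" column
    columns=["id", "text"],      # read only the fields you need
)
```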
- classmethod read_parquet(
- input_files: str | list[str],
- backend: Literal["pandas", "cudf"] = 'pandas',
- files_per_partition: int | None = None,
- blocksize: str | None = '1gb',
- add_filename: bool | str = False,
- columns: list[str] | None = None,
- **kwargs,
Read Parquet file(s).
Args:
  input_files: The path of the input file(s).
  backend: The backend to use for reading the data.
  files_per_partition: The number of files to read per partition.
  add_filename: Whether to add a filename column to the DataFrame. If True, a new column named file_name is added. If a string, that string is used as the column name. Default is False.
  columns: If not None, only these columns will be read from the file. There is a significant performance gain when specifying columns for Parquet files.
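A minimal sketch of reading Parquet files on the GPU (the file paths and import path are assumptions, and the "cudf" backend requires a RAPIDS-enabled environment):

```python
from nemo_curator.datasets import DocumentDataset  # assumed import path

dataset = DocumentDataset.read_parquet(
    ["/data/part-000.parquet", "/data/part-001.parquet"],  # hypothetical paths
    backend="cudf",
    columns=["id", "text"],  # column pruning is a large win for Parquet
)
```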
- classmethod read_pickle(
- input_files: str | list[str],
- backend: Literal["pandas", "cudf"] = 'pandas',
- columns: list[str] | None = None,
- **kwargs,
Read Pickle file(s).
Args:
  input_files: The path of the input file(s).
  backend: The backend to use for reading the data.
  columns: If not None, only these columns will be read from the file.
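A minimal sketch of reading pickled DataFrame file(s) (the path and import path are assumptions):

```python
from nemo_curator.datasets import DocumentDataset  # assumed import path

dataset = DocumentDataset.read_pickle(
    "/data/corpus.pkl",      # hypothetical pickle written by to_pickle
    backend="pandas",
    columns=["id", "text"],
)
```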
- repartition(*args, **kwargs) -> datasets.doc_dataset.DocumentDataset
- to_json(
- output_path: str,
- write_to_filename: bool | str = False,
- keep_filename_column: bool = False,
- partition_on: str | None = None,
- compression: str | None = None,
Writes the dataset to the specified path in JSONL format.
If write_to_filename is True, the DataFrame is expected to have a column that specifies the filename for each document. This column is named file_name by default, or a custom name if write_to_filename is a string.

Args:
  output_path (str): The directory or file path where the dataset will be written.
  write_to_filename (Union[bool, str]): Determines how filenames are handled.
    - If True, uses the file_name column in the DataFrame to determine filenames.
    - If a string, uses that string as the column name for filenames.
    - If False, writes all data to the specified output_path.
  keep_filename_column (bool): If True, retains the filename column in the output. If False, the filename column is dropped from the output.
  partition_on (Optional[str]): The column name used to partition the data. If specified, data is partitioned based on unique values in this column, with each partition written to a separate directory.
  compression (Optional[str]): The compression to use for the output file. If specified, the output file will be compressed using the specified compression. Supported compression types are "gzip" or None.

For more details, refer to the write_to_disk function in nemo_curator.utils.distributed_utils.
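A minimal sketch of writing the dataset back out as JSONL (paths and import path are assumptions):

```python
from nemo_curator.datasets import DocumentDataset  # assumed import path

dataset = DocumentDataset.read_json("/data/jsonl_corpus", add_filename=True)
dataset.to_json(
    "/data/output_jsonl",
    write_to_filename=True,      # one output file per value in the "file_name" column
    keep_filename_column=False,  # drop the helper column from the written records
    compression="gzip",
)
```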
- to_pandas() -> pandas.DataFrame

Creates a pandas DataFrame from a DocumentDataset.

Returns: A pandas DataFrame (on the CPU).
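A small sketch of materializing a (possibly distributed) dataset as a single in-memory pandas DataFrame (the path and import path are assumptions):

```python
from nemo_curator.datasets import DocumentDataset  # assumed import path

dataset = DocumentDataset.read_parquet("/data/part-000.parquet")  # hypothetical path
pdf = dataset.to_pandas()  # pulls all partitions to the CPU as one pandas DataFrame
print(pdf.shape)
```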
- to_parquet(
- output_path: str,
- write_to_filename: bool | str = False,
- keep_filename_column: bool = False,
- partition_on: str | None = None,
Writes the dataset to the specified path in Parquet format.
If write_to_filename is True, the DataFrame is expected to have a column that specifies the filename for each document. This column is named file_name by default, or a custom name if write_to_filename is a string.

Args:
  output_path (str): The directory or file path where the dataset will be written.
  write_to_filename (Union[bool, str]): Determines how filenames are handled.
    - If True, uses the file_name column in the DataFrame to determine filenames.
    - If a string, uses that string as the column name for filenames.
    - If False, writes all data to the specified output_path.
  keep_filename_column (bool): If True, retains the filename column in the output. If False, the filename column is dropped from the output.
  partition_on (Optional[str]): The column name used to partition the data. If specified, data is partitioned based on unique values in this column, with each partition written to a separate directory.

For more details, refer to the write_to_disk function in nemo_curator.utils.distributed_utils.
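A minimal sketch of writing Parquet output partitioned by a column (the "language" column and the paths are assumptions):

```python
from nemo_curator.datasets import DocumentDataset  # assumed import path

dataset = DocumentDataset.read_json("/data/jsonl_corpus")
dataset.to_parquet(
    "/data/output_parquet",
    partition_on="language",  # one sub-directory per unique value of "language"
)
```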
- to_pickle(
- output_path: str,
- write_to_filename: bool | str = False,