Datasets#

DocumentDataset#

class nemo_curator.datasets.DocumentDataset(dataset_df: dask.dataframe.DataFrame)#

A collection of documents and document metadata. Internally it may be distributed across multiple nodes, and may be on GPUs.

classmethod from_pandas(
data,
npartitions: int | None = 1,
chunksize: int | None = None,
sort: bool | None = True,
name: str | None = None,
)#

Creates a document dataset from a pandas DataFrame. For more information on the arguments, see Dask’s from_pandas documentation: https://docs.dask.org/en/stable/generated/dask.dataframe.from_pandas.html

Parameters:

data – A pandas dataframe

Returns:

A document dataset with a pandas backend (on the CPU).
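
For example, a small in-memory DataFrame can be wrapped as a DocumentDataset (a minimal sketch; the column names are illustrative):

```python
import pandas as pd

from nemo_curator.datasets import DocumentDataset

# Illustrative two-row DataFrame; any tabular schema works.
df = pd.DataFrame({"id": [0, 1], "text": ["First document.", "Second document."]})

# Wrap the pandas DataFrame in a Dask-backed DocumentDataset with a single partition.
dataset = DocumentDataset.from_pandas(df, npartitions=1)
```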

classmethod read_json(
input_files: str | List[str],
backend: Literal['pandas', 'cudf'] = 'pandas',
files_per_partition: int | None = None,
blocksize: str | None = '1gb',
add_filename: bool | str = False,
input_meta: str | dict | None = None,
columns: List[str] | None = None,
**kwargs,
) DocumentDataset#

Read JSON or JSONL file(s).

Parameters:
  • input_files – The path of the input file(s).

  • backend – The backend to use for reading the data.

  • files_per_partition – The number of files to read per partition.

  • add_filename – Whether to add a filename column to the DataFrame. If True, the new column is named file_name. If a string, that string is used as the column name. Default is False.

  • input_meta – A dictionary or a string formatted as a dictionary, which outlines the field names and their respective data types within the JSONL input file.

  • columns – If not None, only these columns will be read from the file.
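
A minimal usage sketch; the file paths and the input_meta schema are illustrative assumptions:

```python
from nemo_curator.datasets import DocumentDataset

# Hypothetical JSONL files and schema, shown only for illustration.
dataset = DocumentDataset.read_json(
    ["data/part_000.jsonl", "data/part_001.jsonl"],
    backend="pandas",
    add_filename=True,  # adds a "file_name" column recording the source file
    input_meta={"id": "str", "text": "str"},  # field names and dtypes in the JSONL
)
```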

classmethod read_parquet(
input_files: str | List[str],
backend: Literal['pandas', 'cudf'] = 'pandas',
files_per_partition: int | None = None,
blocksize: str | None = '1gb',
add_filename: bool | str = False,
columns: List[str] | None = None,
**kwargs,
) DocumentDataset#

Read Parquet file(s).

Parameters:
  • input_files – The path of the input file(s).

  • backend – The backend to use for reading the data.

  • files_per_partition – The number of files to read per partition.

  • add_filename – Whether to add a filename column to the DataFrame. If True, the new column is named file_name. If a string, that string is used as the column name. Default is False.

  • columns – If not None, only these columns will be read from the file. There is a significant performance gain when specifying columns for Parquet files.
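
A minimal usage sketch; the file paths and column names are illustrative assumptions:

```python
from nemo_curator.datasets import DocumentDataset

# Hypothetical Parquet files; restricting `columns` avoids reading unneeded data.
dataset = DocumentDataset.read_parquet(
    ["data/part_000.parquet", "data/part_001.parquet"],
    backend="cudf",  # requires a GPU environment with cuDF and dask-cudf installed
    columns=["id", "text"],
)
```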

classmethod read_pickle(
input_files: str | List[str],
backend: Literal['pandas', 'cudf'] = 'pandas',
columns: List[str] | None = None,
**kwargs,
) DocumentDataset#

Read Pickle file(s).

Parameters:
  • input_files – The path of the input file(s).

  • backend – The backend to use for reading the data.

  • files_per_partition – The number of files to read per partition.

  • add_filename – Whether to add a filename column to the DataFrame. If True, the new column is named file_name. If a string, that string is used as the column name. Default is False.

  • columns – If not None, only these columns will be read from the file.
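
A minimal usage sketch; the file path and column names are illustrative assumptions:

```python
from nemo_curator.datasets import DocumentDataset

# Hypothetical pickle file containing a previously saved DataFrame.
dataset = DocumentDataset.read_pickle(
    "data/dataset.pkl",
    backend="pandas",
    columns=["id", "text"],
)
```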

to_json(
output_path: str,
write_to_filename: bool | str = False,
keep_filename_column: bool = False,
partition_on: str | None = None,
)#

Writes the dataset to the specified path in JSONL format.

If write_to_filename is True, the DataFrame is expected to have a column that specifies the filename for each document. By default this column is named file_name; if write_to_filename is a string, that string is used as the column name instead.

Parameters:
  • output_path (str) – The directory or file path where the dataset will be written.

  • write_to_filename (Union[bool, str]) – Determines how filenames are handled. If True, uses the file_name column in the DataFrame to determine filenames. If a string, uses that string as the column name for filenames. If False, writes all data to the specified output_path.

  • keep_filename_column (bool) – If True, retains the filename column in the output. If False, the filename column is dropped from the output.

  • partition_on (Optional[str]) – The column name used to partition the data. If specified, data is partitioned based on unique values in this column, with each partition written to a separate directory.

For more details, refer to the write_to_disk function in nemo_curator.utils.distributed_utils.
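
A short sketch, assuming dataset was read with add_filename=True so its DataFrame carries a file_name column (the output path is hypothetical):

```python
# Write one JSONL file per original source file, dropping the file_name
# column from the output records.
dataset.to_json("output/jsonl/", write_to_filename=True, keep_filename_column=False)
```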

to_pandas()#

Creates a pandas DataFrame from the DocumentDataset.

Returns:

A pandas dataframe (on the CPU)
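
A minimal sketch, assuming the dataset is small enough to fit in host memory:

```python
# Collect the distributed DataFrame into a single in-memory pandas DataFrame.
pdf = dataset.to_pandas()
print(pdf.head())
```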

to_parquet(
output_path: str,
write_to_filename: bool | str = False,
keep_filename_column: bool = False,
partition_on: str | None = None,
)#

Writes the dataset to the specified path in Parquet format.

If write_to_filename is True, the DataFrame is expected to have a column that specifies the filename for each document. By default this column is named file_name; if write_to_filename is a string, that string is used as the column name instead.

Parameters:
  • output_path (str) – The directory or file path where the dataset will be written.

  • write_to_filename (Union[bool, str]) – Determines how filenames are handled. If True, uses the file_name column in the DataFrame to determine filenames. If a string, uses that string as the column name for filenames. If False, writes all data to the specified output_path.

  • keep_filename_column (bool) – If True, retains the filename column in the output. If False, the filename column is dropped from the output.

  • partition_on (Optional[str]) – The column name used to partition the data. If specified, data is partitioned based on unique values in this column, with each partition written to a separate directory.

For more details, refer to the write_to_disk function in nemo_curator.utils.distributed_utils.
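
A short sketch of partitioned output; the "language" column is an illustrative assumption about the dataset’s schema:

```python
# Each unique value in the hypothetical "language" column is written
# to its own subdirectory under the output path.
dataset.to_parquet("output/parquet/", partition_on="language")
```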

class nemo_curator.datasets.ParallelDataset(dataset_df: dask.dataframe.DataFrame)#

An extension of the standard DocumentDataset with a special method that loads simple bitext.

For data with more complicated metadata, please convert your data into JSONL/Parquet/Pickle format and use the interfaces defined in DocumentDataset.

classmethod read_simple_bitext(
src_input_files: str | List[str],
tgt_input_files: str | List[str],
src_lang: str,
tgt_lang: str,
backend: str = 'pandas',
add_filename: bool | str = False,
npartitions: int = 16,
)#

See the read_single_simple_bitext_file_pair docstring for what “simple bitext” means and for the usage of the other parameters.

Parameters:
  • src_input_files (Union[str, List[str]]) – one or several input files, in source language

  • tgt_input_files (Union[str, List[str]]) – one or several input files, in target language

Raises:

TypeError – If the types of src_input_files and tgt_input_files do not agree.

Returns:

A ParallelDataset object with self.df holding the ingested simple bitext.

Return type:

ParallelDataset
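
A minimal sketch, assuming a hypothetical German-English file pair data.de / data.en that follows the shared-prefix, language-code-extension convention described below:

```python
from nemo_curator.datasets import ParallelDataset

# Hypothetical source/target files with one aligned sentence per line.
bitext = ParallelDataset.read_simple_bitext(
    src_input_files=["data.de"],
    tgt_input_files=["data.en"],
    src_lang="de",
    tgt_lang="en",
    backend="pandas",
)
```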

static read_single_simple_bitext_file_pair(
input_file_pair: Tuple[str],
src_lang: str,
tgt_lang: str,
doc_id: str = None,
backend: str = 'cudf',
add_filename: bool | str = False,
) dask.dataframe.DataFrame | dask_cudf.DataFrame#

This function reads a pair of “simple bitext” files into a DataFrame. A simple bitext is a common data format in machine translation. It consists of two plain text files with the same number of lines, each line pair being translations of each other. For example:

data.de:

```
Wir besitzen keine Reisetaschen aus Leder.
Die Firma produziert Computer für den deutschen Markt.
...
```

data.en:

```
We don't own duffel bags made of leather.
The company produces computers for the German market.
...
```

For simplicity, we also assume that the two text files share the same name prefix, differing only in the language code used as the file extension.

Parameters:
  • input_file_pair (Tuple[str]) – A pair of file paths pointing to the input files

  • src_lang (str) – Source language, in ISO-639-1 (two character) format (e.g. ‘en’)

  • tgt_lang (str) – Target language, in ISO-639-1 (two character) format (e.g. ‘de’)

  • doc_id (str, optional) – A string document id to assign to every segment in the file. Defaults to None.

  • backend (str, optional) – Backend of the data frame. Defaults to “cudf”.

  • add_filename (Union[bool, str]) – Whether to add a filename column to the DataFrame. If True, the new column is named file_name. If a string, that string is used as the column name. Default is False.

Returns:

Union[dd.DataFrame, dask_cudf.DataFrame]

to_bitext(output_file_dir, write_to_filename=False)#

See nemo_curator.utils.distributed_utils.write_to_disk docstring for parameter usage.
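
A minimal sketch, assuming bitext is a ParallelDataset like the one loaded above (the output directory is hypothetical):

```python
# Write the source and target text back out as plain-text bitext files.
bitext.to_bitext("output/bitext/", write_to_filename=False)
```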

ImageTextPairDataset#

class nemo_curator.datasets.ImageTextPairDataset(
path: str,
metadata: dask.dataframe.DataFrame,
tar_files: List[str],
id_col: str,
)#

A collection of image text pairs stored in WebDataset-like format on disk or in cloud storage.

The exact format assumes a single directory with sharded .tar, .parquet, and (optionally) .idx files. Each tar file should have a unique integer ID as its name (00000.tar, 00001.tar, 00002.tar, etc.). The tar files should contain images in .jpg files, text captions in .txt files, and metadata in .json files. Each record of the dataset is identified by a unique ID that is a mix of the shard ID along with the offset of the record within a shard. For example, the 32nd record of the 43rd shard would be in 00042.tar and have image 000420031.jpg, caption 000420031.txt, and metadata 000420031.json (assuming zero indexing).

In addition to the collection of tar files, ImageTextPairDataset expects there to be .parquet files in the root directory that follow the same naming convention as the shards (00042.tar -> 00042.parquet). Each Parquet file should contain an aggregated tabular form of the metadata for each record, with each row in the Parquet file corresponding to a record in that shard. The metadata, both in the Parquet files and the JSON files, must contain a unique ID column that is the same as its record ID (000420031 in our examples).

Index files may also be in the directory to speed up dataloading with DALI. The index files must be generated by DALI’s wds2idx tool. See https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/data_loading/dataloading_webdataset.html#Creating-an-index for more information. Each index file must follow the same naming convention as the tar files (00042.tar -> 00042.idx).

classmethod from_webdataset(path: str, id_col: str)#

Loads an ImageTextPairDataset from a WebDataset

Parameters:
  • path (str) – The path to the WebDataset-like format on disk or cloud storage.

  • id_col (str) – The column storing the unique identifier for each record.
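
A minimal sketch; the path and the "key" ID column name are illustrative assumptions:

```python
from nemo_curator.datasets import ImageTextPairDataset

# Hypothetical shard directory containing matching 00000.tar / 00000.parquet files.
image_dataset = ImageTextPairDataset.from_webdataset(
    path="datasets/image_text_shards/",
    id_col="key",  # assumed name of the unique ID column in the metadata
)
```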

save_metadata(
path: str | None = None,
columns: List[str] | None = None,
) None#

Saves the metadata of the dataset to the specified path as a collection of Parquet files.

Parameters:
  • path (Optional[str]) – The path to save the metadata to. If None, writes to the original path.

  • columns (Optional[List[str]]) – If specified, only saves a subset of columns.
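
A short sketch; the output path and column names are illustrative assumptions:

```python
# Save only the ID and caption columns of the metadata as Parquet files.
image_dataset.save_metadata(
    path="datasets/metadata_only/",
    columns=["key", "caption"],
)
```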

to_webdataset(
path: str,
filter_column: str,
samples_per_shard: int = 10000,
max_shards: int = 5,
old_id_col: str | None = None,
) None#

Saves the dataset to a WebDataset format with Parquet files. Will reshard the tar files to the specified number of samples per shard. The ID value in ImageTextPairDataset.id_col will be overwritten with a new ID.

Parameters:
  • path (str) – The output path where the dataset should be written.

  • filter_column (str) – A column of booleans. All samples with a value of True in this column will be included in the output. Otherwise, the sample will be omitted.

  • samples_per_shard (int) – The number of samples to include in each tar file.

  • max_shards (int) – The order of magnitude of the maximum number of shards that will be created from the dataset. Will be used to determine the number of leading zeros in the shard/sample IDs.

  • old_id_col (Optional[str]) – If specified, will preserve the previous ID value in the given column.
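
A short sketch, assuming the metadata carries a hypothetical boolean column named "passes_filter" produced by an earlier filtering step:

```python
# Keep only the records marked True in "passes_filter" and reshard the
# surviving samples into tar files of 10,000 samples each.
image_dataset.to_webdataset(
    path="datasets/filtered_shards/",
    filter_column="passes_filter",
    samples_per_shard=10000,
    max_shards=5,
)
```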