datasets.parallel_dataset#

Module Contents#

Classes#

ParallelDataset

An extension of the standard DocumentDataset with a special method that loads simple bitext.

API#

class datasets.parallel_dataset.ParallelDataset(dataset_df: dask.dataframe.DataFrame)#

Bases: nemo_curator.datasets.doc_dataset.DocumentDataset

An extension of the standard DocumentDataset with a special method that loads simple bitext.

For data with more complicated metadata, please convert your data into jsonl/parquet/pickle format and use interfaces defined in DocumentDataset.

Initialization

persist() → datasets.parallel_dataset.ParallelDataset#
classmethod read_simple_bitext(
src_input_files: str | list[str],
tgt_input_files: str | list[str],
src_lang: str,
tgt_lang: str,
backend: str = 'pandas',
add_filename: bool | str = False,
npartitions: int = 16,
) → datasets.parallel_dataset.ParallelDataset#

See read_single_simple_bitext_file_pair docstring for what “simple_bitext” means and usage of other parameters.

Args:
    src_input_files (Union[str, List[str]]): one or several input files, in source language
    tgt_input_files (Union[str, List[str]]): one or several input files, in target language

Raises: TypeError: If the types of src_input_files and tgt_input_files don’t agree.

Returns: ParallelDataset: A ParallelDataset object with self.df holding the ingested simple bitext.
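The type-agreement rule behind the TypeError above can be sketched in plain Python. This is a minimal illustration, not the library's code: normalize_file_args is a hypothetical helper, and the length check on the two lists is an additional assumption that the page does not state.

```python
def normalize_file_args(src_input_files, tgt_input_files):
    """Hypothetical helper: coerce both arguments to parallel lists of paths."""
    # Two single paths: wrap each one in a list.
    if isinstance(src_input_files, str) and isinstance(tgt_input_files, str):
        return [src_input_files], [tgt_input_files]
    # Two lists: they are paired position by position.
    if isinstance(src_input_files, list) and isinstance(tgt_input_files, list):
        # Assumption (not stated in the docs): both lists have the same length.
        if len(src_input_files) != len(tgt_input_files):
            raise ValueError("source and target file lists differ in length")
        return list(src_input_files), list(tgt_input_files)
    # Mixed str/list input is the case the documented TypeError describes.
    raise TypeError(
        "src_input_files and tgt_input_files must both be str or both be list[str]"
    )


single = normalize_file_args("data.de", "data.en")
several = normalize_file_args(["a.de", "b.de"], ["a.en", "b.en"])
```

Passing a str for one side and a list for the other, for example normalize_file_args("data.de", ["data.en"]), raises TypeError in this sketch.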

static read_single_simple_bitext_file_pair(
input_file_pair: tuple[str],
src_lang: str,
tgt_lang: str,
doc_id: str | None = None,
backend: str = 'cudf',
add_filename: bool | str = False,
) → dask.dataframe.DataFrame | dask_cudf.DataFrame#

This function reads a pair of “simple bitext” files into a DataFrame of the type selected by backend. A simple bitext is a common data format in machine translation. It consists of two plain text files with the same number of lines, where the lines at the same position in the two files are translations of each other. For example:

data.de:

Wir besitzen keine Reisetaschen aus Leder.
Die Firma produziert Computer für den deutschen Markt.
...

data.en:

We don't own duffel bags made of leather.
The company produces computers for the German market.
...

For simplicity, we also assume that the names of the two text files share the same prefix and differ only in the language code used as the file extension.

Args:
    input_file_pair (Tuple[str]): A pair of file paths pointing to the input files
    src_lang (str): Source language, in ISO-639-1 (two character) format (e.g. ‘en’)
    tgt_lang (str): Target language, in ISO-639-1 (two character) format (e.g. ‘en’)
    doc_id (str, optional): A string document id to assign to every segment in the file. Defaults to None.
    backend (str, optional): Backend of the data frame. Defaults to “cudf”.
    add_filename (Union[bool, str]): Whether to add a filename column to the DataFrame. If True, a new column is added to the DataFrame called file_name. If str, sets new column name. Default is False.

Returns: Union[dd.DataFrame, dask_cudf.DataFrame]
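The “simple bitext” conventions described above (equal line counts, a shared file-name prefix, language codes as extensions) can be sketched with the standard library alone. This is an illustrative reader under stated assumptions, not the library's implementation: read_bitext_pair and the record keys "src" and "tgt" are hypothetical names chosen for this example.

```python
import os
import tempfile


def read_bitext_pair(src_path, tgt_path, src_lang, tgt_lang):
    """Illustrative reader: return one record per aligned line pair."""
    src_prefix, src_ext = os.path.splitext(src_path)
    tgt_prefix, tgt_ext = os.path.splitext(tgt_path)
    # The two file names must share a prefix ...
    if src_prefix != tgt_prefix:
        raise ValueError("file names must share the same prefix")
    # ... and end in their respective language codes.
    if (src_ext, tgt_ext) != ("." + src_lang, "." + tgt_lang):
        raise ValueError("file extensions must match the language codes")
    with open(src_path, encoding="utf-8") as f:
        src_lines = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = [line.rstrip("\n") for line in f]
    # Line i of the source file is the translation of line i of the target file.
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files differ in line count")
    # "src" and "tgt" are assumed key names, chosen for this sketch.
    return [{"src": s, "tgt": t} for s, t in zip(src_lines, tgt_lines)]


# Recreate the data.de / data.en example in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    de_path = os.path.join(tmp, "data.de")
    en_path = os.path.join(tmp, "data.en")
    with open(de_path, "w", encoding="utf-8") as f:
        f.write("Wir besitzen keine Reisetaschen aus Leder.\n")
        f.write("Die Firma produziert Computer für den deutschen Markt.\n")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("We don't own duffel bags made of leather.\n")
        f.write("The company produces computers for the German market.\n")
    pairs = read_bitext_pair(de_path, en_path, "de", "en")
```

Each element of pairs holds one aligned segment, so a mismatch in line counts or file names is caught before any misaligned data is produced.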

to_bitext(
output_file_dir: str,
write_to_filename: bool | str = False,
) → None#

See nemo_curator.utils.distributed_utils.write_to_disk docstring for parameter usage.
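Conceptually, writing bitext is the inverse of reading it: aligned segments go to two plain text files, one line per segment. The sketch below illustrates only that idea with the standard library. write_bitext, its prefix argument, and the output naming scheme are assumptions made for this example and do not describe how to_bitext or write_to_disk behave.

```python
import os
import tempfile


def write_bitext(pairs, output_dir, prefix, src_lang, tgt_lang):
    """Hypothetical writer: one file per language, one segment per line."""
    os.makedirs(output_dir, exist_ok=True)
    # Assumed naming scheme: <prefix>.<language code>
    src_path = os.path.join(output_dir, f"{prefix}.{src_lang}")
    tgt_path = os.path.join(output_dir, f"{prefix}.{tgt_lang}")
    with open(src_path, "w", encoding="utf-8") as src_f, \
            open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for src_text, tgt_text in pairs:
            # An embedded newline would break the line-by-line alignment.
            if "\n" in src_text or "\n" in tgt_text:
                raise ValueError("segments must not contain newlines")
            src_f.write(src_text + "\n")
            tgt_f.write(tgt_text + "\n")
    return src_path, tgt_path


segments = [
    ("Wir besitzen keine Reisetaschen aus Leder.",
     "We don't own duffel bags made of leather."),
    ("Die Firma produziert Computer für den deutschen Markt.",
     "The company produces computers for the German market."),
]

with tempfile.TemporaryDirectory() as tmp:
    de_path, en_path = write_bitext(segments, tmp, "data", "de", "en")
    # Read the files back while the temporary directory still exists.
    with open(de_path, encoding="utf-8") as f:
        written_de = f.read().splitlines()
    with open(en_path, encoding="utf-8") as f:
        written_en = f.read().splitlines()
```

Rejecting segments that contain newlines keeps the two output files at the same number of lines, which is the property a simple bitext depends on.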