datasets.parallel_dataset
Module Contents
Classes
ParallelDataset | An extension of the standard DocumentDataset with a special method that loads simple bitext.
API
- class datasets.parallel_dataset.ParallelDataset(dataset_df: dask.dataframe.DataFrame)
Bases: nemo_curator.datasets.doc_dataset.DocumentDataset
An extension of the standard DocumentDataset with a special method that loads simple bitext. For data with more complicated metadata, please convert your data into jsonl/parquet/pickle format and use the interfaces defined in DocumentDataset.
Initialization
- persist() → datasets.parallel_dataset.ParallelDataset
- classmethod read_simple_bitext(
- src_input_files: str | list[str],
- tgt_input_files: str | list[str],
- src_lang: str,
- tgt_lang: str,
- backend: str = 'pandas',
- add_filename: bool | str = False,
- npartitions: int = 16,
See the read_single_simple_bitext_file_pair docstring for what “simple_bitext” means and for usage of the other parameters.
Args:
    src_input_files (Union[str, List[str]]): One or several input files in the source language.
    tgt_input_files (Union[str, List[str]]): One or several input files in the target language.
Raises:
    TypeError: If the types of src_input_files and tgt_input_files do not agree.
Returns:
    ParallelDataset: A ParallelDataset object with self.df holding the ingested simple bitext.
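The type-agreement rule described under Raises can be illustrated with a small standalone sketch. This is not the library's code: the helper name pair_bitext_files is hypothetical, and the length check (with its ValueError) is an added assumption that the documentation above does not state.

```python
def pair_bitext_files(src_input_files, tgt_input_files):
    """Pair source and target paths (hypothetical helper, not part of nemo_curator)."""
    if isinstance(src_input_files, str) and isinstance(tgt_input_files, str):
        # A single file pair: wrap both so the rest of the code handles lists only.
        src_input_files, tgt_input_files = [src_input_files], [tgt_input_files]
    elif not (isinstance(src_input_files, list) and isinstance(tgt_input_files, list)):
        # One argument is a str and the other a list (or something else entirely).
        raise TypeError(
            "src_input_files and tgt_input_files must both be str or both be list"
        )

    # Assumption: each source file needs exactly one target counterpart.
    if len(src_input_files) != len(tgt_input_files):
        raise ValueError("source and target file lists differ in length")

    return list(zip(src_input_files, tgt_input_files))
```

Each resulting tuple has the shape that read_single_simple_bitext_file_pair expects for its input_file_pair argument.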
- static read_single_simple_bitext_file_pair(
- input_file_pair: tuple[str],
- src_lang: str,
- tgt_lang: str,
- doc_id: str | None = None,
- backend: str = 'cudf',
- add_filename: bool | str = False,
This function reads a pair of “simple bitext” files into a pandas DataFrame. A simple bitext is a common data format in machine translation. It consists of two plain text files with the same number of lines, where each pair of corresponding lines are translations of each other. For example:
data.de:
Wir besitzen keine Reisetaschen aus Leder.
Die Firma produziert Computer für den deutschen Markt.
...
data.en:
We don't own duffel bags made of leather.
The company produces computers for the German market.
...
For simplicity, we also assume that the names of the two text files share the same prefix and differ only in the language code used as the file extension.
Args:
    input_file_pair (Tuple[str]): A pair of file paths pointing to the input files.
    src_lang (str): Source language, in ISO-639-1 (two-character) format (e.g. ‘en’).
    tgt_lang (str): Target language, in ISO-639-1 (two-character) format (e.g. ‘en’).
    doc_id (str, optional): A string document ID to assign to every segment in the file. Defaults to None.
    backend (str, optional): Backend of the DataFrame. Defaults to “cudf”.
    add_filename (Union[bool, str]): Whether to add a filename column to the DataFrame. If True, a new column called file_name is added to the DataFrame. If a str, it sets the new column name. Defaults to False.
Returns:
    Union[dd.DataFrame, dask_cudf.DataFrame]
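To make the “simple bitext” layout concrete, here is a minimal sketch that loads one file pair with plain pandas. It is not the library's implementation: the function name read_bitext_pair and the column names src, tgt, src_lang, tgt_lang and doc_id are assumptions chosen for illustration, and the line-count check is added here rather than taken from the documentation.

```python
import pandas as pd


def read_bitext_pair(src_path, tgt_path, src_lang, tgt_lang, doc_id=None):
    """Load two line-aligned text files into one DataFrame (illustrative only)."""
    # Strip only the trailing newline so other whitespace in a segment survives.
    with open(src_path, encoding="utf-8") as f:
        src_lines = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = [line.rstrip("\n") for line in f]

    # A simple bitext is only meaningful if line i of each file is a translation pair.
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files have different line counts")

    # Column names below are assumptions, not taken from the documentation.
    df = pd.DataFrame({"src": src_lines, "tgt": tgt_lines})
    df["src_lang"] = src_lang
    df["tgt_lang"] = tgt_lang
    if doc_id is not None:
        # The same document ID is assigned to every segment in the file.
        df["doc_id"] = doc_id
    return df
```

Calling read_bitext_pair("data.de", "data.en", "de", "en") on the example files above would yield one row per sentence pair.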
- to_bitext(
- output_file_dir: str,
- write_to_filename: bool | str = False,
See the nemo_curator.utils.distributed_utils.write_to_disk docstring for parameter usage.
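The reverse direction, writing aligned segments back out as two plain text files, can be sketched as follows. This only illustrates the file layout and makes no claim about how to_bitext or write_to_disk work internally; the function name write_bitext, its prefix parameter, and the prefix.lang naming scheme are assumptions for this sketch.

```python
import os


def write_bitext(pairs, output_dir, prefix, src_lang, tgt_lang):
    """Write (source, target) segment pairs to two line-aligned files (illustrative only)."""
    os.makedirs(output_dir, exist_ok=True)
    # Assumed naming: shared prefix, language code as the file extension.
    src_path = os.path.join(output_dir, f"{prefix}.{src_lang}")
    tgt_path = os.path.join(output_dir, f"{prefix}.{tgt_lang}")

    with open(src_path, "w", encoding="utf-8") as src_f, \
            open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for src, tgt in pairs:
            # One segment per line keeps line i of both files aligned.
            src_f.write(f"{src}\n")
            tgt_f.write(f"{tgt}\n")

    return src_path, tgt_path
```

With a DataFrame that has source and target text columns (names assumed here), the pairs argument could be built as zip(df["src"], df["tgt"]).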