modules.joiner

Module Contents

Classes

DocumentJoiner

Joins documents that have a common id back into a single document. The order of the documents is dictated by an additional segment_id column. A maximum length can be specified to limit the size of the joined documents.

API

class modules.joiner.DocumentJoiner(
separator: str,
text_field: str = 'text',
segment_id_field: str = 'segment_id',
document_id_field: str = 'id',
drop_segment_id_field: bool = True,
max_length: int | None = None,
length_field: str | None = None,
)

Bases: nemo_curator.modules.base.BaseModule

Joins documents that have a common id back into a single document. The order of the documents is dictated by an additional segment_id column. A maximum length can be specified to limit the size of the joined documents.

The segments of each document are concatenated using the given separator.
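To make the joining behavior concrete, here is a minimal pure-Python sketch that works on a list of dicts rather than a DocumentDataset. It is not the library's implementation: `join_segments` is a hypothetical helper, and taking the non-text fields from the first segment of each document is an assumption made only for illustration.

```python
from collections import defaultdict


def join_segments(rows, separator, text_field="text",
                  segment_id_field="segment_id", document_id_field="id",
                  drop_segment_id_field=True):
    """Hypothetical helper: join segment rows that share a document id."""
    # Group the rows by document id, keeping first-seen order of the ids.
    groups = defaultdict(list)
    for row in rows:
        groups[row[document_id_field]].append(row)

    joined = []
    for segments in groups.values():
        # The segment id dictates the order of the pieces.
        segments = sorted(segments, key=lambda r: r[segment_id_field])
        # Assumption: all other fields are copied from the first segment.
        doc = dict(segments[0])
        doc[text_field] = separator.join(s[text_field] for s in segments)
        if drop_segment_id_field:
            del doc[segment_id_field]
        joined.append(doc)
    return joined


rows = [
    {"id": 1, "segment_id": 1, "text": "world", "source": "a"},
    {"id": 2, "segment_id": 0, "text": "solo", "source": "b"},
    {"id": 1, "segment_id": 0, "text": "hello", "source": "a"},
]
result = join_segments(rows, separator=" ")
```

In this sketch the two rows with id 1 collapse into one row whose text is "hello world", while the extra `source` field is carried through unchanged.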

Initialization

Args:
separator (str): The separator to join the documents on.
text_field (str): The name of the column containing the text to join.
segment_id_field (str): The name of the column containing the segment id.
document_id_field (str): The name of the column containing the document id.
drop_segment_id_field (bool): Whether to drop the segment_id_field after joining.
max_length (int, optional): The maximum length of the joined documents. max_length and length_field must either both be specified or both be omitted.
length_field (str, optional): The name of the column containing the length of the documents. max_length and length_field must either both be specified or both be omitted.
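The interaction between max_length and length_field can be illustrated with a small stand-alone sketch. The page does not describe how the limit is enforced, so everything below is an assumption: the hypothetical `pack_segments` helper greedily appends already-ordered segments, counts the separator toward the length, and starts a new joined document whenever the next segment would exceed max_length. The hypothetical `check_length_args` helper mirrors the both-or-neither rule from the argument descriptions.

```python
def check_length_args(max_length, length_field):
    """Both-or-neither rule: reject exactly one of the two being set."""
    if (max_length is None) != (length_field is None):
        raise ValueError(
            "max_length and length_field must both be specified or both be omitted"
        )


def pack_segments(segments, separator, max_length,
                  text_field="text", length_field="length"):
    """Hypothetical greedy packing of ordered segments under max_length."""
    check_length_args(max_length, length_field)
    joined = []
    current_text, current_length = None, 0
    for seg in segments:
        if current_text is None:
            # First segment of a new joined document.
            current_text, current_length = seg[text_field], seg[length_field]
            continue
        # Assumption: the separator counts toward the joined length.
        proposed = current_length + len(separator) + seg[length_field]
        if proposed <= max_length:
            current_text += separator + seg[text_field]
            current_length = proposed
        else:
            # Limit would be exceeded: emit the current document, start a new one.
            joined.append({text_field: current_text, length_field: current_length})
            current_text, current_length = seg[text_field], seg[length_field]
    if current_text is not None:
        joined.append({text_field: current_text, length_field: current_length})
    return joined


segments = [
    {"text": "aaaa", "length": 4},
    {"text": "bbb", "length": 3},
    {"text": "cc", "length": 2},
]
packed = pack_segments(segments, separator="|", max_length=8)
```

Here the first two segments fit within the limit (4 + 1 + 3 = 8), while adding the third would give 11, so it becomes its own joined document.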

call(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset

Joins the documents back into a single document while preserving all the original fields.