modules.splitter

Module Contents

Classes

DocumentSplitter

Splits documents into segments based on a separator. Each segment is a new document with an additional column indicating the segment id.

API

class modules.splitter.DocumentSplitter(
separator: str,
text_field: str = 'text',
segment_id_field: str = 'segment_id',
)

Bases: nemo_curator.modules.base.BaseModule

Splits documents into segments based on a separator. Each segment is a new document with an additional column indicating the segment id.

Segments can be recombined into the original documents only if each document has a unique id before splitting, so assign ids first if the dataset lacks them.
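The need for a unique id can be illustrated with a minimal, library-independent sketch. It assumes segments are plain dictionaries, that the id column is named `id`, and that segment ids reflect the original order; `join_segments` is a hypothetical helper written for illustration, not part of this module.

```python
from collections import defaultdict


def join_segments(segments, separator, id_field="id",
                  text_field="text", segment_id_field="segment_id"):
    """Hypothetical helper: rebuild documents from their segments.

    Assumes every segment carries the unique id of its source document
    and a segment id that reflects its position in that document.
    """
    grouped = defaultdict(list)
    for segment in segments:
        grouped[segment[id_field]].append(segment)

    documents = []
    for doc_id, parts in grouped.items():
        # Restore the original order before gluing the text back together.
        parts.sort(key=lambda s: s[segment_id_field])
        text = separator.join(p[text_field] for p in parts)
        documents.append({id_field: doc_id, text_field: text})
    return documents


# Segments may arrive in any order; the id tells us which document they belong to.
shuffled = [
    {"id": "doc-0", "text": "second", "segment_id": 1},
    {"id": "doc-1", "text": "only one", "segment_id": 0},
    {"id": "doc-0", "text": "first", "segment_id": 0},
]
restored = join_segments(shuffled, separator="\n\n")
```

Without the `id` column, the two segments with `segment_id` 0 could not be attributed to their source documents.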

Initialization

Args:
separator (str): The separator to split the documents on.
text_field (str): The name of the column containing the text to split.
segment_id_field (str): The name of the column to add to indicate the segment id.

call(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset

Splits the documents into segments based on the separator and adds a column indicating the segment id.
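The behaviour described above can be sketched in plain Python. This is an illustrative stand-in, not the library's implementation: it works on a list of dictionaries rather than a `DocumentDataset`, `split_documents` is a hypothetical name, and numbering segments from 0 within each document is an assumption.

```python
def split_documents(records, separator, text_field="text",
                    segment_id_field="segment_id"):
    """Hypothetical helper mirroring the documented behaviour.

    Each input record yields one output record per segment. Other columns
    are copied unchanged, and a segment id column is added.
    """
    segments = []
    for record in records:
        pieces = record[text_field].split(separator)
        # Assumption: segment ids restart at 0 for every document.
        for segment_id, piece in enumerate(pieces):
            new_record = dict(record)  # keep the remaining columns, e.g. the id
            new_record[text_field] = piece
            new_record[segment_id_field] = segment_id
            segments.append(new_record)
    return segments


docs = [
    {"id": "doc-0", "text": "first\n\nsecond"},
    {"id": "doc-1", "text": "only one"},
]
segments = split_documents(docs, separator="\n\n")
```

A document that does not contain the separator still produces a single segment, so the output always has at least as many rows as the input.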