modules.splitter
Module Contents
Classes
DocumentSplitter: Splits documents into segments based on a separator. Each segment is a new document with an additional column indicating the segment id.
API
- class modules.splitter.DocumentSplitter(
- separator: str,
- text_field: str = 'text',
- segment_id_field: str = 'segment_id',
- )
Bases: nemo_curator.modules.base.BaseModule
Splits documents into segments based on a separator. Each segment is a new document with an additional column indicating the segment id.
To restore the original document, ensure that each document has a unique id prior to splitting.
Initialization
Args:
- separator (str): The separator to split the documents on.
- text_field (str): The name of the column containing the text to split.
- segment_id_field (str): The name of the column to add to indicate the segment id.
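The effect of these arguments can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper `split_documents` is hypothetical, documents are modeled as plain dictionaries rather than a DocumentDataset, and zero-based segment ids are an assumption the description above does not confirm.

```python
def split_documents(docs, separator, text_field="text", segment_id_field="segment_id"):
    """Hypothetical helper: emit one record per segment, copying all other fields."""
    segments = []
    for doc in docs:
        # str.split on the separator yields the segments in their original order.
        for segment_id, piece in enumerate(doc[text_field].split(separator)):
            segment = dict(doc)  # shallow copy so the input record is not mutated
            segment[text_field] = piece
            segment[segment_id_field] = segment_id  # zero-based ids are an assumption
            segments.append(segment)
    return segments


docs = [
    {"id": "doc-0", "text": "First part.\n\nSecond part."},
    {"id": "doc-1", "text": "Only one part."},
]
segments = split_documents(docs, separator="\n\n")
for segment in segments:
    print(segment)
```

Each output record keeps the original `id` value, which is what later makes it possible to tell which document a segment came from.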
- call(
- dataset: nemo_curator.datasets.DocumentDataset,
- )
Splits the documents into segments based on the separator and adds a column indicating the segment id.
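The note about unique ids can be made concrete with a sketch of the reverse step. The helper `join_segments` and the `id` column name are hypothetical, and segments are again modeled as plain dictionaries. The sketch assumes that segment ids increase in the original segment order, and it carries over only the id and text fields.

```python
from collections import defaultdict


def join_segments(segments, separator, id_field="id",
                  text_field="text", segment_id_field="segment_id"):
    """Hypothetical helper: rebuild one record per original document id."""
    grouped = defaultdict(list)
    for segment in segments:
        grouped[segment[id_field]].append(segment)

    restored = []
    for doc_id, parts in grouped.items():
        # Sort by segment id so the pieces are joined in their original order.
        parts.sort(key=lambda s: s[segment_id_field])
        text = separator.join(part[text_field] for part in parts)
        restored.append({id_field: doc_id, text_field: text})
    return restored


# Segments may arrive in any order, for example after a shuffle or filter step.
segments = [
    {"id": "doc-0", "text": "Second part.", "segment_id": 1},
    {"id": "doc-1", "text": "Only one part.", "segment_id": 0},
    {"id": "doc-0", "text": "First part.", "segment_id": 0},
]
restored = join_segments(segments, separator="\n\n")
for doc in restored:
    print(doc)
```

Without a unique id per document, segments from different documents would share the same segment ids and could not be grouped back together.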