modules.splitter
Module Contents
Classes
| DocumentSplitter | Splits documents into segments based on a separator. Each segment is a new document with an additional column indicating the segment id. |
API
- class modules.splitter.DocumentSplitter(separator: str, text_field: str = 'text', segment_id_field: str = 'segment_id')

  Bases: nemo_curator.modules.base.BaseModule

  Splits documents into segments based on a separator. Each segment is a new document with an additional column indicating the segment id.

  To restore the original document, ensure that each document has a unique id prior to splitting.

  Initialization

  Args:
  - separator (str): The separator to split the documents on.
  - text_field (str): The name of the column containing the text to split.
  - segment_id_field (str): The name of the column to add to indicate the segment id.

- call(dataset: nemo_curator.datasets.DocumentDataset)

  Splits the documents into segments based on the separator and adds a column indicating the segment id.
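The splitting behavior described above, and the reason a unique id is needed to restore the original documents, can be illustrated with a minimal pure-Python sketch. This is not the NeMo Curator implementation: the helpers `split_documents` and `join_documents`, the `id` column, and zero-based segment ids are all assumptions made for illustration, and plain dictionaries stand in for rows of a `DocumentDataset`.

```python
from typing import Dict, List


def split_documents(
    docs: List[Dict],
    separator: str,
    text_field: str = "text",
    segment_id_field: str = "segment_id",
) -> List[Dict]:
    """Hypothetical helper: emit one new record per segment of each document."""
    segments = []
    for doc in docs:
        # Assumption: segment ids are numbered from 0 within each document.
        for segment_id, piece in enumerate(doc[text_field].split(separator)):
            segment = dict(doc)  # copy all other columns unchanged
            segment[text_field] = piece
            segment[segment_id_field] = segment_id
            segments.append(segment)
    return segments


def join_documents(
    segments: List[Dict],
    separator: str,
    id_field: str = "id",  # assumed name of the unique document id column
    text_field: str = "text",
    segment_id_field: str = "segment_id",
) -> List[Dict]:
    """Hypothetical helper: rebuild documents by grouping segments on the unique id."""
    grouped: Dict = {}
    for segment in segments:
        grouped.setdefault(segment[id_field], []).append(segment)
    restored = []
    for doc_id, parts in grouped.items():
        # Order the segments by their segment id before joining them back.
        parts.sort(key=lambda s: s[segment_id_field])
        text = separator.join(s[text_field] for s in parts)
        restored.append({id_field: doc_id, text_field: text})
    return restored


docs = [
    {"id": 1, "text": "first paragraph\n\nsecond paragraph"},
    {"id": 2, "text": "single paragraph"},
]
segments = split_documents(docs, separator="\n\n")
restored = join_documents(segments, separator="\n\n")
```

Without the `id` column, segments from different documents could not be told apart after splitting, which is why the unique id must exist beforehand.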