nemo_curator.stages.text.modules.splitter
Module Contents
Classes
API
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Splits documents into segments on a separator. Each segment becomes a new row in the batch, with an additional column holding the segment id.
To be able to reassemble the original documents after splitting, make sure each document has a unique id before splitting.
Parameters:
The separator to split the documents on.
The name of the column containing the text to split. Defaults to “text”.
The name of the column to add to indicate the segment id. Defaults to “segment_id”.
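The splitting behavior described above can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: the helper `split_rows`, the argument names `text_field` and `segment_id_field`, the `id` column, and the choice to number segments from 0 are all assumptions made for illustration, and rows are modeled as plain dictionaries rather than a `DocumentBatch`.

```python
def split_rows(rows, separator, text_field="text", segment_id_field="segment_id"):
    """Hypothetical helper: emit one output row per segment of each input row."""
    out = []
    for row in rows:
        # str.split returns the pieces between occurrences of the separator.
        for i, segment in enumerate(row[text_field].split(separator)):
            new_row = dict(row)              # copy every other column unchanged
            new_row[text_field] = segment    # replace the text with this segment
            new_row[segment_id_field] = i    # assumed: segment ids start at 0
            out.append(new_row)
    return out


rows = [
    {"id": "doc-0", "text": "first paragraph\n\nsecond paragraph"},
    {"id": "doc-1", "text": "only one paragraph"},
]
segments = split_rows(rows, separator="\n\n")
```

In this sketch the first document yields two rows and the second yields one, and every row keeps the `id` of the document it came from.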
Define stage input requirements.
Define stage output specification.
Splits the documents into segments based on the separator and adds a column indicating the segment id.
Parameters:
The input batch to process.
Returns: DocumentBatch
The batch with documents split into segments.
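The reason for requiring a unique id per document becomes clearer when reversing the split. The sketch below is one possible way to do this, not an API provided by this module: the helper `join_segments` and the `id` column are hypothetical, and rows are again plain dictionaries. It groups rows by document id, orders them by segment id, and joins the text with the same separator.

```python
from collections import defaultdict


def join_segments(rows, separator, id_field="id",
                  text_field="text", segment_id_field="segment_id"):
    """Hypothetical helper: rebuild one row per document from its segments."""
    grouped = defaultdict(list)
    for row in rows:
        # Without a unique id there would be no way to tell which
        # segments belong to the same original document.
        grouped[row[id_field]].append(row)

    restored = []
    for doc_id, parts in grouped.items():
        parts.sort(key=lambda r: r[segment_id_field])  # restore original order
        text = separator.join(p[text_field] for p in parts)
        restored.append({id_field: doc_id, text_field: text})
    return restored


# Segments may arrive in any order; the segment id recovers the sequence.
segments = [
    {"id": "doc-0", "text": "second paragraph", "segment_id": 1},
    {"id": "doc-1", "text": "only one paragraph", "segment_id": 0},
    {"id": "doc-0", "text": "first paragraph", "segment_id": 0},
]
documents = join_segments(segments, separator="\n\n")
```

If two documents shared an id, their segments would be merged into a single row here, which is why the ids need to be unique before splitting.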