stages.text.modules.splitter
Module Contents
Classes
| DocumentSplitter | Splits documents into segments based on a separator. Each segment becomes a new row within the batch with an additional column indicating the segment id. |
API
- class stages.text.modules.splitter.DocumentSplitter
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]
Splits documents into segments based on a separator. Each segment becomes a new row within the batch with an additional column indicating the segment id.
To restore the original document, ensure that each document has a unique id prior to splitting.
Example: If a document has text="Hello\n\nWorld" and separator="\n\n", it is split into two rows: one with text="Hello" and segment_id=0, and another with text="World" and segment_id=1.
Args:
    separator (str): The separator to split the documents on.
    text_field (str): The name of the column containing the text to split. Defaults to "text".
    segment_id_field (str): The name of the column to add to indicate the segment id. Defaults to "segment_id".
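The splitting behavior described above, and the reason a unique id is needed to restore documents, can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the real stage operates on a DocumentBatch, while this sketch uses a plain list of dicts as a stand-in for rows. The helper names `split_rows` and `join_rows` and the `id` column are assumptions made for illustration only.

```python
def split_rows(rows, separator, text_field="text", segment_id_field="segment_id"):
    """Hypothetical helper: emit one row per segment, copying all other columns."""
    out = []
    for row in rows:
        segments = row[text_field].split(separator)
        for segment_id, segment in enumerate(segments):
            new_row = dict(row)                    # keep the other columns (e.g. the id)
            new_row[text_field] = segment
            new_row[segment_id_field] = segment_id
            out.append(new_row)
    return out


def join_rows(rows, separator, id_field="id", text_field="text",
              segment_id_field="segment_id"):
    """Hypothetical helper: rebuild documents by grouping on a unique id column."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[id_field], []).append(row)
    restored = []
    for doc_id, parts in grouped.items():
        # Order segments by their segment id before joining them back together.
        ordered = sorted(parts, key=lambda r: r[segment_id_field])
        text = separator.join(r[text_field] for r in ordered)
        restored.append({id_field: doc_id, text_field: text})
    return restored


docs = [
    {"id": "a", "text": "Hello\n\nWorld"},
    {"id": "b", "text": "Solo"},
]
segments = split_rows(docs, "\n\n")
restored = join_rows(segments, "\n\n")
```

Here `segments` holds three rows (two for document "a", one for document "b"), and `restored` matches `docs`. Without the `id` column there would be no way to tell which segments belong to the same original document, since `segment_id` restarts at 0 for every document.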
- inputs() -> tuple[list[str], list[str]]
Define stage input requirements.
- name: str
'document_splitter'
- outputs() -> tuple[list[str], list[str]]
Define stage output specification.
- process(batch: nemo_curator.tasks.DocumentBatch) -> nemo_curator.tasks.DocumentBatch
Splits the documents into segments based on the separator and adds a column indicating the segment id.
Args: batch (DocumentBatch): Input batch to process
Returns: DocumentBatch: Batch with documents split into segments
- segment_id_field: str
'segment_id'
- separator: str
None
- text_field: str
'text'