`stages.text.modules.splitter`#

Module Contents#

Classes#

DocumentSplitter

Splits documents into segments based on a separator. Each segment becomes a new row within the batch with an additional column indicating the segment id.

API#

class stages.text.modules.splitter.DocumentSplitter#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

Splits documents into segments based on a separator. Each segment becomes a new row within the batch with an additional column indicating the segment id.

To restore the original documents later, ensure that each document has a unique id before splitting, so that its segments can be grouped by that id and rejoined in segment id order.

Example: If a document has text="Hello\n\nWorld" and separator="\n\n", it will be split into two rows: one with text="Hello" and segment_id=0, and another with text="World" and segment_id=1.
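The example above can be mirrored in plain Python to make the row-level behavior concrete. This is only a sketch of the documented behavior, not the stage's actual implementation: it models each row as a dict rather than a `DocumentBatch`, and the helper name `split_rows` and the `id` column are hypothetical.

```python
def split_rows(rows, separator, text_field="text", segment_id_field="segment_id"):
    """Emit one output row per segment of each input row's text.

    Hypothetical helper: rows are plain dicts here, not a DocumentBatch.
    """
    out = []
    for row in rows:
        # str.split treats the separator literally (no regex semantics).
        for segment_id, segment in enumerate(row[text_field].split(separator)):
            new_row = dict(row)  # carry over all other columns unchanged
            new_row[text_field] = segment
            new_row[segment_id_field] = segment_id
            out.append(new_row)
    return out


# The "id" column is an assumption; the stage itself does not add one.
rows = [{"id": "doc-0", "text": "Hello\n\nWorld"}]
segments = split_rows(rows, separator="\n\n")
```

In this sketch every other column, including the id, is copied onto each segment row, which is what makes the segments traceable back to their source document.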

Args:

separator (str): The separator to split the documents on.

text_field (str): The name of the column containing the text to split. Defaults to "text".

segment_id_field (str): The name of the column to add to indicate the segment id. Defaults to "segment_id".
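The note on unique ids can be illustrated with a small sketch of the reverse operation: grouping segment rows by document id, ordering them by segment id, and joining them with the original separator. It assumes rows are plain dicts and that the id lives in a column named `id`; both the helper name `join_rows` and that column name are hypothetical and not part of this module.

```python
from collections import defaultdict


def join_rows(rows, separator, id_field="id", text_field="text",
              segment_id_field="segment_id"):
    """Rebuild one row per document from its segment rows.

    Hypothetical helper: `id_field` must uniquely identify each original
    document, otherwise segments of different documents would be merged.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_field]].append(row)

    restored = []
    for doc_id, parts in grouped.items():
        # Segment ids record the original order within a document.
        parts = sorted(parts, key=lambda r: r[segment_id_field])
        text = separator.join(p[text_field] for p in parts)
        restored.append({id_field: doc_id, text_field: text})
    return restored


# Segment rows deliberately out of order to show that sorting matters.
segment_rows = [
    {"id": "doc-0", "text": "World", "segment_id": 1},
    {"id": "doc-0", "text": "Hello", "segment_id": 0},
    {"id": "doc-1", "text": "Single", "segment_id": 0},
]
documents = join_rows(segment_rows, separator="\n\n")
```

Without a unique id there is no column to group on, so segments from different documents could not be told apart after splitting.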

inputs() → tuple[list[str], list[str]]#: Define stage input requirements.

name: str#: 'document_splitter'

outputs() → tuple[list[str], list[str]]#: Define stage output specification.

process( batch: nemo_curator.tasks.DocumentBatch, ) → nemo_curator.tasks.DocumentBatch#

Splits the documents into segments based on the separator and adds a column indicating the segment id.

Args: batch (DocumentBatch): Input batch to process

Returns: DocumentBatch: Batch with documents split into segments

segment_id_field: str#: 'segment_id'

separator: str#: None

text_field: str#: 'text'
