stages.text.modules.splitter

Module Contents

Classes

DocumentSplitter

Splits documents into segments based on a separator. Each segment becomes a new row within the batch with an additional column indicating the segment id.

API

class stages.text.modules.splitter.DocumentSplitter

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

Splits documents into segments based on a separator. Each segment becomes a new row within the batch with an additional column indicating the segment id.

To restore the original documents later, make sure each document has a unique id before splitting, so that the segments belonging to one document can be grouped and rejoined in segment id order.
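As a rough illustration of why the unique id matters, the sketch below rejoins segments with plain pandas. It is not part of `DocumentSplitter`; the `id` column name and the sample rows are assumptions made for the example, and it presumes the batch data is available as a pandas DataFrame.

```python
import pandas as pd

# Hypothetical output of a split: one row per segment.
# The "id" column is an assumed name for the per-document unique id.
segments = pd.DataFrame(
    {
        "id": ["doc-0", "doc-0", "doc-1"],
        "text": ["Hello", "World", "Single segment"],
        "segment_id": [0, 1, 0],
    }
)
separator = "\n\n"  # must match the separator used when splitting

# Order segments within each document, then join them back together.
restored = (
    segments.sort_values(["id", "segment_id"])
    .groupby("id", sort=False)["text"]
    .agg(separator.join)
    .reset_index()
)
```

Without a unique id per document there would be no reliable key to group on, since `segment_id` restarts at 0 for every document.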

Example: If a document has text="Hello\n\nWorld" and separator="\n\n", it will be split into two rows: one with text="Hello" and segment_id=0, and another with text="World" and segment_id=1.
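The behavior described in this example can be approximated with a few lines of pandas. This is a minimal sketch under stated assumptions, not the stage's actual implementation: `split_documents` is a hypothetical helper, and the `id` column is added only to show that other columns are carried along with each segment.

```python
import pandas as pd


def split_documents(df, separator, text_field="text", segment_id_field="segment_id"):
    """Hypothetical pandas-only helper that mimics the documented behavior."""
    out = df.copy()
    # Turn each document's text into a list of segments.
    out[text_field] = out[text_field].map(lambda t: t.split(separator))
    # One row per segment; the other columns are repeated for each segment.
    out = out.explode(text_field)
    # After explode, rows from the same document share an index label,
    # so a per-label running count gives the segment id.
    out[segment_id_field] = out.groupby(level=0).cumcount().to_numpy()
    return out.reset_index(drop=True)


docs = pd.DataFrame({"id": ["doc-0"], "text": ["Hello\n\nWorld"]})
segments = split_documents(docs, separator="\n\n")
```

Here `segments` has two rows, both with `id` equal to `doc-0`, holding the texts `Hello` and `World` with segment ids 0 and 1.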

Args:
separator (str): The separator to split the documents on.
text_field (str): The name of the column containing the text to split. Defaults to "text".
segment_id_field (str): The name of the column to add to indicate the segment id. Defaults to "segment_id".

inputs() -> tuple[list[str], list[str]]

Define stage input requirements.

name: str

'document_splitter'

outputs() -> tuple[list[str], list[str]]

Define stage output specification.

process(batch: nemo_curator.tasks.DocumentBatch) -> nemo_curator.tasks.DocumentBatch

Splits the documents into segments based on the separator and adds a column indicating the segment id.

Args: batch (DocumentBatch): Input batch to process

Returns: DocumentBatch: Batch with documents split into segments

segment_id_field: str

'segment_id'

separator: str

None

text_field: str

'text'