nemo_curator.stages.text.modules.splitter


Module Contents

Classes

Name | Description
DocumentSplitter | Splits documents into segments based on a separator.

API

class nemo_curator.stages.text.modules.splitter.DocumentSplitter(
separator: str,
text_field: str = 'text',
segment_id_field: str = 'segment_id',
name: str = 'document_splitter'
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Splits documents into segments based on a separator. Each segment becomes a new row within the batch with an additional column indicating the segment id.
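The row-level effect can be sketched in plain Python. This only illustrates the behavior described above and is not the library's implementation: rows are modeled as dictionaries rather than a DocumentBatch, the helper `split_rows` and the `id` column are hypothetical, and zero-based segment ids are an assumption.

```python
def split_rows(rows, separator, text_field="text", segment_id_field="segment_id"):
    """Return one output row per segment, copying all other columns unchanged."""
    out = []
    for row in rows:
        # str.split yields the segments in their original order.
        for segment_id, segment in enumerate(row[text_field].split(separator)):
            new_row = dict(row)                      # keep the other columns (e.g. "id")
            new_row[text_field] = segment            # the segment replaces the full text
            new_row[segment_id_field] = segment_id   # assumed zero-based position
            out.append(new_row)
    return out


rows = [
    {"id": "doc-0", "text": "first\n\nsecond"},
    {"id": "doc-1", "text": "only"},
]
segments = split_rows(rows, separator="\n\n")
```

In this sketch a document that does not contain the separator still yields a single row with segment id 0.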

To restore the original document, ensure that each document has a unique id prior to splitting.
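To see why the unique id matters, consider how segments might be rejoined: rows are grouped by the document id, ordered by segment id, and concatenated with the separator. The following is a minimal sketch under stated assumptions, not the library's own restoration logic: the helper `join_segments`, the `id` column name, and the dictionary row model are all hypothetical.

```python
from collections import defaultdict


def join_segments(rows, separator, id_field="id", text_field="text",
                  segment_id_field="segment_id"):
    """Rebuild each document's text from its segments, keyed by document id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_field]].append((row[segment_id_field], row[text_field]))
    return {
        # Sort by segment id so segments are rejoined in their original order.
        doc_id: separator.join(text for _, text in sorted(parts, key=lambda p: p[0]))
        for doc_id, parts in grouped.items()
    }


split = [
    {"id": "doc-0", "text": "second", "segment_id": 1},
    {"id": "doc-1", "text": "only", "segment_id": 0},
    {"id": "doc-0", "text": "first", "segment_id": 0},
]
restored = join_segments(split, separator="\n\n")
```

Without a unique id, segments from different documents that share a segment id could not be told apart, so the grouping step would have nothing to key on.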

Parameters:

separator
str

The separator to split the documents on.

text_field
str. Defaults to 'text'

The name of the column containing the text to split. Defaults to “text”.

segment_id_field
str. Defaults to 'segment_id'

The name of the column to add to indicate the segment id. Defaults to “segment_id”.

name
str = 'document_splitter'
segment_id_field
str = 'segment_id'
separator
str
text_field
str = 'text'

nemo_curator.stages.text.modules.splitter.DocumentSplitter.inputs() -> tuple[list[str], list[str]]

Define stage input requirements.

nemo_curator.stages.text.modules.splitter.DocumentSplitter.outputs() -> tuple[list[str], list[str]]

Define stage output specification.

nemo_curator.stages.text.modules.splitter.DocumentSplitter.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch

Splits the documents into segments based on the separator and adds a column indicating the segment id.

Parameters:

batch
DocumentBatch

Input batch to process

Returns: DocumentBatch

Batch with documents split into segments