nemo_curator.stages.math.modifiers.chunking


Module Contents

Classes

Name                Description
TokenSplitterStage  Token-based text chunking stage that splits long texts into smaller chunks

API

class nemo_curator.stages.math.modifiers.chunking.TokenSplitterStage(
model_name: str,
max_length_tokens: int = 8000,
separator: str = '\n\n',
text_field: str = 'text',
chunk_id_field: str = 'chunk_id',
n_tokens_field: str = 'n_tokens'
)

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Token-based text chunking stage that splits long texts into smaller chunks while preserving paragraph boundaries.
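The core idea can be sketched with plain Python. This is an illustrative stand-in, not the library's implementation: the text is split on `separator`, and whole paragraphs are greedily packed into chunks whose token count stays under `max_length_tokens`, so paragraph boundaries are never cut. A whitespace tokenizer substitutes for the real model tokenizer named by `model_name`.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer; the real stage counts tokens with the model's tokenizer.
    return len(text.split())

def split_into_chunks(
    text: str,
    max_length_tokens: int = 8000,
    separator: str = "\n\n",
) -> list[tuple[int, str, int]]:
    """Return (chunk_id, chunk_text, n_tokens) triples."""
    chunks: list[tuple[int, str, int]] = []
    current: list[str] = []
    current_tokens = 0
    for paragraph in text.split(separator):
        n = count_tokens(paragraph)
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_length_tokens:
            chunks.append((len(chunks), separator.join(current), current_tokens))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append((len(chunks), separator.join(current), current_tokens))
    return chunks
```

Note that a single paragraph longer than `max_length_tokens` still becomes its own chunk here; preserving paragraph boundaries takes precedence over the budget in this sketch.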

name

nemo_curator.stages.math.modifiers.chunking.TokenSplitterStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.math.modifiers.chunking.TokenSplitterStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.math.modifiers.chunking.TokenSplitterStage.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch

Process a batch of documents and split them into token-based chunks.
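At the batch level, each input document fans out into one output row per chunk. The sketch below is hypothetical, using a list of dicts in place of a `DocumentBatch` and a fixed-size word splitter in place of token-aware chunking; only the default field names (`text`, `chunk_id`, `n_tokens`) mirror the constructor.

```python
def process(rows: list[dict], text_field: str = "text",
            chunk_id_field: str = "chunk_id",
            n_tokens_field: str = "n_tokens",
            max_length_tokens: int = 4) -> list[dict]:
    out: list[dict] = []
    for row in rows:
        words = row[text_field].split()
        # Fan out: one output row per chunk, preserving the other columns.
        for i in range(0, len(words), max_length_tokens):
            piece = words[i:i + max_length_tokens]
            out.append({**row,
                        text_field: " ".join(piece),
                        chunk_id_field: i // max_length_tokens,
                        n_tokens_field: len(piece)})
    return out
```

A five-word document with a four-token budget would yield two rows, with `chunk_id` 0 and 1 and `n_tokens` 4 and 1.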

nemo_curator.stages.math.modifiers.chunking.TokenSplitterStage.setup(
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Load tokenizer from local cache per worker.

nemo_curator.stages.math.modifiers.chunking.TokenSplitterStage.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Download model weights to local cache once per physical node.