nemo_curator.stages.math.modifiers.chunking
Module Contents
Classes
API
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Token-based text chunking stage that splits long texts into smaller chunks while preserving paragraph boundaries.
name
Process a batch of documents and split them into token-based chunks.
Load tokenizer from local cache per worker.
Download model weights to local cache once per physical node.
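The chunking behavior described for the class (token-based chunks that preserve paragraph boundaries) can be illustrated with a minimal, library-independent sketch. This is not the stage's actual implementation. The function name `chunk_by_tokens`, the `max_tokens` parameter, the blank-line paragraph separator, and the whitespace token counter are all assumptions made for illustration. A real stage would presumably count tokens with the tokenizer it loads.

```python
from typing import Callable, List


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    # Hypothetical stand-in for a real tokenizer: counts whitespace-separated words.
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens tokens."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk if this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        # A paragraph larger than the budget is kept whole as its own chunk.
        current.append(para)
        current_tokens += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "a b c\n\nd e\n\nf g h i\n\nj"
    for i, chunk in enumerate(chunk_by_tokens(sample, max_tokens=5)):
        print(i, repr(chunk))
```

Because paragraphs are never split in this sketch, a single paragraph longer than `max_tokens` yields an oversized chunk. Whether the actual stage splits such paragraphs further is not stated here.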