nemo_curator.stages.math.modifiers.merge_chunks
Module Contents
Classes
API
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Merges chunked documents back into one row per document.
After LLM cleanup, the pipeline has multiple rows per document (one per chunk). This stage deduplicates, filters invalid chunks, sorts by chunk order, and concatenates text back into a single row per document.
groupby_columns
name
no_content_markers
sum_columns
Merge chunked rows back into one row per document.
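The four steps described above (deduplicate, filter invalid chunks, sort by chunk order, concatenate) can be illustrated with a small standalone pandas sketch. This is not the stage's actual implementation: the function name `merge_chunks_sketch`, the column names `text` and `chunk_id`, the `"NO_CONTENT"` marker value, and the `separator` parameter are all assumptions made for illustration. Only the parameter names `groupby_columns`, `no_content_markers`, and `sum_columns` mirror the attributes listed on this page, and their real semantics may differ.

```python
import pandas as pd


def merge_chunks_sketch(
    df: pd.DataFrame,
    groupby_columns: list,
    text_column: str = "text",            # assumed column name
    chunk_id_column: str = "chunk_id",    # assumed column name
    no_content_markers: tuple = ("NO_CONTENT",),  # assumed marker value
    sum_columns: tuple = (),
    separator: str = "\n\n",              # assumed join separator
) -> pd.DataFrame:
    """Illustrative merge of chunk rows into one row per document."""
    keys = [*groupby_columns, chunk_id_column]

    # 1. Deduplicate: keep the first row for each (document, chunk) pair.
    df = df.drop_duplicates(subset=keys)

    # 2. Filter invalid chunks: drop empty text and "no content" markers.
    text = df[text_column].fillna("").astype(str).str.strip()
    df = df[(text != "") & ~text.isin(list(no_content_markers))]

    # 3. Sort so chunks appear in their original order within each document.
    df = df.sort_values(keys)

    # 4. Concatenate text per document; sum any numeric bookkeeping columns.
    agg = {text_column: separator.join}
    for col in sum_columns:
        agg[col] = "sum"
    return df.groupby(groupby_columns, sort=False, as_index=False).agg(agg)


chunks = pd.DataFrame(
    {
        "doc_id": ["a", "a", "a", "b", "b", "a"],
        "chunk_id": [1, 0, 2, 0, 1, 1],
        "text": ["world", "hello", "NO_CONTENT", "x", "", "world"],
        "tokens": [2, 1, 5, 1, 0, 2],
    }
)
merged = merge_chunks_sketch(chunks, ["doc_id"], sum_columns=("tokens",))
```

In this sketch, columns that are neither grouped on, concatenated, nor summed are dropped from the result; a real stage would likely need to decide how to carry such columns forward.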