nemo_curator.stages.math.modifiers.merge_chunks

Module Contents

Classes

| Name | Description |
| --- | --- |
| ChunkMergeStage | Merges chunked documents back into one row per document. |

API

class nemo_curator.stages.math.modifiers.merge_chunks.ChunkMergeStage(
    text_field: str = 'cleaned_text',
    raw_text_field: str | None = 'text',
    chunk_id_field: str = 'chunk_id',
    groupby_columns: list[str] | None = None,
    no_content_markers: list[str] | None = None,
    sum_columns: list[str] | None = None,
    max_text_length: int = 900000,
    separator: str = '\n'
)

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Merges chunked documents back into one row per document.

After LLM cleanup, the pipeline has multiple rows per document (one per chunk). This stage deduplicates, filters invalid chunks, sorts by chunk order, and concatenates text back into a single row per document.
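
The deduplicate, filter, sort, and concatenate steps described above can be illustrated with a minimal plain-Python sketch. This is not the stage's actual implementation: the `merge_chunks` helper, the list-of-dicts input, the `NO_CONTENT` marker value, and the rule of keeping the first duplicate are all assumptions made for illustration. Only the parameter names and defaults are taken from the signature above.

```python
from collections import defaultdict


def merge_chunks(rows, text_field="cleaned_text", chunk_id_field="chunk_id",
                 groupby_columns=("url",), no_content_markers=(), separator="\n"):
    """Hypothetical helper: merge chunk rows into one row per document."""
    groups = defaultdict(dict)  # document key -> {chunk_id: text}
    for row in rows:
        text = row.get(text_field)
        # Filter: skip empty chunks and chunks that consist only of a marker.
        if not text or text.strip() in no_content_markers:
            continue
        key = tuple(row[col] for col in groupby_columns)
        # Deduplicate (assumed rule): keep the first row per (document, chunk_id).
        groups[key].setdefault(row[chunk_id_field], text)

    merged = []
    for key, chunks in groups.items():
        # Sort by chunk id, then join the chunk texts with the separator.
        ordered = [chunks[cid] for cid in sorted(chunks)]
        out = dict(zip(groupby_columns, key))
        out[text_field] = separator.join(ordered)
        merged.append(out)
    return merged


rows = [
    {"url": "a", "chunk_id": 1, "cleaned_text": "second"},
    {"url": "a", "chunk_id": 0, "cleaned_text": "first"},
    {"url": "a", "chunk_id": 0, "cleaned_text": "first"},       # duplicate chunk
    {"url": "b", "chunk_id": 0, "cleaned_text": "NO_CONTENT"},  # assumed marker
    {"url": "b", "chunk_id": 1, "cleaned_text": "only"},
]
result = merge_chunks(rows, no_content_markers=("NO_CONTENT",))
```

In this sketch, document `a` ends up with its two distinct chunks joined in chunk order, and document `b` keeps only the chunk that is not a marker. How the real stage treats `raw_text_field`, `sum_columns`, and `max_text_length` is not covered here.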

groupby_columns = groupby_columns or ['url']
name = 'chunk_merge'
no_content_markers
sum_columns
nemo_curator.stages.math.modifiers.merge_chunks.ChunkMergeStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.math.modifiers.merge_chunks.ChunkMergeStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.math.modifiers.merge_chunks.ChunkMergeStage.process(
    batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch

Merge chunked rows back into one row per document.