nemo_curator.stages.text.modules.joiner
Module Contents
Classes
API
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Joins documents that share a common id back into a single document. The order of the segments within each joined document is determined by an additional segment_id column, and consecutive segments are joined with a separator. A maximum length can be specified to limit the size of the joined documents.
This stage performs the inverse operation of DocumentSplitter, allowing you to reconstruct documents from their segments.
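The behavior described above can be approximated in plain pandas. This is a minimal sketch, not the stage's actual implementation: the helper `join_segments` and its keyword names are hypothetical, and keeping the first value of every other column per document is an assumption about how the original fields are preserved.

```python
import pandas as pd


def join_segments(
    df: pd.DataFrame,
    separator: str = "\n\n",
    text_field: str = "text",              # illustrative name, defaults mirror the docs
    segment_id_field: str = "segment_id",
    document_id_field: str = "id",
    drop_segment_id_field: bool = True,
) -> pd.DataFrame:
    """Hypothetical helper: rebuild one row per document id from its segments."""
    # Order segments inside each document by their segment id.
    ordered = df.sort_values([document_id_field, segment_id_field])

    # Assumption: every non-text column keeps the value from the first segment.
    agg = {col: "first" for col in ordered.columns if col != document_id_field}
    # The text column is concatenated with the separator.
    agg[text_field] = separator.join

    joined = ordered.groupby(document_id_field, as_index=False).agg(agg)
    if drop_segment_id_field:
        joined = joined.drop(columns=[segment_id_field])
    return joined


segments = pd.DataFrame(
    {
        "id": [1, 0, 0, 1],
        "segment_id": [1, 1, 0, 0],
        "text": ["d", "b", "a", "c"],
        "source": ["y", "x", "x", "y"],
    }
)
result = join_segments(segments, separator=" ")
```

Here `result` holds one row per `id`, with the text of each document's segments concatenated in `segment_id` order and the `segment_id` column dropped.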
Parameters:
The separator to join the documents on.
The name of the column containing the text to join. Defaults to “text”.
The name of the column containing the segment id. Defaults to “segment_id”.
The name of the column containing the document id. Defaults to “id”.
Whether to drop the segment_id_field after joining. Defaults to True.
The maximum length of the joined documents. max_length and length_field must either both be specified or both be omitted.
The name of the column containing the length of the documents. max_length and length_field must either both be specified or both be omitted.
Join segments with max_length constraint.
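One plausible way to honor a max_length constraint is a greedy pass over a document's ordered segments, starting a new joined document whenever the next segment would exceed the limit. The sketch below is an illustration under stated assumptions, not the library's code: the function `join_with_max_length` is hypothetical, counting the separator toward the limit is an assumption, and so is keeping an oversized single segment as its own document.

```python
from typing import List


def join_with_max_length(
    segments: List[str],
    lengths: List[int],
    separator: str,
    max_length: int,
) -> List[str]:
    """Hypothetical greedy join of one document's segments.

    `segments` must already be ordered by segment id, and `lengths`
    holds the per-segment values taken from the length column.
    """
    joined: List[str] = []
    current: List[str] = []
    current_len = 0

    for text, length in zip(segments, lengths):
        # Assumption: the separator counts toward the limit once a group has content.
        extra = length + (len(separator) if current else 0)
        if current and current_len + extra > max_length:
            # Adding this segment would exceed the limit: close the current group.
            joined.append(separator.join(current))
            current, current_len = [text], length
        else:
            current.append(text)
            current_len += extra

    if current:
        joined.append(separator.join(current))
    return joined


parts = join_with_max_length(["aaa", "bb", "cccc", "d"], [3, 2, 4, 1], " ", 6)
```

With these inputs, `parts` contains two joined documents, each no longer than six characters including the separator.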
Define stage input requirements.
Define stage output specification.
Joins the documents back into a single document while preserving all the original fields.
Parameters:
Input batch to process
Returns: DocumentBatch
Batch with documents joined by document_id