stages.text.modules.joiner#

Module Contents#

Classes#

DocumentJoiner

Joins documents that have a common id back into a single document. The order of the documents is dictated by an additional segment_id column. A maximum length can be specified to limit the size of the joined documents.

API#

class stages.text.modules.joiner.DocumentJoiner#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

Joins documents that have a common id back into a single document. The order of the documents is dictated by an additional segment_id column. A maximum length can be specified to limit the size of the joined documents.

Segments are concatenated with the separator placed between them.

This stage performs the inverse operation of DocumentSplitter, allowing you to reconstruct documents from their segments.

Important: This stage assumes that all segments belonging to the same document are contained within a single DocumentBatch. Segments from the same document split across multiple batches will NOT be joined together. Ensure your batching logic keeps all segments of a document together.
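The single-batch assumption above can be checked before running the stage. The following is a minimal sketch in plain Python, not part of the library: the helper name find_split_documents and the representation of a batch as a list of row dictionaries are assumptions made for illustration only.

```python
from collections import defaultdict


def find_split_documents(batches, document_id_field="id"):
    """Return the document ids whose segments appear in more than one batch.

    `batches` is a hypothetical stand-in for a sequence of DocumentBatch
    objects: here each batch is simply a list of row dictionaries.
    """
    seen_in = defaultdict(set)
    for batch_index, rows in enumerate(batches):
        for row in rows:
            seen_in[row[document_id_field]].add(batch_index)
    return sorted(doc_id for doc_id, indexes in seen_in.items() if len(indexes) > 1)


batches = [
    [
        {"id": 1, "segment_id": 0, "text": "Hello"},
        {"id": 2, "segment_id": 0, "text": "Foo"},
    ],
    # The second segment of document 1 landed in a different batch,
    # so the joiner would never see both segments together.
    [{"id": 1, "segment_id": 1, "text": "World"}],
]

split_ids = find_split_documents(batches)
print(split_ids)
```

Any id reported by such a check would have to be re-batched so that all of its segments travel together.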

Example: If you have segments with document_id=1, segment_id=[0,1] and text=["Hello", "World"], they will be joined into a single row with document_id=1 and text="Hello\n\nWorld" (assuming separator="\n\n").
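The example above can be reproduced with a small sketch of the join semantics in plain Python. This is not the stage's implementation: the function join_segments and the row-dictionary representation are hypothetical, and taking the remaining fields from the first segment of each document is an assumption about how "preserving all the original fields" might behave.

```python
from itertools import groupby


def join_segments(
    rows,
    separator="\n\n",
    text_field="text",
    segment_id_field="segment_id",
    document_id_field="id",
    drop_segment_id_field=True,
):
    """Join rows that share a document id, ordered by segment id."""
    # Sort so that groupby sees each document's segments contiguously
    # and in segment order.
    ordered = sorted(rows, key=lambda r: (r[document_id_field], r[segment_id_field]))
    joined = []
    for _, group in groupby(ordered, key=lambda r: r[document_id_field]):
        segments = list(group)
        # Assumption: other fields are copied from the first segment.
        merged = dict(segments[0])
        merged[text_field] = separator.join(s[text_field] for s in segments)
        if drop_segment_id_field:
            merged.pop(segment_id_field, None)
        joined.append(merged)
    return joined


rows = [
    {"id": 1, "segment_id": 1, "text": "World"},
    {"id": 1, "segment_id": 0, "text": "Hello"},
]
result = join_segments(rows)
print(result)
```

The rows are deliberately given out of order to show that segment_id, not input order, decides the final text.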

Args:

separator (str): The separator to join the documents on.

text_field (str): The name of the column containing the text to join. Defaults to "text".

segment_id_field (str): The name of the column containing the segment id. Defaults to "segment_id".

document_id_field (str): The name of the column containing the document id. Defaults to "id".

drop_segment_id_field (bool): Whether to drop the segment_id_field after joining. Defaults to True.

max_length (int, optional): The maximum length of the joined documents. max_length and length_field must either both be specified or both be omitted.

length_field (str, optional): The name of the column containing the length of the documents. max_length and length_field must either both be specified or both be omitted.
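The pairing rule for max_length and length_field can be expressed as a short check. This is a sketch only: the helper name check_length_args is hypothetical, and the use of ValueError is an assumption, since the documentation does not say how the stage reports a mismatch.

```python
def check_length_args(max_length=None, length_field=None):
    """Raise if exactly one of max_length / length_field is provided."""
    # The two conditions differ only when one argument is set and the other is not.
    if (max_length is None) != (length_field is None):
        # Assumption: the real stage may raise a different exception type.
        raise ValueError(
            "max_length and length_field must be specified together or both omitted"
        )


check_length_args()                                    # neither: accepted
check_length_args(max_length=512, length_field="length")  # both: accepted

try:
    check_length_args(max_length=512)                  # only one: rejected
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```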

document_id_field: str#

'id'

drop_segment_id_field: bool#

True

inputs() tuple[list[str], list[str]]#

Define stage input requirements.

length_field: str | None#

None

max_length: int | None#

None
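One plausible way a maximum length could limit joined documents is greedy packing of consecutive segments. The sketch below illustrates that idea and is not taken from the library: the function pack_segments is hypothetical, and it assumes that the separator counts toward the length and that a single segment longer than max_length is kept as is. The actual stage may differ on both points.

```python
def pack_segments(segments, max_length, separator="\n\n"):
    """Greedily merge ordered (text, length) pairs without exceeding max_length.

    `segments` must already be sorted by segment id. The length values play
    the role of the column named by length_field.
    """
    joined = []
    current_text, current_length = None, 0
    for text, length in segments:
        if current_text is None:
            current_text, current_length = text, length
            continue
        # Assumption: the separator contributes to the joined length.
        proposed = current_length + len(separator) + length
        if proposed <= max_length:
            current_text = current_text + separator + text
            current_length = proposed
        else:
            # Adding this segment would exceed the limit: start a new document.
            joined.append((current_text, current_length))
            current_text, current_length = text, length
    if current_text is not None:
        joined.append((current_text, current_length))
    return joined


segments = [("aaaa", 4), ("bbbb", 4), ("cccc", 4)]
packed = pack_segments(segments, max_length=10)
print(packed)
```

With a limit of 10, the first two segments fit together (4 + 2 + 4), while the third starts a new joined document.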

name: str#

'document_joiner'

outputs() tuple[list[str], list[str]]#

Define stage output specification.

process(
batch: nemo_curator.tasks.DocumentBatch,
) nemo_curator.tasks.DocumentBatch#

Joins the documents back into a single document while preserving all the original fields.

Args: batch (DocumentBatch): Input batch to process

Returns: DocumentBatch: Batch with documents joined by document_id

segment_id_field: str#

'segment_id'

separator: str = <Multiline-String>#

text_field: str#

'text'