nemo_curator.stages.text.modules.joiner

Module Contents

Classes

Name | Description
DocumentJoiner | Joins documents that have a common id back into a single document.

API

class nemo_curator.stages.text.modules.joiner.DocumentJoiner(
separator: str = '\n\n',
text_field: str = 'text',
segment_id_field: str = 'segment_id',
document_id_field: str = 'id',
drop_segment_id_field: bool = True,
max_length: int | None = None,
length_field: str | None = None,
name: str = 'document_joiner'
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Joins documents that have a common id back into a single document. The order of the segments within each document is determined by an additional segment_id column, and the segments are concatenated with the separator between them. A maximum length can be specified to limit the size of the joined documents.

This stage performs the inverse operation of DocumentSplitter, allowing you to reconstruct documents from their segments.
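
As a rough illustration of the behaviour described above, the following sketch reassembles segments using plain Python dictionaries in place of a DocumentBatch. The helper name `join_segments` and the sample rows are hypothetical; the sketch only mirrors the documented defaults ('text', 'segment_id', 'id', '\n\n') and is not the library's implementation.

```python
from collections import defaultdict


def join_segments(rows, separator="\n\n", text_field="text",
                  segment_id_field="segment_id", document_id_field="id",
                  drop_segment_id_field=True):
    """Hypothetical helper: rebuild one row per document id from segment rows."""
    # Group the segment rows by their document id (insertion order is kept).
    groups = defaultdict(list)
    for row in rows:
        groups[row[document_id_field]].append(row)

    joined = []
    for segments in groups.values():
        # Order the segments by segment id before concatenating them.
        ordered = sorted(segments, key=lambda r: r[segment_id_field])
        # Copy the first segment so the remaining fields are carried over.
        out = dict(ordered[0])
        out[text_field] = separator.join(s[text_field] for s in ordered)
        if drop_segment_id_field:
            out.pop(segment_id_field, None)
        joined.append(out)
    return joined


# Made-up sample rows; the "source" column stands in for any extra field.
rows = [
    {"id": "a", "segment_id": 1, "text": "second part", "source": "web"},
    {"id": "a", "segment_id": 0, "text": "first part", "source": "web"},
    {"id": "b", "segment_id": 0, "text": "only part", "source": "book"},
]
result = join_segments(rows)
```

In this sketch the extra `source` column is taken from the first segment of each document, which is one simple way to keep the original fields; the actual stage may resolve them differently.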

Parameters:

separator
str, defaults to '\n\n'

The separator inserted between segments when they are joined.

text_field
str, defaults to 'text'

The name of the column containing the text to join.

segment_id_field
str, defaults to 'segment_id'

The name of the column containing the segment id.

document_id_field
str, defaults to 'id'

The name of the column containing the document id.

drop_segment_id_field
bool, defaults to True

Whether to drop the segment_id_field after joining.

max_length
int | None, defaults to None

The maximum length of the joined documents. max_length and length_field must either both be specified or both be left as None.

length_field
str | None, defaults to None

The name of the column containing the length of the documents. max_length and length_field must either both be specified or both be left as None.
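
The "both or neither" rule for max_length and length_field can be sketched as a dataclass check. The class name `JoinerConfig` is hypothetical, and raising `ValueError` is an assumption: this page does not say how the real stage reports a violation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JoinerConfig:
    """Hypothetical stand-in holding only the two length-related parameters."""

    max_length: Optional[int] = None
    length_field: Optional[str] = None

    def __post_init__(self):
        # Invalid exactly when one of the two is set and the other is not.
        if (self.max_length is None) != (self.length_field is None):
            # Assumption: the error type is not documented on this page.
            raise ValueError(
                "max_length and length_field must both be specified or both be None"
            )


JoinerConfig()                                        # neither set: accepted
JoinerConfig(max_length=512, length_field="length")   # both set: accepted

try:
    JoinerConfig(max_length=512)                      # only one set: rejected
    rejected = False
except ValueError:
    rejected = True
```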

document_id_field: str = 'id'
drop_segment_id_field: bool = True
length_field: str | None = None
max_length: int | None = None
name: str = 'document_joiner'
segment_id_field: str = 'segment_id'
separator: str = '\n\n'
text_field: str = 'text'

nemo_curator.stages.text.modules.joiner.DocumentJoiner.__post_init__()

nemo_curator.stages.text.modules.joiner.DocumentJoiner._join_segments(
group: pandas.DataFrame
) -> pandas.DataFrame

Join segments with max_length constraint.
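
This page does not spell out how the max_length constraint is applied. One plausible reading is a greedy packing of consecutive segments, sketched below with plain tuples instead of a pandas.DataFrame. The function name `pack_segments` is hypothetical, and two choices here are assumptions rather than documented behaviour: the separator is not counted toward the length, and a single segment longer than max_length is kept on its own.

```python
def pack_segments(segments, max_length, separator="\n\n"):
    """Hypothetical greedy packing.

    `segments` is a list of (text, length) pairs, already in segment order.
    Returns a list of (joined_text, total_length) pairs.
    """
    joined = []
    current_text, current_length = None, 0
    for text, length in segments:
        if current_text is None:
            # The first segment always starts a new joined document.
            current_text, current_length = text, length
            continue
        # Assumption: only segment lengths count, not the separator.
        proposed = current_length + length
        if proposed <= max_length:
            current_text = current_text + separator + text
            current_length = proposed
        else:
            # Adding this segment would exceed the limit: flush and restart.
            joined.append((current_text, current_length))
            current_text, current_length = text, length
    if current_text is not None:
        joined.append((current_text, current_length))
    return joined


packed = pack_segments(
    [("aa", 2), ("bbb", 3), ("cccc", 4), ("d", 1)],
    max_length=5,
    separator=" ",
)
```

With these made-up inputs the four segments collapse into two joined documents, each with a total length of 5.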

nemo_curator.stages.text.modules.joiner.DocumentJoiner.inputs() -> tuple[list[str], list[str]]

Define stage input requirements.

nemo_curator.stages.text.modules.joiner.DocumentJoiner.outputs() -> tuple[list[str], list[str]]

Define stage output specification.

nemo_curator.stages.text.modules.joiner.DocumentJoiner.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch

Joins the documents back into a single document while preserving all the original fields.

Parameters:

batch
DocumentBatch

Input batch to process

Returns: DocumentBatch

Batch with documents joined by document_id