nemo_curator.stages.text.experimental.translation.stages.reassembly

Reassemble translated segments back into document rows.

Module Contents

Classes

| Name | Description |
| --- | --- |
| ReassemblyStage | Collapse segment rows back into one row per document. |

Data

_FAITH_SCORE_COLUMNS

_INTERNAL_COLUMNS

API

class nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage(
name: str = 'ReassemblyStage',
text_field: str = 'text',
output_field: str = 'translated_text',
replace_source_fields: bool = False,
emit_metadata_helpers: bool = False,
aggregate_faith_scores: bool = False
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Collapse segment rows back into one row per document.

aggregate_faith_scores: bool = False
emit_metadata_helpers: bool = False
name: str = 'ReassemblyStage'
output_field: str = 'translated_text'
replace_source_fields: bool = False
text_field: str = 'text'

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._average_faith_scores(
segment_scores: list[dict[str, float]]
) -> dict[str, float]
staticmethod

Average FAITH scores across segments, ignoring zero-valued dimensions.
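
The averaging described above can be illustrated with a minimal standalone sketch. It is not the library's implementation: the function name `average_faith_scores` is hypothetical, and it assumes that a zero value means "not scored" for that segment, and that a dimension with no non-zero values averages to 0.0.

```python
def average_faith_scores(segment_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension over the segments where it is non-zero."""
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for scores in segment_scores:
        for dimension, value in scores.items():
            # Register the dimension even if this segment did not score it.
            totals.setdefault(dimension, 0.0)
            counts.setdefault(dimension, 0)
            if value != 0:  # assumption: zero means "not scored"
                totals[dimension] += value
                counts[dimension] += 1
    # Assumption: a dimension that was never scored averages to 0.0.
    return {
        dimension: (totals[dimension] / counts[dimension] if counts[dimension] else 0.0)
        for dimension in totals
    }
```

Under these assumptions, a segment that scored 0 for a dimension does not pull that dimension's document-level average down.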

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._base_output_row(
group: pandas.DataFrame
) -> dict[str, typing.Any]

Create the common output row fields for one document group.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._build_reassembled_row(
group: pandas.DataFrame
) -> dict[str, typing.Any]

Build one output document row from its segment group.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._build_segment_pairs(
metadata: dict[str, typing.Any],
translated_segments: list[str]
) -> list[dict[str, str]]
staticmethod

Build [{src, tgt}, ...] pairs for one field entry.
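
As a rough illustration of the `[{src, tgt}, ...]` shape, the sketch below pairs source segments with their translations. The layout of `metadata` is not documented here, so this hypothetical helper takes the source segments directly as a list instead of extracting them from metadata; the `src` and `tgt` key names follow the description above.

```python
def build_segment_pairs(
    source_segments: list[str],
    translated_segments: list[str],
) -> list[dict[str, str]]:
    """Pair each source segment with the translation at the same position."""
    # zip stops at the shorter list, so unmatched trailing segments are dropped.
    return [
        {"src": src, "tgt": tgt}
        for src, tgt in zip(source_segments, translated_segments)
    ]
```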

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._build_skip_row(
out_row: dict[str, typing.Any],
group: pandas.DataFrame
) -> dict[str, typing.Any]

Build passthrough output for rows marked as skipped.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._build_translation_maps(
metadata: dict[str, typing.Any],
translated_segments: list[str],
out_row: dict[str, typing.Any]
) -> tuple[dict[str, typing.Any], dict[str, typing.Any]]

Reassemble translated text and return metadata helper maps.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._collect_multi_field_outputs(
metadata: dict[str, typing.Any],
translated_segments: list[str]
) -> tuple[dict[str, list[str]], dict[str, typing.Any], dict[str, typing.Any], int]

Collect per-field reassembled text and helper maps.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._compute_faith_avg(
scores: dict[str, float]
) -> float
staticmethod

Compute faith_avg as the mean of non-zero FAITH dimensions.
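
A minimal sketch of this calculation, written independently of the library: the function name `compute_faith_avg` is hypothetical, and returning 0.0 when every dimension is zero (or the mapping is empty) is an assumption, since that case is not documented here.

```python
def compute_faith_avg(scores: dict[str, float]) -> float:
    """Mean of the non-zero values in a FAITH score mapping."""
    non_zero = [value for value in scores.values() if value != 0]
    # Assumption: with nothing to average, fall back to 0.0.
    return sum(non_zero) / len(non_zero) if non_zero else 0.0
```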

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._count_segments_in_meta(
fm: dict[str, typing.Any]
) -> int
staticmethod

Count the translatable segments expected by one field entry.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._leaf_field_key(
field_path: str
) -> str
staticmethod

Return the metadata key for field_path.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._reassemble_coarse(
metadata: dict[str, typing.Any],
translated_segments: list[str]
) -> str
staticmethod

Reconstruct a document from coarse-mode metadata.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._reassemble_fine(
metadata: dict[str, typing.Any],
translated_segments: list[str]
) -> str
staticmethod

Reconstruct a document from fine-mode metadata.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._reassemble_multi_field(
metadata: dict[str, typing.Any],
translated_segments: list[str],
out_row: dict[str, typing.Any]
) -> tuple[dict[str, typing.Any], dict[str, typing.Any]]

Reassemble one or more translated field paths.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._reassemble_single_field(
metadata: dict[str, typing.Any],
translated_segments: list[str],
out_row: dict[str, typing.Any]
) -> tuple[dict[str, typing.Any], dict[str, typing.Any]]

Reassemble a single translated field.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._warn_for_unconsumed_segments(
seg_offset: int,
translated_segments: list[str]
) -> None
staticmethod

Log when multi-field metadata did not consume all translated segments.
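
The check can be pictured as comparing the final segment offset with the number of translated segments. The sketch below is an assumption about that logic, not the library's code: it uses the standard `logging` module (the library may use a different logger), and unlike the documented method it returns the leftover count so the result is easy to inspect.

```python
import logging

logger = logging.getLogger(__name__)


def warn_for_unconsumed_segments(seg_offset: int, translated_segments: list[str]) -> int:
    """Log a warning if segments remain past seg_offset; return how many."""
    leftover = max(len(translated_segments) - seg_offset, 0)
    if leftover:
        logger.warning(
            "%d translated segment(s) were not consumed by field metadata", leftover
        )
    return leftover
```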

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._write_aggregated_faith_scores(
out_row: dict[str, typing.Any],
group: pandas.DataFrame
) -> None
classmethod

Aggregate segment-level FAITH scores into one document-level record.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._write_multi_field_payload(
out_row: dict[str, typing.Any],
reassembled_by_path: dict[str, list[str]],
translation_map: dict[str, typing.Any]
) -> None

Write reassembled multi-field output payload into out_row.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._write_nested_field_payload(
out_row: dict[str, typing.Any],
field_path: str,
texts: list[str]
) -> object

Write a nested or wildcard field payload.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage._write_one_field_payload(
out_row: dict[str, typing.Any],
field_path: str,
texts: list[str]
) -> object

Write one reassembled field and return its output payload value.

nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.experimental.translation.stages.reassembly.ReassemblyStage.process(
batch: nemo_curator.tasks.document.DocumentBatch
) -> nemo_curator.tasks.document.DocumentBatch

Reassemble translated segments into full documents.
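
Conceptually, reassembly groups segment rows by document and merges each group into one row. The sketch below shows that idea with plain dictionaries. It is a simplification under stated assumptions: `_seg_doc_id` and `_translated` appear in `_INTERNAL_COLUMNS`, but their exact roles are assumed here; the `doc_id` output key and the space-joined text are hypothetical, and the real stage reconstructs text from segmentation metadata instead.

```python
def collapse_segments(rows: list[dict[str, object]]) -> list[dict[str, object]]:
    """Collapse segment rows into one row per document, in first-seen order."""
    grouped: dict[object, list[str]] = {}
    for row in rows:
        # Assumption: _seg_doc_id identifies the parent document of a segment.
        grouped.setdefault(row["_seg_doc_id"], []).append(str(row["_translated"]))
    # Assumption: joining with a single space stands in for real reassembly.
    return [
        {"doc_id": doc_id, "translated_text": " ".join(parts)}
        for doc_id, parts in grouped.items()
    ]
```

The `translated_text` key mirrors the default value of `output_field`.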

nemo_curator.stages.text.experimental.translation.stages.reassembly._FAITH_SCORE_COLUMNS = {'faith_fluency': 'Fluency', 'faith_accuracy': 'Accuracy', 'faith_idiomaticity':...
nemo_curator.stages.text.experimental.translation.stages.reassembly._INTERNAL_COLUMNS = {'_seg_segments', '_seg_metadata', '_seg_doc_id', '_translated', '_translation_t...