nemo_curator.stages.text.experimental.translation.stages.segmentation

View as Markdown

SegmentationStage — splits documents into translatable segments.

Supports two modes:

  • coarse — line-level splitting with code-block awareness.
  • fine — sentence-level splitting via spaCy with exact-structure preservation.

Multi-field and wildcard-path support allows translating nested structures such as messages.*.content without manual flattening.

Module Contents

Classes

NameDescription
SegmentationStageSplit documents into translatable segments.

Functions

NameDescription
_append_stripped_unitAppend a text unit while preserving leading/trailing whitespace.
_get_spacy_nlpLazy-load a spaCy model for the given source language.
_resolve_spacy_model_nameResolve the spaCy model name for the given language.
_spacy_units_with_separatorsReturn spaCy sentence text plus the exact following separator.
_split_unit_on_special_separatorsSplit one spaCy unit on custom separators while preserving structure.
is_line_translatable_contentDetermine whether line contains translatable content.
split_into_sentences_with_structureSplit text using spaCy, then apply custom regex patterns while preserving exact structure.

Data

SPACY_FALLBACK_MODEL

SPACY_LANG_MODELS

_nlp_cache

API

class nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage(
name: str = 'SegmentationStage',
source_lang: str,
text_field: str | list[str] = 'text',
mode: str = 'coarse',
min_segment_chars: int = 0,
skipme_field: str | None = None
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Split documents into translatable segments.

Each input row is exploded into N output rows (one per translatable segment). Reconstruction metadata is stored as a JSON string in the _seg_metadata column so that :class:ReassemblyStage can later collapse the rows back into whole documents.

min_segment_chars
int = 0
mode
str = 'coarse'
name
str = 'SegmentationStage'
skipme_field
str | None = None
source_lang
str
text_field
str | list[str] = 'text'
nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage.__post_init__() -> None
nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._build_metadata_json(
field_metadatas: list[dict[str, typing.Any]]
) -> str
staticmethod

Serialize the per-field metadata envelope for one source document.

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._build_output_rows(
original_cols: dict[str, typing.Any],
segments: list[str],
metadata_json: str,
doc_idx: int
) -> list[dict[str, typing.Any]]
staticmethod

Create exploded output rows for the segmented document.

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._build_skip_output_row(
original_cols: dict[str, typing.Any],
doc_idx: int
) -> dict[str, typing.Any] | None

Return a passthrough row when skipme_field marks the document.

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._collect_document_segments(
row: pandas.Series,
field_paths: list[str]
) -> tuple[list[str], list[dict[str, typing.Any]]]

Collect translated segments and metadata for all requested field paths.

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._extract_texts(
row: pandas.Series,
field_path: str
) -> list[str]
staticmethod

Extract translatable text(s) from a row given a field_path.

If field_path is a simple column name (no wildcard), the column value is returned directly. If it is a wildcard dot-path, the root column is parsed as structured data (dict or JSON string) and :func:extract_nested_fields is used to pull matching string values.

Parameters:

row
pd.Series

A single DataFrame row.

field_path
str

A plain column name or a wildcard dot-path.

Returns: list[str]

A list of string texts extracted from the row.

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._segment_coarse(
text: str
) -> tuple[list[str], str]

Line-level segmentation with code-block awareness.

Returns: list[str]

A tuple of (segments, metadata_json) where segments is a list of

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._segment_document(
row: pandas.Series,
field_paths: list[str],
doc_idx: int
) -> tuple[list[dict[str, typing.Any]], int]

Segment a single source document and emit exploded output rows.

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._segment_fine(
text: str
) -> tuple[list[str], str]

Sentence-level segmentation via spaCy with exact-structure preservation.

Returns: list[str]

A tuple of (segments, metadata_json) where segments is a list of

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage._segment_text(
text: str,
field_path: str
) -> tuple[list[str], dict[str, typing.Any]]

Segment one extracted text value and attach its field path.

nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.experimental.translation.stages.segmentation.SegmentationStage.process(
batch: nemo_curator.tasks.document.DocumentBatch
) -> nemo_curator.tasks.document.DocumentBatch

Segment each document into translatable units.

For each row in batch.data:

  1. If skipme_field is set and the row is flagged, pass through with an empty segment.
  2. Resolve text_field — may be a plain column, a wildcard path into structured data, or a list of paths (multi-field).
  3. Apply coarse or fine segmentation to each extracted text.
  4. Explode: one output row per segment.
nemo_curator.stages.text.experimental.translation.stages.segmentation._append_stripped_unit(
units: list[tuple[str, str]],
text_unit: str,
separator: str
) -> None

Append a text unit while preserving leading/trailing whitespace.

nemo_curator.stages.text.experimental.translation.stages.segmentation._get_spacy_nlp(
src_lang: str = 'en',
max_length: int | None = None
) -> object

Lazy-load a spaCy model for the given source language.

Parameters:

src_lang
strDefaults to 'en'

ISO 639-1 language code (e.g. 'en', 'de', 'hi').

max_length
int | NoneDefaults to None

Optional override for nlp.max_length on the cached instance created for this call.

Returns: object

A loaded spaCy Language model appropriate for src_lang.

nemo_curator.stages.text.experimental.translation.stages.segmentation._resolve_spacy_model_name(
src_lang: str = 'en'
) -> str

Resolve the spaCy model name for the given language.

nemo_curator.stages.text.experimental.translation.stages.segmentation._spacy_units_with_separators(
text: str,
spacy_sentences: list[object]
) -> list[tuple[str, str]]

Return spaCy sentence text plus the exact following separator.

nemo_curator.stages.text.experimental.translation.stages.segmentation._split_unit_on_special_separators(
sent_text: str,
sent_separator: str,
special_separator_pattern: str
) -> list[tuple[str, str]]

Split one spaCy unit on custom separators while preserving structure.

nemo_curator.stages.text.experimental.translation.stages.segmentation.is_line_translatable_content(
line: str
) -> bool

Determine whether line contains translatable content.

Returns False for lines that have no alphabetic characters or that look like XML/HTML tags (e.g. <tag>). Structured JSON blobs are also treated as non-translatable so tool payloads and machine-readable content are preserved verbatim.

nemo_curator.stages.text.experimental.translation.stages.segmentation.split_into_sentences_with_structure(
text: str,
src_lang: str = 'en'
) -> list[tuple[str, str]]

Split text using spaCy, then apply custom regex patterns while preserving exact structure.

Returns a list of (sentence_text, separator_after) tuples such that "".join(t + s for t, s in result) reconstructs the original text.

Parameters:

text
str

The text to split into sentences.

src_lang
strDefaults to 'en'

ISO 639-1 language code for loading the appropriate spaCy model.

nemo_curator.stages.text.experimental.translation.stages.segmentation.SPACY_FALLBACK_MODEL: str = 'xx_sent_ud_sm'
nemo_curator.stages.text.experimental.translation.stages.segmentation.SPACY_LANG_MODELS: dict[str, str] = {'en': 'en_core_web_sm', 'de': 'de_core_news_sm', 'fr': 'fr_core_news_sm', 'es':...
nemo_curator.stages.text.experimental.translation.stages.segmentation._nlp_cache: dict[tuple[str, int | None], object] = {}