nemo_curator.stages.text.experimental.translation.stages.segmentation
nemo_curator.stages.text.experimental.translation.stages.segmentation
SegmentationStage — splits documents into translatable segments.
Supports two modes:
- coarse — line-level splitting with code-block awareness.
- fine — sentence-level splitting via spaCy with exact-structure preservation.
Multi-field and wildcard-path support allows translating nested structures
such as messages.*.content without manual flattening.
Module Contents
Classes
Functions
Data
API
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Split documents into translatable segments.
Each input row is exploded into N output rows (one per translatable
segment). Reconstruction metadata is stored as a JSON string in the
_seg_metadata column so that :class:ReassemblyStage can later
collapse the rows back into whole documents.
Serialize the per-field metadata envelope for one source document.
Create exploded output rows for the segmented document.
Return a passthrough row when skipme_field marks the document.
Collect translated segments and metadata for all requested field paths.
Extract translatable text(s) from a row given a field_path.
If field_path is a simple column name (no wildcard), the column
value is returned directly. If it is a wildcard dot-path, the root
column is parsed as structured data (dict or JSON string) and
:func:extract_nested_fields is used to pull matching string values.
Parameters:
A single DataFrame row.
A plain column name or a wildcard dot-path.
Returns: list[str]
A list of string texts extracted from the row.
Line-level segmentation with code-block awareness.
Returns: list[str]
A tuple of (segments, metadata_json) where segments is a list of
Segment a single source document and emit exploded output rows.
Sentence-level segmentation via spaCy with exact-structure preservation.
Returns: list[str]
A tuple of (segments, metadata_json) where segments is a list of
Segment one extracted text value and attach its field path.
Segment each document into translatable units.
For each row in batch.data:
- If
skipme_fieldis set and the row is flagged, pass through with an empty segment. - Resolve
text_field— may be a plain column, a wildcard path into structured data, or a list of paths (multi-field). - Apply coarse or fine segmentation to each extracted text.
- Explode: one output row per segment.
Append a text unit while preserving leading/trailing whitespace.
Lazy-load a spaCy model for the given source language.
Parameters:
ISO 639-1 language code (e.g. 'en', 'de', 'hi').
Optional override for nlp.max_length on the cached
instance created for this call.
Returns: object
A loaded spaCy Language model appropriate for src_lang.
Resolve the spaCy model name for the given language.
Return spaCy sentence text plus the exact following separator.
Split one spaCy unit on custom separators while preserving structure.
Determine whether line contains translatable content.
Returns False for lines that have no alphabetic characters or that
look like XML/HTML tags (e.g. <tag>). Structured JSON blobs are also
treated as non-translatable so tool payloads and machine-readable content
are preserved verbatim.
Split text using spaCy, then apply custom regex patterns while preserving exact structure.
Returns a list of (sentence_text, separator_after) tuples such that
"".join(t + s for t, s in result) reconstructs the original text.
Parameters:
The text to split into sentences.
ISO 639-1 language code for loading the appropriate spaCy model.