Translation
Use NeMo Curator’s translation package to translate flat text fields or structured records, such as chat conversations stored under messages.*.content.
Translation is currently an experimental text stage. Import it from `nemo_curator.stages.text.experimental.translation`; APIs and output details may change while the workflow is being validated.
The experimental translation package is centered on TranslationStage, which composes segmentation, translation, reassembly, output formatting, and optional evaluation into a reusable text-processing stage.
Capabilities
- Translate a single text field such as `text`, a nested field path such as `metadata.body`, or wildcard paths such as `messages.*.content`
- Preserve machine-readable payloads, including valid JSON objects and arrays, instead of sending them to the translation model
- Emit translated output in `replaced`, `raw`, or `both` modes
- Emit segmented translation mappings for inspection or downstream evaluation
- Run FAITH scoring on translated text with `FaithEvalFilter`
- Score forward and reverse translation quality with `TextQualityMetricStage`
Before You Start
- Install translation extras as needed:
  - `translation_common` for basic translation support
  - `translation_metrics` for `TextQualityMetricStage`
  - `translation_segmentation` for `segmentation_mode="fine"`
  - `translation_aws`, `translation_google`, or `translation_nmt` for those backends
  - `translation_all` for the full translation feature set
- For `backend_type="llm"`, configure an OpenAI-compatible async client. Refer to LLM Client Setup.
- If `enable_faith_eval=True`, configure an LLM client and scoring model even when translation itself uses a non-LLM backend.
- For non-LLM backends, use one of the built-in backend types: `google`, `aws`, or `nmt`.
- Input data is typically newline-delimited JSON with a `text` field or another field referenced through `text_field`.
Basic Translation Pipeline
The example below reads JSONL files, translates messages.*.content from English to Hindi, runs FAITH scoring, and writes the results back to JSONL.
`source_lang` and `target_lang` are required. Curator does not assume default translation languages.
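A minimal sketch of that stage configuration follows. The import path and parameter names come from this page; the values are illustrative, and the reader/writer wiring around the stage is omitted because it depends on your pipeline setup.

```python
# Sketch of the translation stage for the example described above.
# Import path and parameter names are documented; values are illustrative.
from nemo_curator.stages.text.experimental.translation import TranslationStage

translation = TranslationStage(
    text_field="messages.*.content",  # wildcard path into chat records
    source_lang="en",                 # required: Curator assumes no defaults
    target_lang="hi",
    output_mode="both",               # keep replaced text plus metadata
    enable_faith_eval=True,           # FAITH scoring on translated output
)
```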
Structured Translation Behavior
Structured translation works directly on nested records.
- Setting `text_field="messages.*.content"` extracts every message content string from the record
- Valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim
- When `output_mode="replaced"`, translated values are written back into the original field path
- When `output_mode="raw"` or `output_mode="both"`, Curator emits `translation_metadata` with whole-text and segmented mappings
This makes the pipeline suitable for chat-style records where natural-language turns should be translated but tool payloads should remain untouched.
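The selection behavior can be modeled with a short sketch. `extract_content` below is a hypothetical helper, not a Curator API; it only illustrates which strings a wildcard path selects and why a valid JSON payload is left alone:

```python
# Illustrative model of wildcard-path selection over a chat record.
# extract_content is a hypothetical helper, not part of NeMo Curator.
import json

def extract_content(record, path="messages.*.content"):
    head, _, leaf = path.partition(".*.")
    selected = []
    for item in record[head]:
        value = item[leaf]
        try:
            payload = json.loads(value)
        except ValueError:
            payload = None
        if isinstance(payload, (dict, list)):
            continue                    # valid JSON payload: preserved verbatim
        selected.append(value)          # natural-language turn: translated
    return selected

record = {
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"},
        {"role": "tool", "content": '{"temp_c": 21, "sky": "clear"}'},
        {"role": "assistant", "content": "It is 21 degrees and clear."},
    ]
}
print(extract_content(record))
```

Only the two natural-language turns are selected; the tool payload stays untouched.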
Segmentation and Output Control
TranslationStage exposes a few important controls:
- `segmentation_mode="coarse"` keeps line-level splitting with code-block awareness
- `segmentation_mode="fine"` uses sentence-level segmentation with structure preservation
- `min_segment_chars` bypasses segmentation for short text
- `enable_faith_eval=True` runs FAITH on exploded segment rows before reassembly, which avoids long-context scoring requests
- `reconstruct_messages=True` rebuilds translated message lists for structured chat-style inputs
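Coarse segmentation's code-block awareness can be illustrated with a toy model (this is not Curator's implementation):

```python
# Toy model of "coarse" segmentation: line-level splits, with fenced code
# blocks kept out of the translatable segments. Illustration only.
def coarse_segments(text):
    segments, in_code = [], False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            in_code = not in_code       # entering or leaving a code fence
            continue
        if in_code or not line.strip():
            continue                    # code and blank lines: not translatable
        segments.append(line)
    return segments

doc = "Hello world.\n```python\nprint('hi')\n```\nGoodbye."
print(coarse_segments(doc))
```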
DocumentBatch Walkthrough
TranslationStage operates on a DocumentBatch, which is a task wrapper around a pandas DataFrame or Arrow table.
The wrapper fields stay mostly constant across stages:
- `task_id`
- `dataset_name`
- `_stage_perf`
- `_metadata`
The main thing that changes is the DataFrame inside DocumentBatch.data.
Worked Example
Assume a single input row and this pipeline configuration:
Input row:
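A hypothetical single-row batch of the kind this walkthrough assumes (column values invented):

```python
# Invented input row: one document containing a code fence, so segmentation
# has both translatable lines and non-translatable content to handle.
import pandas as pd

batch_df = pd.DataFrame([{
    "id": "doc-1",
    "text": "Hello world.\n```python\nprint('hi')\n```\nGoodbye.",
}])
print(batch_df)
```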
1. SegmentationStage
The batch changes from one row per document to one row per translatable segment.
New columns:
- `_seg_segments`
- `_seg_metadata`
- `_seg_doc_id`
Example output:
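A hypothetical exploded batch after segmentation might look like this (the `_seg_metadata` template shown is an invented placeholder, not the real format):

```python
# Invented segment rows: one row per translatable segment, all tied back to
# the same source document. The metadata template string is a placeholder.
import pandas as pd

segments = pd.DataFrame({
    "_seg_doc_id":   ["doc-1", "doc-1"],
    "_seg_segments": ["Hello world.", "Goodbye."],
    "_seg_metadata": ['{"template": ["<0>", "<code block>", "<1>"]}'] * 2,
})
print(segments)
```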
Important details:
- Valid JSON payloads and fenced code blocks are not emitted as translatable segment rows.
- `_seg_doc_id` ties all segment rows back to the same source document.
- `_seg_metadata` is a JSON reconstruction template duplicated across the segment rows for that document.
For coarse segmentation, the metadata contains the original non-translated lines plus placeholders for translatable lines. For fine segmentation, it stores sentence-like units and their separators.
2. SegmentTranslationStage
The batch still has one row per segment, but each segment now gets its translated text and per-segment runtime/error data.
New columns:
- `_translated`
- `_translation_time`
- `_translation_error`
Example output:
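Hypothetical segment rows after translation (the translated strings, timings, and empty error fields are invented):

```python
# Invented segment rows after translation: each segment now carries its
# translated text plus per-segment runtime and error data.
import pandas as pd

translated = pd.DataFrame({
    "_seg_doc_id":       ["doc-1", "doc-1"],
    "_seg_segments":     ["Hello world.", "Goodbye."],
    "_translated":       ["नमस्ते दुनिया।", "अलविदा।"],
    "_translation_time": [0.41, 0.28],
    "_translation_error": ["", ""],
})
print(translated)
```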
3. FaithEvalFilter
When enable_faith_eval=True, FAITH runs on the exploded segment rows before reassembly.
New columns:
- `faith_fluency`
- `faith_accuracy`
- `faith_idiomaticity`
- `faith_terminology`
- `faith_handling_of_format`
- `faith_avg`
- `faith_parse_failed`
Each row is scored independently. Filtering does not happen yet, because dropping segment rows before reassembly would corrupt the reconstructed document.
4. ReassemblyStage
The batch collapses back to one row per original document by grouping on _seg_doc_id.
Removed internal columns:
- `_seg_segments`
- `_seg_metadata`
- `_seg_doc_id`
- `_translated`
- `_translation_time`
- `_translation_error`
Added document-level columns:
- `translated_text`
- `translation_time`
- `translation_errors`
- `_translation_map`
- `_segmented_translation_map`
- `faith_segment_scores` when FAITH is enabled
- aggregated `faith_*` columns when FAITH is enabled
Example output:
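The aggregation can be sketched with plain pandas. This is a toy model of reassembly over invented segment rows, not Curator's code:

```python
# Toy reassembly: group on _seg_doc_id, join translated segments, sum the
# segment times, and join any non-empty segment errors.
import pandas as pd

seg = pd.DataFrame({
    "_seg_doc_id":        ["doc-1", "doc-1"],
    "_translated":        ["नमस्ते दुनिया।", "अलविदा।"],
    "_translation_time":  [0.41, 0.28],
    "_translation_error": ["", ""],
})

docs = seg.groupby("_seg_doc_id").agg(
    translated_text=("_translated", "\n".join),
    translation_time=("_translation_time", "sum"),
    translation_errors=("_translation_error",
                        lambda e: "; ".join(x for x in e if x)),
).reset_index()
print(docs)
```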
Important details:
- `translation_time` is the sum of the segment-level translation times.
- `translation_errors` joins any non-empty segment errors.
- `_translation_map` and `_segmented_translation_map` are helper columns used later to build `translation_metadata`.
- When FAITH is enabled, reassembly averages the per-segment FAITH scores into document-level `faith_*` columns and writes the raw per-segment list to `faith_segment_scores`.
- For structured fields such as `messages.*.content`, reassembly writes translations back into the nested structure instead of only returning a flat string.
5. FaithThresholdFilterStage
When FAITH filtering is enabled, the threshold filter runs after reassembly on the aggregated FAITH score.
Rows with faith_avg < faith_threshold are dropped here. Rows with parse failures or no scored segments are preserved.
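A toy model of that filtering rule (the threshold value and scores are invented):

```python
# Illustrative threshold filter: drop rows below the threshold, but keep
# rows whose FAITH scores failed to parse.
import pandas as pd

faith_threshold = 3.5   # invented value
df = pd.DataFrame({
    "id":                ["a", "b", "c"],
    "faith_avg":         [4.2, 2.0, None],   # None: nothing was scored
    "faith_parse_failed": [False, False, True],
})
# NaN comparisons are False, so unscored rows survive via the parse flag.
keep = df["faith_parse_failed"] | (df["faith_avg"] >= faith_threshold)
print(df[keep]["id"].tolist())
```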
6. FormatTranslationOutputStage
When output_mode="raw" or output_mode="both", this stage builds:
`translation_metadata`

`translation_metadata` contains:

- `target_lang`
- `translation`
- `segmented_translation`
When output_mode="raw", the final translated text column is dropped after metadata is constructed. This is useful when downstream consumers want metadata-rich output without keeping a separate top-level translated text column.
This stage also drops internal helper columns such as:
- `_translation_map`
- `_segmented_translation_map`
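A hypothetical `translation_metadata` payload with the documented keys (the exact shape of the mapping values is an assumption):

```python
# Invented translation_metadata payload. The three top-level keys are
# documented; the source-to-target mapping shape is assumed.
import json

translation_metadata = {
    "target_lang": "hi",
    "translation": {
        "Hello world.\nGoodbye.": "नमस्ते दुनिया।\nअलविदा।",
    },
    "segmented_translation": {
        "Hello world.": "नमस्ते दुनिया।",
        "Goodbye.": "अलविदा।",
    },
}
print(json.dumps(translation_metadata, ensure_ascii=False, indent=2))
```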
7. MergeFaithScoresStage
When merge_scores=True, FAITH scores are merged into the existing translation_metadata JSON under faith_scores.
At that point, the final row contains:
- the original source columns
- translation runtime and error columns
- FAITH score columns
- `translation_metadata`
Skip/Restore Path
When skip_translated=True, the pipeline inserts two additional stages:
- `SkipExistingTranslationsStage`
- `RestoreSkippedRowsStage`
SkipExistingTranslationsStage removes rows that already have a non-empty translation column from the DataFrame and stores them temporarily in DocumentBatch._metadata["_skipped_rows_state"].
RestoreSkippedRowsStage restores those rows later, fills in any missing score/metadata columns with defaults, and sorts them back into the original row order.
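The skip/restore behavior can be modeled in a few lines of pandas (illustrative only; Curator stashes the skipped rows in task metadata rather than a local variable):

```python
# Toy model of skip/restore: set aside rows that already have a non-empty
# translation, process the rest, then restore the original row order.
import pandas as pd

df = pd.DataFrame({
    "id":              ["a", "b", "c"],
    "translated_text": ["", "pre-existing translation", ""],
})
mask = df["translated_text"].astype(bool)     # non-empty: already translated
skipped, active = df[mask], df[~mask]

active = active.assign(translated_text="new translation")  # pretend pipeline
restored = pd.concat([active, skipped]).sort_index()       # original order
print(restored["translated_text"].tolist())
```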
Quality Evaluation
FAITH Scoring
Enable FAITH scoring inside the translation pipeline when you want model-based adequacy checks on translated output:
FAITH scores are merged into the output when output_mode="raw" or output_mode="both".
Round-Trip Metrics
Backtranslation uses a second translation pass with reversed languages, followed by TextQualityMetricStage:
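A hedged sketch of the two passes (having the reverse pass read `translated_text` is an assumption about the wiring, and the scoring stage's parameters are not shown because they are not documented here):

```python
# Sketch of the round-trip (backtranslation) setup. The TranslationStage
# import path and language parameters come from this page; pipeline wiring
# and TextQualityMetricStage configuration are omitted.
from nemo_curator.stages.text.experimental.translation import TranslationStage

forward = TranslationStage(source_lang="en", target_lang="hi",
                           text_field="text")
reverse = TranslationStage(source_lang="hi", target_lang="en",
                           text_field="translated_text")
# TextQualityMetricStage then compares the back-translated text against the
# original using "sacrebleu", "chrf", or "ter".
```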
Supported metric types are:
- `sacrebleu`
- `chrf`
- `ter`
Backend Selection
Use backend_type to switch between translation backends:
- `llm`: OpenAI-compatible async client
- `google`: Google translation backend
- `aws`: AWS translation backend
- `nmt`: NMT service backend
For non-LLM backends, pass backend-specific settings through backend_config.
Notes
- The translation package is designed for pipeline execution. Avoid converting large datasets to pandas on the driver just to orchestrate translation.
- For structured inputs, wildcard paths and nested paths are first-class inputs to the library. You do not need to flatten records manually before calling `TranslationStage`.