Translation | NeMo Curator

Use NeMo Curator’s translation package to translate flat text fields or structured records, such as chat conversations stored under messages.*.content.

Translation is currently an experimental text stage. Import it from nemo_curator.stages.text.experimental.translation; APIs and output details may change while the workflow is being validated.

The experimental translation package is centered on TranslationStage, which composes segmentation, translation, reassembly, output formatting, and optional evaluation into a reusable text-processing stage.

Capabilities

Translate a single text field such as text, a nested field path such as metadata.body, or wildcard paths such as messages.*.content
Preserve machine-readable payloads, including valid JSON objects and arrays, instead of sending them to the translation model
Emit translated output in replaced, raw, or both modes
Emit segmented translation mappings for inspection or downstream evaluation
Run FAITH scoring on translated text with FaithEvalFilter
Score forward and reverse translation quality with TextQualityMetricStage

Before You Start

Install translation extras as needed:
- translation_common for basic translation support
- translation_metrics for TextQualityMetricStage
- translation_segmentation for segmentation_mode="fine"
- translation_aws, translation_google, or translation_nmt for those backends
- translation_all for the full translation feature set
For backend_type="llm", configure an OpenAI-compatible async client. Refer to LLM Client Setup.
If enable_faith_eval=True, configure an LLM client and scoring model even when translation itself uses a non-LLM backend.
For non-LLM backends, use one of the built-in backend types: google, aws, or nmt.
Input data is typically newline-delimited JSON with a text field or another field referenced through text_field.

Basic Translation Pipeline

The example below reads JSONL files, translates messages.*.content from English to Hindi, runs FAITH scoring, and writes the results back to JSONL.

1 import os
2 
3 from nemo_curator.models.client.llm_client import GenerationConfig
4 from nemo_curator.models.client.openai_client import AsyncOpenAIClient
5 from nemo_curator.pipeline import Pipeline
6 from nemo_curator.stages.text.experimental.translation import TranslationStage
7 from nemo_curator.stages.text.io.reader import JsonlReader
8 from nemo_curator.stages.text.io.writer import JsonlWriter
9 
10 client = AsyncOpenAIClient(
11     api_key=os.environ["NVIDIA_API_KEY"],
12     base_url="https://integrate.api.nvidia.com/v1",
13     max_concurrent_requests=8,
14 )
15 
16 pipeline = Pipeline(name="translate_chat_dataset")
17 pipeline.add_stage(JsonlReader(file_paths="input/*.jsonl"))
18 pipeline.add_stage(
19     TranslationStage(
20         client=client,
21         model_name="openai/gpt-oss-120b",
22         generation_config=GenerationConfig(max_tokens=2048),
23         source_lang="en",
24         target_lang="hi",
25         text_field="messages.*.content",
26         output_field="translated_text",
27         output_mode="both",
28         reconstruct_messages=True,
29         enable_faith_eval=True,
30         faith_threshold=2.5,
31     )
32 )
33 pipeline.add_stage(JsonlWriter(path="translated/"))
34 results = pipeline.run()

source_lang and target_lang are required. Curator does not assume default translation languages.

Structured Translation Behavior

Structured translation works directly on nested records.

Setting text_field="messages.*.content" extracts every message content string from the record
Valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim
When output_mode="replaced", translated values are written back into the original field path
When output_mode="raw" or output_mode="both", Curator emits translation_metadata with whole-text and segmented mappings

This makes the pipeline suitable for chat-style records where natural-language turns should be translated but tool payloads should remain untouched.

Segmentation and Output Control

TranslationStage exposes a few important controls:

segmentation_mode="coarse" keeps line-level splitting with code-block awareness
segmentation_mode="fine" uses sentence-level segmentation with structure preservation
min_segment_chars bypasses segmentation for short text
enable_faith_eval=True runs FAITH on exploded segment rows before reassembly, which avoids long-context scoring requests
reconstruct_messages=True rebuilds translated message lists for structured chat-style inputs

DocumentBatch Walkthrough

TranslationStage operates on a DocumentBatch, which is a task wrapper around a pandas DataFrame or Arrow table.

The wrapper fields stay mostly constant across stages:

task_id
dataset_name
_stage_perf
_metadata

The main thing that changes is the DataFrame inside DocumentBatch.data.

Worked Example

Assume a single input row and this pipeline configuration:

1 TranslationStage(
2     client=client,
3     model_name="openai/gpt-oss-120b",
4     text_field="text",
5     source_lang="en",
6     target_lang="hi",
7     segmentation_mode="coarse",
8     enable_faith_eval=True,
9     output_mode="raw",
10     merge_scores=True,
11 )

Input row:

id | text
7  | Explain grouped-query attention.
   | {"tool":"search","query":"GQA"}
   | ```python
   | print("hello")
   | ```
   | It reduces KV-cache memory.

1. SegmentationStage

The batch changes from one row per document to one row per translatable segment.

New columns:

_seg_segments
_seg_metadata
_seg_doc_id

Example output:

id | _seg_segments
7  | Explain grouped-query attention.
7  | It reduces KV-cache memory.

Important details:

Valid JSON payloads and fenced code blocks are not emitted as translatable segment rows.
_seg_doc_id ties all segment rows back to the same source document.
_seg_metadata is a JSON reconstruction template duplicated across the segment rows for that document.

For coarse segmentation, the metadata contains the original non-translated lines plus placeholders for translatable lines. For fine segmentation, it stores sentence-like units and their separators.

2. SegmentTranslationStage

The batch still has one row per segment, but each segment now gets its translated text and per-segment runtime/error data.

New columns:

_translated
_translation_time
_translation_error

Example output:

id | _seg_segments                    | _translated
7  | Explain grouped-query attention. | समूहित-क्वेरी अटेंशन समझाइए।
7  | It reduces KV-cache memory.      | यह KV-cache मेमोरी को कम करता है।

3. FaithEvalFilter

When enable_faith_eval=True, FAITH runs on the exploded segment rows before reassembly.

New columns:

faith_fluency
faith_accuracy
faith_idiomaticity
faith_terminology
faith_handling_of_format
faith_avg
faith_parse_failed

Each row is scored independently. Filtering does not happen yet, because dropping segment rows before reassembly would corrupt the reconstructed document.

4. ReassemblyStage

The batch collapses back to one row per original document by grouping on _seg_doc_id.

Removed internal columns:

_seg_segments
_seg_metadata
_seg_doc_id
_translated
_translation_time
_translation_error

Added document-level columns:

translated_text
translation_time
translation_errors
_translation_map
_segmented_translation_map
faith_segment_scores when FAITH is enabled
aggregated faith_* columns when FAITH is enabled

Example output:

id | translated_text
7  | समूहित-क्वेरी अटेंशन समझाइए।
   | {"tool":"search","query":"GQA"}
   | ```python
   | print("hello")
   | ```
   | यह KV-cache मेमोरी को कम करता है।

Important details:

translation_time is the sum of the segment-level translation times.
translation_errors joins any non-empty segment errors.
_translation_map and _segmented_translation_map are helper columns used later to build translation_metadata.
When FAITH is enabled, reassembly averages the per-segment FAITH scores into document-level faith_* columns and writes the raw per-segment list to faith_segment_scores.
For structured fields such as messages.*.content, reassembly writes translations back into the nested structure instead of only returning a flat string.

5. FaithThresholdFilterStage

When FAITH filtering is enabled, the threshold filter runs after reassembly on the aggregated FAITH score.

Rows with faith_avg < faith_threshold are dropped here. Rows with parse failures or no scored segments are preserved.

6. FormatTranslationOutputStage

When output_mode="raw" or output_mode="both", this stage builds:

translation_metadata

translation_metadata contains:

target_lang
translation
segmented_translation

When output_mode="raw", the final translated text column is dropped after metadata is constructed. This is useful when downstream consumers want metadata-rich output without keeping a separate top-level translated text column.

This stage also drops internal helper columns such as:

_translation_map
_segmented_translation_map

7. MergeFaithScoresStage

When merge_scores=True, FAITH scores are merged into the existing translation_metadata JSON under faith_scores.

At that point, the final row contains:

the original source columns
translation runtime and error columns
FAITH score columns
translation_metadata

Skip/Restore Path

When skip_translated=True, the pipeline inserts two additional stages:

SkipExistingTranslationsStage
RestoreSkippedRowsStage

SkipExistingTranslationsStage removes rows that already have a non-empty translation column from the DataFrame and stores them temporarily in DocumentBatch._metadata["_skipped_rows_state"].

RestoreSkippedRowsStage restores those rows later, fills in any missing score/metadata columns with defaults, and sorts them back into the original row order.

Quality Evaluation

FAITH Scoring

Enable FAITH scoring inside the translation pipeline when you want model-based adequacy checks on translated output:

1 TranslationStage(
2     client=client,
3     model_name="openai/gpt-oss-120b",
4     source_lang="en",
5     target_lang="de",
6     text_field="text",
7     enable_faith_eval=True,
8     faith_threshold=2.5,
9 )

FAITH scores are merged into the output when output_mode="raw" or output_mode="both".

Round-Trip Metrics

Backtranslation uses a second translation pass with reversed languages, followed by TextQualityMetricStage:

1 from nemo_curator.stages.text.experimental.translation import TextQualityMetricStage, TranslationStage
2 
3 pipeline.add_stage(
4     TranslationStage(
5         client=client,
6         model_name="openai/gpt-oss-120b",
7         source_lang="hi",
8         target_lang="en",
9         text_field="translated_text",
10         output_field="backtranslated_text",
11         output_mode="both",
12     )
13 )
14 pipeline.add_stage(
15     TextQualityMetricStage(
16         reference_text_field="text",
17         hypothesis_text_field="backtranslated_text",
18         metrics=[
19             {"type": "sacrebleu", "threshold": 20.0},
20             {"type": "chrf", "threshold": 40.0},
21         ],
22     )
23 )

Supported metric types are:

sacrebleu
chrf
ter

Backend Selection

Use backend_type to switch between translation backends:

llm: OpenAI-compatible async client
google: Google translation backend
aws: AWS translation backend
nmt: NMT service backend

For non-LLM backends, pass backend-specific settings through backend_config.

Notes

The translation package is designed for pipeline execution. Avoid converting large datasets to pandas on the driver just to orchestrate translation.
For structured inputs, wildcard paths and nested paths are first-class inputs to the library. You do not need to flatten records manually before calling TranslationStage.