Translation YAML Reference#

The translate/nemo_curator step ships src/nemotron/steps/translate/nemo_curator/config/default.yaml as the canonical starter profile. This page lists top-level keys you can override with nemotron steps run translate/nemo_curator key=value dotlists, grouped by concern, with the full baseline file inlined below.

Default Configuration File#

# Starter config for NeMo Curator corpus translation.

run:
  env:
    mounts:
      - ${auto_mount:git+https://github.com/NVIDIA-NeMo/Curator.git@d10cd6ffe9f5ac4cbb176d7b3ada698f22633aea,/opt/Curator}

input_path: /path/to/filtered_data.jsonl
output_dir: ./output/translated

# Required. Ask the user; do not infer silently.
source_language: ???
target_language: ???

input_format: auto        # auto | jsonl | parquet
output_format: jsonl      # jsonl | parquet
backend: llm              # llm | nmt | google | aws

text_field: messages.*.content
output_field: translated_text
translation_column: translated_text
output_mode: both         # replaced | raw | both
merge_scores: true
reconstruct_messages: true
messages_field: messages
messages_content_field: content

segmentation_mode: coarse # coarse | fine
min_segment_chars: 0
max_concurrent_requests: 64
generation_config: null   # Optional OpenAI-compatible translation generation settings.
skip_translated: false
files_per_partition: null
blocksize: null

server:
  url: https://integrate.api.nvidia.com/v1
  model: ""              # Required for backend=llm and used by FAITH unless faith_eval.model_name is set.
  api_key_env: NVIDIA_API_KEY
  api_key: ""

faith_eval:
  enabled: true
  threshold: 2.5
  model_name: ""
  filter_enabled: true
  max_concurrent_requests: 64
  generation_config:
    max_tokens: 2048
    temperature: 0.0

nmt:
  server_url: http://localhost:5000
  batch_size: 32
  timeout: 120
  max_concurrent_requests: 32

google:
  project_id: ""
  location: global
  api_version: v2
  max_concurrent_requests: 32

aws:
  region: us-east-2
  max_concurrent_requests: 32

Keys Grouped by Concern#

Paths and Formats#

Key

Description

input_path

File, glob, or homogeneous directory consumed by JsonlReader or ParquetReader.

output_dir

Directory passed to JsonlWriter or ParquetWriter in overwrite mode.

input_format

auto, jsonl, or parquet.

output_format

jsonl or parquet.

Languages and Backend#

Key

Description

source_language / target_language

Required ISO 639-1 codes. Empty placeholders remind operators to set values explicitly.

backend

llm, nmt, google, or aws.

Translation Semantics#

Key

Description

text_field

Dot or wildcard path describing strings to translate. The default is messages.*.content.

output_field, translation_column

Destination columns for translated text and downstream merges.

output_mode

replaced, raw, or both.

merge_scores

Attach FAITH outputs adjacent to translations when enabled.

reconstruct_messages, messages_field, messages_content_field

Chat reconstruction switches.

segmentation_mode, min_segment_chars

Segmenter behavior. Values include coarse and fine.

max_concurrent_requests, skip_translated, files_per_partition, blocksize

Throughput and partitioning controls surfaced to Curator readers and clients.

LLM Fields#

Used whenever backend=llm or FAITH needs an OpenAI-compatible judge.

Key

Description

server.url

Chat-completions compatible base URL.

server.model

Model identifier. Required for llm translation and for FAITH unless you override the scorer model.

server.api_key_env

Environment variable housing the API secret. The default is NVIDIA_API_KEY.

server.api_key

Inline secret. Discouraged for shared repositories.

FAITH Evaluation#

Key

Description

enabled

Turns FAITH scoring on. The starter YAML sets this to true.

threshold

Minimum acceptable faith_avg on a one-to-five scale. The starter default 2.5 is a permissive noisy-data floor. See FAITH Evaluation Inside Translation for the full rubric.

model_name

Optional scorer-only model. Defaults to server.model.

filter_enabled

Drop failing rows when true.

max_concurrent_requests

Optional scorer-side concurrency limit.

generation_config

Optional OpenAI-compatible generation settings for the scorer.

Backend-Specific Blocks#

Block

When needed

nmt

HTTP microservice URL, batching, timeouts.

google

Project metadata and API version. Version v3 requires project_id.

aws

Region plus concurrency limits.

Overrides#

OmegaConf dotlists merge last:

uv run nemotron steps run translate/nemo_curator -c default \
  backend=nmt \
  nmt.server_url=http://localhost:5000 \
  faith_eval.enabled=false \
  input_path=/data/chat.jsonl \
  output_dir=/data/out \
  source_language=en \
  target_language=hi