Translation YAML Reference#

The translate/nemo_curator step ships src/nemotron/steps/translate/nemo_curator/config/default.yaml as the canonical starter profile. This page lists top-level keys you can override with nemotron steps run translate/nemo_curator key=value dotlists, grouped by concern, with the full baseline file inlined below.

Default Configuration File#

# Starter config for NeMo Curator corpus translation.

run:
  env:
    mounts:
      - ${auto_mount:git+https://github.com/NVIDIA-NeMo/Curator.git@d10cd6ffe9f5ac4cbb176d7b3ada698f22633aea,/opt/Curator}

input_path: /path/to/filtered_data.jsonl
output_dir: ./output/translated

# Required. Ask the user; do not infer silently.
source_language: ???
target_language: ???

input_format: auto        # auto | jsonl | parquet
output_format: jsonl      # jsonl | parquet
backend: llm              # llm | nmt | google | aws

text_field: messages.*.content
output_field: translated_text
translation_column: translated_text
output_mode: both         # replaced | raw | both
merge_scores: true
reconstruct_messages: true
messages_field: messages
messages_content_field: content

segmentation_mode: coarse # coarse | fine
min_segment_chars: 0
translation_prompt_path: null # Optional absolute local YAML prompt path.
max_concurrent_requests: 64
health_check: true
dry_run: false
dry_run_log_count: 5
generation_config: null   # Optional OpenAI-compatible translation generation settings.
skip_translated: false
files_per_partition: null
blocksize: null

server:
  url: https://integrate.api.nvidia.com/v1
  model: ""              # Required for backend=llm and used by FAITH unless faith_eval.model_name is set.
  api_key_env: NVIDIA_API_KEY
  api_key: ""

faith_eval:
  enabled: true
  threshold: 2.5
  model_name: ""
  filter_enabled: true
  prompt_path: null # Optional absolute local YAML prompt path.
  max_concurrent_requests: 64
  generation_config:
    max_tokens: 2048
    temperature: 0.0

nmt:
  server_url: http://localhost:5000
  batch_size: 32
  timeout: 120
  max_concurrent_requests: 32

google:
  project_id: ""
  location: global
  api_version: v2
  max_concurrent_requests: 32

aws:
  region: us-east-2
  max_concurrent_requests: 32

Keys Grouped by Concern#

Paths and Formats#

Key	Description
`input_path`	File, glob, or homogeneous directory consumed by `JsonlReader` or `ParquetReader`.
`output_dir`	Directory passed to `JsonlWriter` or `ParquetWriter` in overwrite mode.
`input_format`	`auto`, `jsonl`, or `parquet`.
`output_format`	`jsonl` or `parquet`.

Languages and Backend#

Key	Description
`source_language` / `target_language`	Required ISO 639-1 codes. Empty placeholders remind operators to set values explicitly.
`backend`	`llm`, `nmt`, `google`, or `aws`.

Translation Semantics#

Key	Description
`text_field`	Dot or wildcard path describing strings to translate. The default is `messages.*.content`.
`output_field`, `translation_column`	Destination columns for translated text and downstream merges.
`output_mode`	`replaced`, `raw`, or `both`.
`merge_scores`	Attach FAITH outputs adjacent to translations when enabled.
`reconstruct_messages`, `messages_field`, `messages_content_field`	Chat reconstruction switches.
`segmentation_mode`, `min_segment_chars`	Segmenter behavior. Values include `coarse` and `fine`.
`max_concurrent_requests`, `skip_translated`, `files_per_partition`, `blocksize`	Throughput and partitioning controls surfaced to Curator readers and clients.

LLM Fields#

Used whenever backend=llm or FAITH needs an OpenAI-compatible judge.

Key	Description
`server.url`	Chat-completions compatible base URL.
`server.model`	Model identifier. Required for `llm` translation and for FAITH unless you override the scorer model.
`server.api_key_env`	Environment variable housing the API secret. The default is `NVIDIA_API_KEY`.
`server.api_key`	Inline secret. Discouraged for shared repositories.

FAITH Evaluation#

Key	Description
`enabled`	Turns FAITH scoring on. The starter YAML sets this to `true`.
`threshold`	Minimum acceptable `faith_avg` on a one-to-five scale. The starter default `2.5` is a permissive noisy-data floor. See FAITH Evaluation Inside Translation for the full rubric.
`model_name`	Optional scorer-only model. Defaults to `server.model`.
`filter_enabled`	Drop failing rows when `true`.
`max_concurrent_requests`	Optional scorer-side concurrency limit.
`generation_config`	Optional OpenAI-compatible generation settings for the scorer.

Backend-Specific Blocks#

Block	When needed
`nmt`	HTTP microservice URL, batching, timeouts.
`google`	Project metadata and API version. Version `v3` requires `project_id`.
`aws`	Region plus concurrency limits.

Overrides#

OmegaConf dotlists merge last:

uv run nemotron steps run translate/nemo_curator -c default \
  backend=nmt \
  nmt.server_url=http://localhost:5000 \
  faith_eval.enabled=false \
  input_path=/data/chat.jsonl \
  output_dir=/data/out \
  source_language=en \
  target_language=hi