Translation YAML Reference#
The translate/nemo_curator step ships src/nemotron/steps/translate/nemo_curator/config/default.yaml as the canonical starter profile. This page lists top-level keys you can override with nemotron steps run translate/nemo_curator key=value dotlists, grouped by concern, with the full baseline file inlined below.
Default Configuration File#
# Starter config for NeMo Curator corpus translation.
run:
env:
mounts:
- ${auto_mount:git+https://github.com/NVIDIA-NeMo/Curator.git@d10cd6ffe9f5ac4cbb176d7b3ada698f22633aea,/opt/Curator}
input_path: /path/to/filtered_data.jsonl
output_dir: ./output/translated
# Required. Ask the user; do not infer silently.
source_language: ???
target_language: ???
input_format: auto # auto | jsonl | parquet
output_format: jsonl # jsonl | parquet
backend: llm # llm | nmt | google | aws
text_field: messages.*.content
output_field: translated_text
translation_column: translated_text
output_mode: both # replaced | raw | both
merge_scores: true
reconstruct_messages: true
messages_field: messages
messages_content_field: content
segmentation_mode: coarse # coarse | fine
min_segment_chars: 0
max_concurrent_requests: 64
generation_config: null # Optional OpenAI-compatible translation generation settings.
skip_translated: false
files_per_partition: null
blocksize: null
server:
url: https://integrate.api.nvidia.com/v1
model: "" # Required for backend=llm and used by FAITH unless faith_eval.model_name is set.
api_key_env: NVIDIA_API_KEY
api_key: ""
faith_eval:
enabled: true
threshold: 2.5
model_name: ""
filter_enabled: true
max_concurrent_requests: 64
generation_config:
max_tokens: 2048
temperature: 0.0
nmt:
server_url: http://localhost:5000
batch_size: 32
timeout: 120
max_concurrent_requests: 32
google:
project_id: ""
location: global
api_version: v2
max_concurrent_requests: 32
aws:
region: us-east-2
max_concurrent_requests: 32
Keys Grouped by Concern#
Paths and Formats#
Key |
Description |
|---|---|
|
File, glob, or homogeneous directory consumed by |
|
Directory passed to |
|
|
|
|
Languages and Backend#
Key |
Description |
|---|---|
|
Required ISO 639-1 codes. Empty placeholders remind operators to set values explicitly. |
|
|
Translation Semantics#
Key |
Description |
|---|---|
|
Dot or wildcard path describing strings to translate. The default is |
|
Destination columns for translated text and downstream merges. |
|
|
|
Attach FAITH outputs adjacent to translations when enabled. |
|
Chat reconstruction switches. |
|
Segmenter behavior. Values include |
|
Throughput and partitioning controls surfaced to Curator readers and clients. |
LLM Fields#
Used whenever backend=llm or FAITH needs an OpenAI-compatible judge.
Key |
Description |
|---|---|
|
Chat-completions compatible base URL. |
|
Model identifier. Required for |
|
Environment variable housing the API secret. The default is |
|
Inline secret. Discouraged for shared repositories. |
FAITH Evaluation#
Key |
Description |
|---|---|
|
Turns FAITH scoring on. The starter YAML sets this to |
|
Minimum acceptable |
|
Optional scorer-only model. Defaults to |
|
Drop failing rows when |
|
Optional scorer-side concurrency limit. |
|
Optional OpenAI-compatible generation settings for the scorer. |
Backend-Specific Blocks#
Block |
When needed |
|---|---|
|
HTTP microservice URL, batching, timeouts. |
|
Project metadata and API version. Version |
|
Region plus concurrency limits. |
Overrides#
OmegaConf dotlists merge last:
uv run nemotron steps run translate/nemo_curator -c default \
backend=nmt \
nmt.server_url=http://localhost:5000 \
faith_eval.enabled=false \
input_path=/data/chat.jsonl \
output_dir=/data/out \
source_language=en \
target_language=hi