Translation

Learn how to take an existing benchmark.parquet from generation, translate it to a target locale, score translation quality with backtranslation metrics, and export a translated benchmark.parquet.

The field names, defaults, and validation rules are listed in Translation Configuration Reference. Artifact paths are summarized in Output Files.

What You Configure

Control	What you set
`dataset_path`	Absolute or workspace path to the source `benchmark.parquet` you want translated.
`output_dir` / `expt_name`	Where caches and the translated `benchmark.parquet` are written.
`source_language` / `target_language`	BCP-47 style tags, for example `en-US` and `hi-IN`.
`translation_model_config`	Curator experimental translation block: `backend_type`, `params` (model, provider, credentials, `inference_parameters`), plus optional `stage` and `segment_stage` maps.
`backtranslation_quality_metrics`	List of `{type, threshold}` entries. Each `type` must be `sacrebleu`, `chrf`, or `ter`; each `threshold` must be nonnegative. Keep at least one entry so scoring runs; outputs land in `quality_metrics.parquet` with per-metric scores and `is_quality_metric_passed`.
`remove_low_quality`	When `true` (the default if you omit the key), rows that fail the aggregate quality gate are dropped before the final export. When `false`, every row is kept so you can inspect scores first.

Do not set translation_model_config.stage.enable_faith_eval to true. Translation relies on backtranslation metrics instead of FAITH.
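The rules in the table and the note above can be checked before launching a run. The following is a minimal sketch, not the pipeline's own validator: the function name `check_translate_config` is hypothetical, and it assumes the YAML has already been loaded into a plain Python dict (for example with a YAML parser of your choice).

```python
# Hypothetical pre-flight check; the pipeline's real validation may differ.
ALLOWED_METRIC_TYPES = {"sacrebleu", "chrf", "ter"}


def check_translate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    metrics = config.get("backtranslation_quality_metrics") or []
    if not metrics:
        problems.append("backtranslation_quality_metrics is empty; no scoring would run")

    for i, entry in enumerate(metrics):
        metric_type = entry.get("type")
        threshold = entry.get("threshold")
        if metric_type not in ALLOWED_METRIC_TYPES:
            problems.append(f"entry {i}: unknown type {metric_type!r}")
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
        if not is_number or threshold < 0:
            problems.append(f"entry {i}: threshold must be a nonnegative number")

    stage = (config.get("translation_model_config") or {}).get("stage") or {}
    if stage.get("enable_faith_eval") is True:
        problems.append("translation_model_config.stage.enable_faith_eval must not be true")

    return problems
```

Calling `check_translate_config(loaded_yaml)` and printing any returned messages gives a quick sanity pass before the stage spends time on model calls.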

Running the Translate Stage

Pass stage=translate unless your YAML sets a top-level stage key. The CLI requires an explicit stage when that key is absent.

uv run nemotron steps run byob/mcq -c translate stage=translate

Tune Quality Gates

The backtranslation_quality_metrics field is the only place you define automatic pass or fail rules for backtranslation checks. Add or remove list entries to change which scores are computed, and adjust threshold values when you want stricter or looser gates.

After a run, open quality_metrics.parquet under output_dir/expt_name/stage_cache/ to read per-metric score columns and is_quality_metric_passed before you change YAML again.
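One way to review the gate before editing thresholds is to count how many rows passed. The sketch below is illustrative only: the helper names are hypothetical, it relies solely on the `is_quality_metric_passed` column named on this page (the per-metric score column names are not listed here, so they are not used), and it assumes pandas with a Parquet engine such as pyarrow is installed.

```python
def summarize_quality(rows):
    """Count passing and failing rows from quality_metrics records."""
    total = len(rows)
    passed = sum(1 for row in rows if row["is_quality_metric_passed"])
    return {"total": total, "passed": passed, "failed": total - passed}


def load_quality_rows(parquet_path):
    """Read quality_metrics.parquet into a list of dicts, one per row."""
    # Assumption: pandas and a Parquet engine (e.g. pyarrow) are available.
    import pandas as pd

    return pd.read_parquet(parquet_path).to_dict("records")
```

For example, `summarize_quality(load_quality_rows(path))` with `path` pointing at the `quality_metrics.parquet` file gives a quick pass/fail tally; a high failure count suggests the thresholds or the model settings need another look.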

Final Filtering Control

remove_low_quality decides whether failing rows disappear from the exported benchmark.parquet.

remove_low_quality: true   # omit to get the same default

remove_low_quality: false  # keep failing rows; filter manually using Parquet columns

Reference Layout

The sample src/nemotron/steps/byob/mcq/config/translate.yaml file shows a complete translation_model_config with backend_type: llm, NVIDIA provider parameters, and stage / segment_stage tuning. Copy that structure, then swap model IDs, concurrency, and language tags for your workload.

The YAML below mirrors the sample configuration, including remove_low_quality: false, so rows that fail the aggregate quality gate remain in benchmark.parquet and you can inspect stage_cache/quality_metrics.parquet while you tune thresholds.

When you omit remove_low_quality or set it to true, failing rows are dropped before export.

expt_name: byob_mcq_translation
dataset_path: /path/to/benchmark.parquet
output_dir: /path/to/outputs
source_language: en-US
target_language: hi-IN

translation_model_config:
  backend_type: llm
  params:
    alias: gpt-oss-120b
    model: openai/gpt-oss-120b
    provider: nvidia
    api_key_env: NGC_API_KEY
    inference_parameters:
      max_tokens: 16000
      max_parallel_requests: 8
      temperature: 0.0
      top_p: 0.95
  stage:
    segmentation_mode: coarse
    min_segment_chars: 0
    output_mode: both
  segment_stage:
    health_check: true
    max_concurrent_requests: 8

backtranslation_quality_metrics:
  - type: sacrebleu
    threshold: 25
  - type: chrf
    threshold: 50
  - type: ter
    threshold: 50

remove_low_quality: false

Directory Structure

The translation stage writes intermediate Parquet files to <output_dir>/<expt_name>/stage_cache/ as translated_questions.parquet, backtranslated_questions.parquet, and quality_metrics.parquet. It then writes benchmark_raw.parquet and the renamed benchmark.parquet to the experiment root. Use the intermediate files to debug language mix-ups, threshold misses, or model refusals before you change configuration again.
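The layout described above can be expressed as a small path helper, which is handy for checking which artifacts a run actually produced. This is a sketch under two assumptions: that `stage_cache` is a literal directory name (as in the quality-gate section) and that the experiment root is `<output_dir>/<expt_name>`. The function names are hypothetical.

```python
from pathlib import Path


def expected_artifacts(output_dir, expt_name):
    """Map artifact names to the paths where this page says they are written."""
    root = Path(output_dir) / expt_name  # assumed experiment root
    cache = root / "stage_cache"         # assumed literal directory name
    return {
        "translated_questions": cache / "translated_questions.parquet",
        "backtranslated_questions": cache / "backtranslated_questions.parquet",
        "quality_metrics": cache / "quality_metrics.parquet",
        "benchmark_raw": root / "benchmark_raw.parquet",
        "benchmark": root / "benchmark.parquet",
    }


def missing_artifacts(output_dir, expt_name):
    """Return the names of expected artifacts that do not exist on disk."""
    artifacts = expected_artifacts(output_dir, expt_name)
    return [name for name, path in artifacts.items() if not path.exists()]
```

If `missing_artifacts(...)` reports the intermediate files but not the final ones, the run likely stopped partway, and the most recent intermediate file is the place to start debugging.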