FAITH Evaluation Inside Translation#
This page explains how optional FAITH scoring behaves when faith_eval.enabled is true inside nemotron steps run translate/nemo_curator.
FAITH stands for the five quality dimensions the judge scores against each translated segment: Fluency, Accuracy, Idiomaticity, Terminology, and Handling of Format.
FAITH runs in the same TranslationStage invocation as translation. There is no separate CLI command that runs FAITH scoring on its own.
What FAITH Adds#
FAITH scores translation quality segment-by-segment using a large language model (LLM) judge configured alongside your translation backend:
- faith_eval.threshold defines the minimum acceptable average score. The starter default is 2.5, which you should tune per model. See the next section for what the scale means.
- FAITH scoring follows the translated segment pairs produced by Curator’s translation stage for long inputs.
- faith_eval.filter_enabled drops failing rows when true, which lets you keep high-confidence shards only.
Score Scale#
The FAITH judge scores each of the five dimensions on a one-to-five scale, where one is poor and five is excellent.
The full per-dimension rubric lives in the upstream NeMo Curator prompt at nemo_curator/stages/text/experimental/translation/prompts/faith_eval.yaml.
The Fluency band quoted below is representative; the other four dimensions use the same one-to-five shape with dimension-specific wording.
1. **Fluency (1-5)**: Does the translation read naturally in the target language, free from grammar or syntax errors?
- 1: Very poor fluency, difficult to understand.
- 2: Somewhat fluent but with major grammatical issues.
- 3: Generally fluent with a few errors.
- 4: Mostly fluent but may have minor grammatical issues.
- 5: Perfect grammar, native-like fluency.
The judge also emits two sentinel values that sit outside the one-to-five scale, quoted here from the same prompt:
In case there is no translation provided, give -1 to all the categories!
If case of non-applicable score, make the score=0
A score of 0 means the dimension does not apply to the row, for example Terminology on a translation that contains no specialized terms.
A score of -1 means the judge received no translation to evaluate.
The faith_avg column is the mean of the dimensions that scored above zero.
Dimensions marked 0 for “not applicable” are excluded from the average, so a translation with no specialized terminology can still earn a perfect faith_avg of 5.0.
If every dimension is 0 or -1, faith_avg is 0.0.
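The averaging rule above can be sketched in a few lines of Python. This is an illustration of the rule as described on this page, not the stage's actual implementation. The dictionary keys used for the five dimensions are hypothetical names chosen for readability.

```python
from typing import Mapping


def faith_avg(scores: Mapping[str, float]) -> float:
    """Average the dimensions that scored above zero.

    Sentinels are skipped: 0 means "not applicable" and -1 means
    "no translation provided". If nothing scored above zero, return 0.0.
    """
    applicable = [value for value in scores.values() if value > 0]
    if not applicable:
        return 0.0
    return sum(applicable) / len(applicable)


# Hypothetical key names; the real column names may differ.
no_terminology = {
    "fluency": 5,
    "accuracy": 5,
    "idiomaticity": 5,
    "terminology": 0,  # not applicable, excluded from the mean
    "handling_of_format": 5,
}
missing_translation = {name: -1 for name in no_terminology}

print(faith_avg(no_terminology))       # 5.0
print(faith_avg(missing_translation))  # 0.0
```

Skipping the sentinels rather than averaging them in is what keeps a "not applicable" dimension from dragging down an otherwise strong row.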
Filtering keeps a row when faith_avg >= faith_eval.threshold, with parse failures preserved so reviewers can audit them.
The starter default 2.5 sits between band two, “major grammatical issues,” and band three, “generally fluent with a few errors.”
Treat 2.5 as a noisy-data floor rather than a quality bar.
Raise the threshold when you want a tighter quality gate, for example to 3.5 when you are building a high-confidence parallel corpus.
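To see how the threshold changes what survives, here is a minimal Python sketch of the keep rule described in this section, as it would apply when faith_eval.filter_enabled is true. It assumes a parse failure is represented as a missing score (None); the page does not say how the stage actually marks parse failures, and the row layout is hypothetical.

```python
from typing import List, Optional

DEFAULT_THRESHOLD = 2.5  # starter default quoted on this page


def keep_row(faith_avg: Optional[float], threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Keep a row when faith_avg >= threshold.

    Assumption: None stands in for a parse failure, which is
    preserved so reviewers can audit it.
    """
    if faith_avg is None:
        return True
    return faith_avg >= threshold


def filter_rows(rows: List[dict], threshold: float) -> List[dict]:
    """Return only the rows that pass the keep rule."""
    return [row for row in rows if keep_row(row["faith_avg"], threshold)]


# Hypothetical rows for illustration only.
rows = [
    {"id": "a", "faith_avg": 4.2},
    {"id": "b", "faith_avg": 3.0},
    {"id": "c", "faith_avg": 2.4},
    {"id": "d", "faith_avg": None},  # parse failure, always kept
]

noisy_floor = [row["id"] for row in filter_rows(rows, 2.5)]
strict_gate = [row["id"] for row in filter_rows(rows, 3.5)]
print(noisy_floor)  # ['a', 'b', 'd']
print(strict_gate)  # ['a', 'd']
```

Moving from 2.5 to 3.5 drops row "b", which scored in the "generally fluent with a few errors" range, while the parse failure is kept under both settings.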
Why an LLM Client Is Always Required#
Whenever faith_eval.enabled is true, the stage instantiates an OpenAI-compatible client using server.url, server.api_key or server.api_key_env, and faith_eval.model_name. If you omit faith_eval.model_name, the stage falls back to server.model.
Even if backend is nmt, google, or aws, FAITH still issues LLM calls. Plan keys and quotas accordingly.
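The fallback behavior described above can be sketched as follows. This is a hedged illustration, not the stage's code: the helper name resolve_judge_settings is hypothetical, the settings are flattened into a dict keyed by the dotted names used on this page, and the precedence of server.api_key over server.api_key_env is an assumption. The URL, environment variable name, and model name are placeholders.

```python
import os
from typing import Mapping, Optional


def resolve_judge_settings(
    settings: Mapping[str, object],
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Resolve the values an OpenAI-compatible judge client would need."""
    # Assumption: an inline key wins; otherwise read the named env variable.
    api_key: Optional[object] = settings.get("server.api_key")
    env_name = settings.get("server.api_key_env")
    if api_key is None and env_name:
        api_key = env.get(str(env_name))

    # If faith_eval.model_name is omitted, fall back to server.model.
    model = settings.get("faith_eval.model_name") or settings["server.model"]

    return {"base_url": settings["server.url"], "api_key": api_key, "model": model}


# Placeholder values for illustration only.
settings = {
    "backend": "nmt",  # a non-LLM translation backend
    "faith_eval.enabled": True,
    "server.url": "http://localhost:8000/v1",
    "server.api_key_env": "JUDGE_API_KEY",
    "server.model": "example-judge-model",
}

resolved = resolve_judge_settings(settings, env={"JUDGE_API_KEY": "secret"})
print(resolved["model"])  # example-judge-model
```

Note that the backend value plays no part in the resolution: with faith_eval.enabled set to true, the server settings are needed even when translation itself does not use an LLM.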
Merge Semantics#
merge_scores: true is the default. It attaches FAITH outputs alongside translated columns so reviewers can audit scores without losing structured chat payloads.