nemo_curator.stages.text.experimental.translation.evaluation.faith

FAITH-based translation quality scoring and optional filtering.

Module Contents

Classes

Name                       Description
FaithEvalFilter            LLM-based translation quality scorer using the FAITH metric.
FaithThresholdFilterStage  Filter document rows using precomputed FAITH scores.

Functions

Name                       Description
_find_json_object_end      Return the balanced object end index starting at start, or -1.
_find_json_object_start    Return the first { outside a JSON string, or -1.
_to_mutable_dataframe      Return a DataFrame safe to mutate in-place for stage-local work.
_update_json_string_state  Return updated JSON string state and whether ch was consumed by it.

Data

FAITH_KEYS

_SCORE_COLUMNS

API

class nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter(
name: str = 'FaithEvalFilter',
source_lang: str,
target_lang: str,
model_name: str,
client: nemo_curator.models.client.llm_client.AsyncLLMClient | None = None,
source_text_field: str = 'text',
translated_text_field: str = 'translated_text',
threshold: float = 2.5,
filter_enabled: bool = True,
generation_config: nemo_curator.models.client.llm_client.GenerationConfig | None = None,
max_concurrent_requests: int = 64
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

LLM-based translation quality scorer using the FAITH metric.

For each row in the incoming DocumentBatch, this stage:

  1. Formats a FAITH evaluation prompt with source and translated text.
  2. Calls the LLM via AsyncLLMClient to obtain a JSON score response.
  3. Parses the response for 5 FAITH dimension scores.
  4. Computes faith_avg (mean of the non-zero dimension scores; see _compute_faith_avg).
  5. Optionally drops rows where faith_avg < threshold (when filter_enabled=True), preserving rows whose response could not be parsed.

Parameters

client : AsyncLLMClient | None
    Async LLM client for scoring. Must not be None.
model_name : str
    LLM model identifier to use for scoring.
source_lang : str
    ISO 639-1 code of the source language (e.g. "en").
target_lang : str
    ISO 639-1 code of the target language (e.g. "zh").
source_text_field : str
    Column name containing the original source text.
translated_text_field : str
    Column name containing the translated text.
threshold : float
    Minimum faith_avg score to keep a row. Rows below this are dropped (only when filter_enabled=True).
filter_enabled : bool
    When True (default), rows with faith_avg < threshold are dropped. When False, all rows are kept with their scores attached, enabling downstream score analysis before committing to a threshold.
generation_config : GenerationConfig | None
    LLM generation parameters. Defaults to temperature=0.0, max_tokens=256.

_initialized: bool = field(init=False, repr=False, default=False)
_system_prompt: str = field(init=False, repr=False, default='')
_user_template: str = field(init=False, repr=False, default='')
client: AsyncLLMClient | None = None
filter_enabled: bool = True
generation_config: GenerationConfig | None = None
max_concurrent_requests: int = 64
model_name: str
name: str = 'FaithEvalFilter'
source_lang: str
source_text_field: str = 'text'
target_lang: str
threshold: float = 2.5
translated_text_field: str = 'translated_text'
nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter.__post_init__() -> None
nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._attach_score_columns(
df: pandas.DataFrame,
all_scores: list[dict],
parse_failed_flags: list[bool]
) -> None

Write parsed FAITH scores back onto the DataFrame.

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._build_messages(
source_text: str,
translated_text: str
) -> list[dict]

Build the chat messages for a single FAITH evaluation request.

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._compute_faith_avg(
scores: dict
) -> float
staticmethod

Compute faith_avg as the mean of non-zero per-dimension scores.

Follows the “zero means not applicable” convention: dimensions scored as 0.0 are excluded from the average. If every dimension is zero, returns 0.0.

Parameters

scores : dict
    Dict keyed by FAITH_KEYS (missing keys treated as 0).
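The "zero means not applicable" averaging rule can be illustrated with a minimal sketch. The function name compute_faith_avg and the sample score dicts are illustrative; only FAITH_KEYS and the rule itself come from this page.

```python
FAITH_KEYS = ["Fluency", "Accuracy", "Idiomaticity", "Terminology", "Handling_of_Format"]


def compute_faith_avg(scores: dict) -> float:
    """Mean of the non-zero FAITH dimension scores; 0.0 if every score is zero."""
    # Missing keys are treated as 0.0, i.e. "not applicable".
    values = [float(scores.get(key, 0.0)) for key in FAITH_KEYS]
    applicable = [v for v in values if v != 0.0]
    if not applicable:
        return 0.0
    return sum(applicable) / len(applicable)


# Illustrative inputs (not taken from real model output).
full = {"Fluency": 4, "Accuracy": 5, "Idiomaticity": 3, "Terminology": 4, "Handling_of_Format": 4}
partial = {"Fluency": 4, "Accuracy": 2, "Handling_of_Format": 0}

print(compute_faith_avg(full))     # all five dimensions contribute
print(compute_faith_avg(partial))  # only Fluency and Accuracy contribute
```

Excluding zeros keeps a dimension that does not apply from dragging the average down, at the cost that a genuinely terrible score must be reported as a non-zero value to count.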

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._extract_json_object(
text: str
) -> str | None
staticmethod

Find and return the first balanced {...} JSON object in text.

Walks the string counting {/} pairs, respecting string literals so that braces inside quoted strings do not affect the balance and do not anchor the scan. For example, in 'message: "{pre}" scores: {"Fluency": 4}' the first { lives inside a string literal and must be ignored; the real object starts at the second {.

Supports nested objects (e.g. {"scores": {"Fluency": 4, ...}}).

Returns: str | None

Substring from the first real { to its matching }, or None if no balanced object is found.
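A string-aware brace scan of the kind described above might look like the following single-pass sketch. It is not the module's implementation (which, per the function list, splits the work across _find_json_object_start and _find_json_object_end); the name extract_json_object and the internal variables are illustrative.

```python
from typing import Optional


def extract_json_object(text: str) -> Optional[str]:
    """Return the first balanced {...} block that starts outside a string literal."""
    in_string = False  # currently inside a "..." literal
    escape = False     # previous character was a backslash inside a string
    depth = 0          # current brace nesting depth
    start = -1         # index of the opening brace of the candidate object

    for i, ch in enumerate(text):
        if in_string:
            # Inside a string, braces are plain characters and are skipped.
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None


print(extract_json_object('message: "{pre}" scores: {"Fluency": 4}'))
```

Because string state is tracked from the start of the text, the { inside "{pre}" never opens a candidate object, and a } inside a quoted value never closes one.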

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._extract_scores_from_json(
text: str
) -> tuple[dict, bool]
classmethod

Extract FAITH scores from an LLM JSON response.

Finds the first balanced {...} block in text (with support for nested objects), parses it as JSON, and normalises the keys to the five FAITH dimensions. Missing keys default to 0.0.

A score of 0.0 follows the “zero means not applicable” convention (see _compute_faith_avg).

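Returns: tuple[dict, bool]

Tuple of (scores, parse_failed), where scores is a dict of the five FAITH dimension scores and parse_failed indicates whether the response could not be parsed.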
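The parse-and-normalise step can be sketched as below. Several details are assumptions rather than documented behaviour: the sketch locates the object with json.JSONDecoder.raw_decode instead of a brace counter, it looks one level down for nested scores, it maps non-numeric values to 0.0, and it reports parse_failed=True when no JSON object is found. The name extract_scores is illustrative.

```python
import json

FAITH_KEYS = ["Fluency", "Accuracy", "Idiomaticity", "Terminology", "Handling_of_Format"]


def extract_scores(text: str) -> tuple[dict, bool]:
    """Return (scores, parse_failed) for one LLM response."""
    decoder = json.JSONDecoder()
    payload = None
    # Try to decode a JSON value at each "{"; the first success wins.
    for idx, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            payload, _ = decoder.raw_decode(text, idx)
            break
        except json.JSONDecodeError:
            continue
    if not isinstance(payload, dict):
        # Assumption: an unparseable response yields all-zero scores and a failure flag.
        return {key: 0.0 for key in FAITH_KEYS}, True

    # Assumption: scores may sit one level down, e.g. {"scores": {...}}.
    if not any(key in payload for key in FAITH_KEYS):
        for value in payload.values():
            if isinstance(value, dict) and any(key in value for key in FAITH_KEYS):
                payload = value
                break

    scores = {}
    for key in FAITH_KEYS:
        try:
            scores[key] = float(payload.get(key, 0.0))
        except (TypeError, ValueError):
            scores[key] = 0.0  # non-numeric value treated as "not applicable"
    return scores, False


print(extract_scores('Result: {"Fluency": 4, "Accuracy": 5}'))
```

Returning a flag alongside the scores lets later steps distinguish "the model said zero" from "the response was unusable".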

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._filter_rows(
df: pandas.DataFrame
) -> pandas.DataFrame

Apply threshold filtering while preserving parse-failed rows.
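A minimal sketch of threshold filtering that keeps parse-failed rows, using plain dicts in place of a pandas DataFrame so that it runs with the standard library alone. The column name faith_parse_failed is hypothetical; faith_avg and the default threshold of 2.5 come from this page.

```python
def filter_rows(rows: list[dict], threshold: float = 2.5) -> list[dict]:
    """Keep rows at or above the threshold, plus any row whose response failed to parse."""
    kept = []
    for row in rows:
        # "faith_parse_failed" is a hypothetical flag column for this sketch.
        parse_failed = row.get("faith_parse_failed", False)
        if parse_failed or row["faith_avg"] >= threshold:
            kept.append(row)
    return kept


rows = [
    {"id": 1, "faith_avg": 4.2, "faith_parse_failed": False},
    {"id": 2, "faith_avg": 1.0, "faith_parse_failed": False},
    {"id": 3, "faith_avg": 0.0, "faith_parse_failed": True},
    {"id": 4, "faith_avg": 2.5, "faith_parse_failed": False},
]
print([row["id"] for row in filter_rows(rows)])
```

Preserving parse failures means a malformed model response does not silently discard a translation that was never actually scored.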

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._log_batch_scores(
df: pandas.DataFrame
) -> None

Log aggregate FAITH scores and parse-failure counts.

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._score_all(
df: pandas.DataFrame
) -> list[str]

Score all rows using the async LLM client.

Handles event-loop edge cases (e.g. being called from within an existing async context such as a Ray async actor).
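One common way to call a coroutine from synchronous code regardless of whether an event loop is already running is sketched below. This is an assumption about the general approach, not a description of what _score_all does internally; score_all_async here is a hypothetical stand-in.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def score_all_async(texts: list[str]) -> list[str]:
    """Hypothetical stand-in for the async scoring coroutine."""
    await asyncio.sleep(0)
    return [f"scored:{t}" for t in texts]


def score_all(texts: list[str]) -> list[str]:
    """Run the coroutine to completion from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run can own one.
        return asyncio.run(score_all_async(texts))
    # A loop is already running in this thread, where asyncio.run would raise.
    # Run the coroutine on a fresh loop in a worker thread and block on the result.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, score_all_async(texts)).result()


print(score_all(["a", "b"]))
```

The worker-thread branch blocks the calling thread until scoring finishes, which is the trade-off for keeping the public method synchronous.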

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._score_all_async(
df: pandas.DataFrame
) -> list[str]
async

Issue concurrent LLM requests for every row.

Uses return_exceptions=True so that individual scoring failures do not abort the entire batch. Failed rows receive an empty string response, and the error is logged.
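The failure-isolation pattern described above can be sketched with asyncio.gather. The fake_llm_call coroutine is a hypothetical stand-in for a real client request, and bounding concurrency with a semaphore is an assumption about how max_concurrent_requests might be applied.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a single LLM request."""
    await asyncio.sleep(0)
    if "boom" in prompt:
        raise RuntimeError("simulated request failure")
    return '{"Fluency": 4}'


async def score_all_async(prompts: list[str], max_concurrent_requests: int = 64) -> list[str]:
    """Issue one request per prompt; failed requests become empty strings."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def bounded(prompt: str) -> str:
        async with semaphore:  # cap the number of in-flight requests
            return await fake_llm_call(prompt)

    # return_exceptions=True returns exceptions as results instead of raising.
    results = await asyncio.gather(*(bounded(p) for p in prompts), return_exceptions=True)

    responses = []
    for i, result in enumerate(results):
        if isinstance(result, BaseException):
            logger.error("Scoring failed for row %d: %s", i, result)
            responses.append("")
        else:
            responses.append(result)
    return responses


print(asyncio.run(score_all_async(["ok", "boom", "ok"], max_concurrent_requests=2)))
```

asyncio.gather returns results in input order, so each response stays aligned with its row even when some requests fail.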

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter._score_batch(
df: pandas.DataFrame
) -> tuple[list[dict], list[bool]]

Run FAITH scoring for each row in the batch.

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch

Score each translation row and, when filter_enabled=True, drop rows below threshold.

nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithEvalFilter.setup(
worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Initialize the LLM client and load prompt templates.

Prompt YAML loading and default generation config are deferred here (instead of __post_init__) for Ray compatibility: __post_init__ runs on the driver, while setup() runs on the worker.
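The deferred-initialisation pattern can be sketched with a small dataclass. DeferredStage, the dict-based config, the placeholder prompt string, and the idempotent guard are all illustrative assumptions; only the default values temperature=0.0 and max_tokens=256 and the _initialized and _system_prompt field shapes come from this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DeferredStage:
    """Hypothetical stage: constructed on the driver, set up on the worker."""

    generation_config: Optional[dict] = None
    _system_prompt: str = field(init=False, repr=False, default="")
    _initialized: bool = field(init=False, repr=False, default=False)

    def setup(self) -> None:
        """Worker-side initialisation; repeat calls are no-ops in this sketch."""
        if self._initialized:
            return
        if self.generation_config is None:
            # Defaults documented for generation_config on this page.
            self.generation_config = {"temperature": 0.0, "max_tokens": 256}
        # Placeholder text; the real stage loads its prompts from YAML.
        self._system_prompt = "placeholder system prompt"
        self._initialized = True


stage = DeferredStage()  # cheap: nothing is loaded at construction time
stage.setup()            # heavier work happens here
print(stage.generation_config)
```

Keeping construction cheap means the object shipped from driver to worker carries only plain configuration, and anything tied to the worker's environment is created where it is used.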

class nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithThresholdFilterStage(
name: str = 'FaithThresholdFilterStage',
threshold: float = 2.5
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Filter document rows using precomputed FAITH scores.

name: str = 'FaithThresholdFilterStage'
threshold: float = 2.5
nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithThresholdFilterStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithThresholdFilterStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.experimental.translation.evaluation.faith.FaithThresholdFilterStage.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch

Drop rows below the FAITH threshold while preserving parse failures.

nemo_curator.stages.text.experimental.translation.evaluation.faith._find_json_object_end(
text: str,
start: int
) -> int

Return the balanced object end index starting at start, or -1.

nemo_curator.stages.text.experimental.translation.evaluation.faith._find_json_object_start(
text: str
) -> int

Return the first { outside a JSON string, or -1.

nemo_curator.stages.text.experimental.translation.evaluation.faith._to_mutable_dataframe(
batch: nemo_curator.tasks.DocumentBatch
) -> pandas.DataFrame

Return a DataFrame safe to mutate in-place for stage-local work.

nemo_curator.stages.text.experimental.translation.evaluation.faith._update_json_string_state(
ch: str,
in_string: bool,
escape: bool
) -> tuple[bool, bool, bool]

Return updated JSON string state and whether ch was consumed by it.
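A per-character string-state helper with this signature could work as sketched below. The exact semantics are an assumption: here the returned tuple is (in_string, escape, consumed), where consumed is True for quote characters and for anything inside a string literal. The structural_chars helper is hypothetical and exists only to show the effect.

```python
def update_json_string_state(ch: str, in_string: bool, escape: bool) -> tuple[bool, bool, bool]:
    """Return (in_string, escape, consumed) after seeing ch."""
    if in_string:
        if escape:
            return True, False, True   # escaped character, still inside the string
        if ch == "\\":
            return True, True, True    # backslash starts an escape sequence
        if ch == '"':
            return False, False, True  # closing quote
        return True, False, True       # ordinary string content
    if ch == '"':
        return True, False, True       # opening quote
    return False, False, False         # structural character, not part of a string


def structural_chars(text: str) -> str:
    """Hypothetical helper: keep only characters that lie outside string literals."""
    in_string = escape = False
    out = []
    for ch in text:
        in_string, escape, consumed = update_json_string_state(ch, in_string, escape)
        if not consumed:
            out.append(ch)
    return "".join(out)


print(structural_chars('"a\\"{" {x}'))
```

A caller that counts braces can skip every consumed character, so quoted braces and escaped quotes never affect the nesting depth.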

nemo_curator.stages.text.experimental.translation.evaluation.faith.FAITH_KEYS = ['Fluency', 'Accuracy', 'Idiomaticity', 'Terminology', 'Handling_of_Format']
nemo_curator.stages.text.experimental.translation.evaluation.faith._SCORE_COLUMNS = ['faith_fluency', 'faith_accuracy', 'faith_idiomaticity', 'faith_terminology', '...