Scorers#

Scorers evaluate model responses against ground truth. BYOB provides built-in scorers for common patterns and supports custom scorer functions.

ScorerInput#

Every scorer receives a single ScorerInput instance. The dataclass is importable from nemo_evaluator.contrib.byob:

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

@dataclass
class ScorerInput:
    response: str              # Model output (or argmax choice in logprob mode)
    target: Any                # Ground truth from dataset
    metadata: dict             # Dataset row + per-call response metadata
    model_call_fn: Optional[Callable] = None
    config: Dict[str, Any] = field(default_factory=dict)
    conversation: Optional[List[dict]] = None
    turn_index: Optional[int] = None

Field	Description
`response`	The model output text for the current sample. In `completions_logprob` mode this is set to the choice with the highest sum-logprob (i.e. the argmax).
`target`	The ground-truth value read from the field specified by `target_field` in `@benchmark`.
`metadata`	Shared bag for dataset-row fields and per-call response metadata. Standard scorers use it to access any column on the row (e.g. `sample.metadata["passage"]`). Strategies that produce extra per-call data write namespaced keys (prefixed with `_`) into this dict before invoking the scorer.
`model_call_fn`	Reserved for multi-turn evaluation (not yet implemented).
`config`	Extra configuration passed through `extra=` in `@benchmark` (e.g. judge settings).
`conversation`	Reserved for multi-turn benchmarks (not yet implemented).
`turn_index`	Reserved for multi-turn benchmarks (not yet implemented).

Reserved metadata keys#

MultipleChoiceStrategy (selected by endpoint_type="completions_logprob") writes the following keys into ScorerInput.metadata before invoking the scorer:

Key	Type	Description
`_choices`	`list[str]`	Candidate continuations resolved from `choices=` or `choices_field=` on `@benchmark`.
`_choices_logprobs`	`list[float]`	Per-choice sum log-probabilities returned by the loglikelihood call. Same length as `_choices`.
`_choices_is_greedy`	`list[bool]`	Per-choice booleans: `True` when every continuation token equals the top-1 prediction (i.e. the choice would have been produced under greedy decoding). Same length as `_choices`.

response is also set to _choices[argmax(_choices_logprobs)] so legacy text-based scorers continue to work in logprob mode.
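As an illustration of how a custom scorer might consume these reserved keys, the sketch below turns the per-choice sum log-probabilities into a normalized distribution over the candidates. It works on a plain dict standing in for ScorerInput.metadata; the helper name choice_probabilities and the numbers are made up for the example and are not part of BYOB.

```python
import math

def choice_probabilities(metadata: dict) -> dict:
    """Softmax over per-choice sum log-probabilities -> {choice: probability}.

    Hypothetical helper: reads only the reserved keys described above.
    """
    choices = metadata["_choices"]
    logprobs = metadata["_choices_logprobs"]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logprobs)
    weights = [math.exp(lp - peak) for lp in logprobs]
    total = sum(weights)
    return {choice: w / total for choice, w in zip(choices, weights)}

# Stand-in for ScorerInput.metadata in completions_logprob mode (invented values).
example_metadata = {
    "_choices": [" A", " B", " C", " D"],
    "_choices_logprobs": [-2.3, -0.4, -3.1, -4.0],
    "_choices_is_greedy": [False, True, False, False],
}

probs = choice_probabilities(example_metadata)
best = max(probs, key=probs.get)  # the same choice that `response` would hold
```

Inside a real scorer the same function could be called with sample.metadata, for example to report the probability mass assigned to the gold choice as an extra float metric.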

The @scorer Decorator#

The @scorer decorator marks a function as a BYOB scorer. It validates the function signature at decoration time and sets an internal _is_scorer flag used by the framework.

A scorer must return a dict with string keys and bool or float values. These key-value pairs become the reported metrics.

from nemo_evaluator.contrib.byob import scorer, ScorerInput

@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return {"correct": sample.response.strip().lower() == str(sample.target).strip().lower()}
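When developing a scorer it can help to check its return value against the contract above before running a full evaluation. The helper below is a hypothetical test utility, not a BYOB function, and it applies a strict reading of the rule (string keys, bool or float values only); the framework itself may be more lenient.

```python
def check_scores(scores) -> None:
    """Raise TypeError unless `scores` is a dict of str -> bool | float.

    Hypothetical development aid; strict reading of the documented contract.
    """
    if not isinstance(scores, dict):
        raise TypeError(f"scorer must return a dict, got {type(scores).__name__}")
    for key, value in scores.items():
        if not isinstance(key, str):
            raise TypeError(f"metric name {key!r} is not a string")
        if not isinstance(value, (bool, float)):
            raise TypeError(f"metric {key!r} has non-bool/float value {value!r}")

# A well-formed result passes silently.
check_scores({"correct": True, "f1": 0.5})
```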

Built-in Scorers#

Import built-in scorers from nemo_evaluator.contrib.byob.scorers:

Scorer	Returns	Description
`exact_match`	`{"correct": bool}`	Case-insensitive, whitespace-stripped equality
`contains`	`{"correct": bool}`	Case-insensitive substring match
`f1_token`	`{"f1": float, "precision": float, "recall": float}`	Token-level F1 using Counter intersection
`regex_match`	`{"correct": bool}`	Regex pattern match (target is the pattern)
`bleu`	`{"bleu_1": float, "bleu_2": float, "bleu_3": float, "bleu_4": float}`	Sentence-level BLEU-1 through BLEU-4 with add-1 smoothing
`rouge`	`{"rouge_1": float, "rouge_2": float, "rouge_l": float}`	ROUGE-1, ROUGE-2, ROUGE-L F1 scores
`retrieval_metrics`	`{"precision_at_k": float, "recall_at_k": float, "mrr": float, "ndcg": float}`	Retrieval quality metrics
`multiple_choice_acc`	`{"acc": float, "acc_norm": float, "acc_greedy": float}`	Multiple-choice loglikelihood ranking. `acc` matches lm-evaluation-harness MMLU-style raw argmax; `acc_norm` is per-byte length-normalized argmax (ARC/BoolQ style); `acc_greedy` is 1.0 when the highest-loglikelihood choice among those flagged greedy matches the gold answer (diagnostic). Requires `endpoint_type="completions_logprob"` and either `choices=` or `choices_field=` on `@benchmark`.
`mcq_letter_extract`	`{"correct": bool, "parsed": bool}`	Extracts an A-J letter from free-form text (handles “A”, “A)”, “The answer is B”, “(C)”, “Option D”, and `\boxed{E}`). Targets may be letters, integer indices, or verbatim choice text from the metadata `a`/`b`/`c`/`d` keys. Empty or `None` responses are treated as unparsed rather than raising.
`gsm8k_answer`	`{"correct": bool, "parsed": bool}`	Canonical GSM8K numeric extractor. Tries the `#### <number>` marker first, then `\boxed{<number>}`, then falls back to the last number in the response. Strips commas and normalizes trailing zeros.
`boolean_yesno`	`{"correct": bool, "parsed": bool}`	Extracts English yes/no decisions from free-form text. Recognizes tokens such as yes/no/yep/nope/true/false.
`chrf`	`{"chrf": float, "chrf_pp": float}`	Sentence-level chrF and chrF++ in [0, 100]. Pure-Python sacrebleu-style formula (character 1- to 6-gram F2; chrF++ adds word 1- and 2-gram F2).
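To make the "Counter intersection" description of f1_token concrete, here is a minimal standalone sketch of token-level F1. The tokenization (lowercasing and whitespace splitting) is an assumption for the example; the built-in scorer may normalize text differently.

```python
from collections import Counter

def token_f1(response: str, target: str) -> dict:
    """Token-level precision/recall/F1 via multiset (Counter) intersection."""
    pred_tokens = response.lower().split()   # assumed tokenization
    gold_tokens = target.lower().split()
    # `&` on Counters keeps the minimum count per token, i.e. the overlap.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}

scores = token_f1("the cat sat", "the cat")
# Two of three predicted tokens overlap; both gold tokens are covered.
```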

Usage example#

from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
from nemo_evaluator.contrib.byob.scorers import exact_match

@benchmark(name="my-qa", dataset="data.jsonl", prompt="Q: {question}\nA:", target_field="answer")
@scorer
def check(sample: ScorerInput) -> dict:
    return exact_match(sample)

Note

retrieval_metrics reads the following keys from sample.metadata:

retrieved – ordered list of retrieved item identifiers.
relevant – list of relevant (ground-truth) item identifiers.
k (optional) – cut-off depth; defaults to len(retrieved).

Make sure every row of your JSONL dataset includes retrieved and relevant.
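For orientation, the sketch below computes the four metrics from the retrieved, relevant, and k fields using common textbook definitions with binary relevance. It is an approximation for illustration: the exact formulas in the built-in retrieval_metrics (for example how nDCG is normalized, or how duplicate identifiers are treated) are not specified here and may differ.

```python
import math

def retrieval_sketch(retrieved, relevant, k=None) -> dict:
    """Precision@k, recall@k, MRR and nDCG with binary relevance (assumed)."""
    k = len(retrieved) if k is None else k
    relevant_set = set(relevant)
    # One boolean per rank in the top-k: was that item relevant?
    hits = [item in relevant_set for item in retrieved[:k]]

    precision = sum(hits) / k if k else 0.0
    recall = sum(hits) / len(relevant_set) if relevant_set else 0.0

    # Reciprocal rank of the first relevant item, 0.0 if none was retrieved.
    mrr = next((1.0 / rank for rank, hit in enumerate(hits, start=1) if hit), 0.0)

    # DCG with gain 1 for relevant items, discounted by log2(rank + 1).
    dcg = sum(1.0 / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1) if hit)
    ideal_hits = min(len(relevant_set), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0

    return {"precision_at_k": precision, "recall_at_k": recall, "mrr": mrr, "ndcg": ndcg}

# A dataset row would carry these fields, e.g.
# {"question": "...", "retrieved": ["d3", "d1", "d7"], "relevant": ["d1", "d9"]}
metrics = retrieval_sketch(["d3", "d1", "d7"], ["d1", "d9"])
```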

Writing Custom Scorers#

A custom scorer is any function decorated with @scorer that accepts a ScorerInput and returns a dict:

from nemo_evaluator.contrib.byob import scorer, ScorerInput

@scorer
def my_scorer(sample: ScorerInput) -> dict:
    # Access model response
    response = sample.response.strip()
    # Access ground truth
    target = str(sample.target).strip()
    # Access additional dataset fields
    category = sample.metadata.get("category", "unknown")

    is_correct = response.lower() == target.lower()
    return {
        "correct": is_correct,
        f"correct_{category}": is_correct,
    }

Tip

Return per-category breakdowns by including dynamic keys in the return dict. The aggregation layer will automatically compute means for every unique key across all samples, giving you fine-grained metric slices at no extra cost.
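The effect of dynamic keys on aggregation can be pictured with a small standalone sketch. It assumes the mean for each key is taken over the samples that actually emitted that key; that detail, and the function name mean_per_key, are assumptions for illustration rather than a description of the BYOB aggregation code.

```python
from collections import defaultdict

def mean_per_key(per_sample_scores) -> dict:
    """Average every metric key over the samples that reported it (assumed)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in per_sample_scores:
        for key, value in scores.items():
            totals[key] += float(value)   # True/False count as 1.0/0.0
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Three scorer outputs using a dynamic f"correct_{category}" key.
rows = [
    {"correct": True, "correct_math": True},
    {"correct": False, "correct_math": False},
    {"correct": True, "correct_history": True},
]
summary = mean_per_key(rows)
```

Under this assumption, correct is averaged over all three samples while correct_math and correct_history are each averaged only over the samples in that category.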

Combining built-in and custom logic#

You can call built-in scorers inside a custom scorer and merge the results:

from nemo_evaluator.contrib.byob import scorer, ScorerInput
from nemo_evaluator.contrib.byob.scorers import f1_token, exact_match

@scorer
def combined(sample: ScorerInput) -> dict:
    em = exact_match(sample)
    f1 = f1_token(sample)
    return {**em, **f1}

Multiple-choice loglikelihood ranking#

For MMLU-, ARC-, and BoolQ-style benchmarks, BYOB supports per-choice loglikelihood ranking with lm-evaluation-harness parity:

from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
from nemo_evaluator.contrib.byob.scorers import multiple_choice_acc

@benchmark(
    name="mmlu-mini",
    dataset="hf://my-org/my-mmlu?split=test",
    prompt="Question: {question}\nAnswer:",
    target_field="answer",                    # gold letter, e.g. "B"
    endpoint_type="completions_logprob",      # enables loglikelihood scoring
    choices=[" A", " B", " C", " D"],         # static candidates per row
    num_fewshot=5,                            # optional fewshot prefix
)
@scorer
def mmlu_score(sample: ScorerInput) -> dict:
    return multiple_choice_acc(sample)        # {acc, acc_norm, acc_greedy}

For datasets with per-row variable choices (e.g. ARC), set choices_field instead of choices:

@benchmark(
    ...,
    choices_field="choices_text",             # row["choices_text"] is a list[str]
)

Nested/dotted fields are also supported for HuggingFace datasets that store choices under a struct-like column:

@benchmark(
    ...,
    choices_field="choices.text",             # row["choices"]["text"]
)

How it works#

MultipleChoiceStrategy (selected automatically when endpoint_type="completions_logprob") calls the OpenAI-compatible /v1/completions endpoint once per choice, exactly like lm-eval’s local-completions adapter:

POST /v1/completions
{
  "model": "...",
  "prompt": "<context><continuation>",
  "max_tokens": 0,
  "logprobs": 1,
  "echo": true,
  "temperature": 0
}

The runner inspects logprobs.text_offset to locate the continuation token span, sums token_logprobs over that span, and decides is_greedy by checking whether each continuation token matches the top-1 entry of top_logprobs. The resulting per-choice (sum_logprob, is_greedy) tuples are written into ScorerInput.metadata under the reserved keys _choices, _choices_logprobs, and _choices_is_greedy. multiple_choice_acc then computes:

acc – 1.0 iff argmax(metadata["_choices_logprobs"]) == gold_index (MMLU canonical).
acc_norm – 1.0 iff argmax(metadata["_choices_logprobs"][i] / max(len(metadata["_choices"][i].encode("utf-8")), 1)) == gold_index (ARC/BoolQ canonical, per-byte length normalization).
acc_greedy – 1.0 iff the highest-loglikelihood greedy choice matches gold (diagnostic).
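The three definitions above can be written out as a short standalone sketch. It operates on a plain dict standing in for ScorerInput.metadata and takes the gold answer as an already-resolved integer index. Returning 0.0 for acc_greedy when no choice is greedy is an assumption of this sketch, as is the function name.

```python
def mc_metrics(metadata: dict, gold_index: int) -> dict:
    """acc / acc_norm / acc_greedy from the reserved metadata keys."""
    choices = metadata["_choices"]
    logprobs = metadata["_choices_logprobs"]
    is_greedy = metadata["_choices_is_greedy"]
    indices = range(len(choices))

    # acc: raw argmax over sum log-probabilities.
    raw_best = max(indices, key=lambda i: logprobs[i])

    # acc_norm: argmax after dividing by the UTF-8 byte length of each choice.
    norm_best = max(
        indices,
        key=lambda i: logprobs[i] / max(len(choices[i].encode("utf-8")), 1),
    )

    # acc_greedy: best-scoring choice among those flagged greedy, if any.
    greedy = [i for i in indices if is_greedy[i]]
    greedy_best = max(greedy, key=lambda i: logprobs[i]) if greedy else None

    return {
        "acc": float(raw_best == gold_index),
        "acc_norm": float(norm_best == gold_index),
        "acc_greedy": float(greedy_best == gold_index),
    }

# Invented example where length normalization changes the winner:
# -3.0 / 3 bytes = -1.0 versus -7.0 / 14 bytes = -0.5.
example = {
    "_choices": ["yes", "definitely not"],
    "_choices_logprobs": [-3.0, -7.0],
    "_choices_is_greedy": [False, True],
}
result = mc_metrics(example, gold_index=1)
```

Here the short choice wins on raw log-probability, but the longer gold choice wins once scores are normalized per byte, which is why acc and acc_norm can disagree on the same sample.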

The gold answer can be a letter ("A".."J"), an integer index, or the verbatim choice string – multiple_choice_acc handles all three.
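The per-choice numbers that feed these metrics come from the echoed completion response described earlier in this section. The sketch below shows one way the continuation span could be located and summed from an OpenAI-style logprobs object (tokens, token_logprobs, top_logprobs, text_offset). It assumes character offsets and a token boundary that falls exactly at the end of the context; the function name and sample payload are invented, and the actual runner may handle boundaries differently.

```python
def score_continuation(logprobs: dict, context: str):
    """Return (sum_logprob, is_greedy) for the tokens after `context`."""
    start = len(context)   # assumed: text_offset counts characters
    total = 0.0
    is_greedy = True
    for i, offset in enumerate(logprobs["text_offset"]):
        if offset < start:
            continue       # token belongs to the prompt context
        token_logprob = logprobs["token_logprobs"][i]
        if token_logprob is None:
            continue       # the first echoed token has no logprob
        total += token_logprob
        # Greedy iff the actual token is the most likely entry at this position.
        candidates = logprobs["top_logprobs"][i]
        if max(candidates, key=candidates.get) != logprobs["tokens"][i]:
            is_greedy = False
    return total, is_greedy

# Invented payload for prompt "2+2=" followed by continuation " 5!".
payload = {
    "tokens": ["2", "+", "2", "=", " 5", "!"],
    "text_offset": [0, 1, 2, 3, 4, 6],
    "token_logprobs": [None, -1.25, -0.25, -0.75, -4.0, -0.5],
    "top_logprobs": [
        None,
        {"+": -1.25},
        {"2": -0.25},
        {"=": -0.75},
        {" 4": -0.125, " 5": -4.0},   # " 4" was the model's top-1 here
        {"!": -0.5},
    ],
}
sum_logprob, greedy = score_continuation(payload, context="2+2=")
```

In this example the continuation sums to -4.5 and is not greedy, because the model's top-1 token at the first continuation position was " 4" rather than " 5".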