Scorers#
Scorers evaluate model responses against ground truth. BYOB provides built-in scorers for common patterns and supports custom scorer functions.
ScorerInput#
Every scorer receives a single `ScorerInput` dataclass, importable from `nemo_evaluator.contrib.byob`:
```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

@dataclass
class ScorerInput:
    response: str        # Model output
    target: Any          # Ground truth from dataset
    metadata: dict       # Full dataset row as a dict
    model_call_fn: Optional[Callable] = None
    config: Dict[str, Any] = field(default_factory=dict)
    conversation: Optional[List[dict]] = None
    turn_index: Optional[int] = None
```
| Field | Description |
|---|---|
| `response` | The model output text for the current sample. |
| `target` | The ground-truth value read from the field specified by `target_field`. |
| `metadata` | The entire dataset row as a dictionary, useful for accessing additional fields beyond the target. |
| `model_call_fn` | Reserved for multi-turn evaluation (not yet implemented). |
| `config` | Extra configuration passed through to the scorer. |
| `conversation` | Reserved for multi-turn benchmarks (not yet implemented). |
| `turn_index` | Reserved for multi-turn benchmarks (not yet implemented). |
The @scorer Decorator#
The `@scorer` decorator marks a function as a BYOB scorer. It validates the function signature at decoration time and sets an internal `_is_scorer` flag used by the framework.
A scorer must return a dict with string keys and `bool` or `float` values. These key-value pairs become the reported metrics.
```python
from nemo_evaluator.contrib.byob import scorer, ScorerInput

@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return {"correct": sample.response.strip().lower() == str(sample.target).strip().lower()}
```
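To see the return contract in action, a scorer can be exercised directly on a hand-built sample. The `FakeScorerInput` stand-in below is hypothetical, defined only so the snippet runs without `nemo_evaluator` installed; its field names mirror the dataclass above.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical stand-in for ScorerInput, for illustration only;
# field names match the real dataclass shown above.
@dataclass
class FakeScorerInput:
    response: str
    target: Any
    metadata: dict = field(default_factory=dict)

def my_scorer(sample) -> dict:
    # Same normalization as the decorated example above.
    return {"correct": sample.response.strip().lower() == str(sample.target).strip().lower()}

sample = FakeScorerInput(response=" Paris ", target="paris", metadata={"id": 1})
# my_scorer(sample) -> {"correct": True}
```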
Built-in Scorers#
Import built-in scorers from `nemo_evaluator.contrib.byob.scorers`:
| Scorer | Description |
|---|---|
| `exact_match` | Case-insensitive, whitespace-stripped equality |
| | Case-insensitive substring match |
| `f1_token` | Token-level F1 using `Counter` intersection |
| | Regex pattern match (target is the pattern) |
| | Sentence-level BLEU-1 through BLEU-4 with add-1 smoothing |
| | ROUGE-1, ROUGE-2, ROUGE-L F1 scores |
| `retrieval_metrics` | Retrieval quality metrics |
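The token-level F1 described above can be sketched as a `Counter` (multiset) intersection. This is an illustrative re-implementation, not the library's code; the built-in `f1_token` scorer may tokenize or normalize differently.

```python
from collections import Counter

def f1_token_sketch(response: str, target: str) -> float:
    """Token-level F1 via multiset intersection (illustrative sketch)."""
    pred = response.lower().split()
    gold = target.lower().split()
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# f1_token_sketch("the cat sat", "the cat slept")
# -> precision 2/3, recall 2/3, F1 = 2/3
```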
Usage example#
```python
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
from nemo_evaluator.contrib.byob.scorers import exact_match

@benchmark(name="my-qa", dataset="data.jsonl", prompt="Q: {question}\nA:", target_field="answer")
@scorer
def check(sample: ScorerInput) -> dict:
    return exact_match(sample)
```
Note
`retrieval_metrics` expects two lists in `sample.metadata`:

- `retrieved` – ordered list of retrieved item identifiers.
- `relevant` – list of relevant (ground-truth) item identifiers.
- `k` (optional) – cut-off depth; defaults to `len(retrieved)`.
Make sure your JSONL dataset includes these fields for every row.
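Precision@k and recall@k over those two lists can be sketched as follows. This is an assumption-laden illustration of "retrieval quality metrics"; the built-in `retrieval_metrics` scorer may report additional or differently named keys.

```python
def retrieval_metrics_sketch(retrieved, relevant, k=None):
    """Illustrative precision@k / recall@k from ordered `retrieved`
    and ground-truth `relevant` lists (metric names are assumptions)."""
    k = k if k is not None else len(retrieved)
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for item in top_k if item in relevant_set)
    return {
        "precision_at_k": hits / k if k else 0.0,
        "recall_at_k": hits / len(relevant_set) if relevant_set else 0.0,
    }

# retrieval_metrics_sketch(["a", "b", "c"], ["a", "c", "d"], k=2)
# -> 1 hit in the top 2: precision 0.5, recall 1/3
```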
Writing Custom Scorers#
A custom scorer is any function decorated with `@scorer` that accepts a `ScorerInput` and returns a dict:
```python
from nemo_evaluator.contrib.byob import scorer, ScorerInput

@scorer
def my_scorer(sample: ScorerInput) -> dict:
    # Access model response
    response = sample.response.strip()
    # Access ground truth
    target = str(sample.target).strip()
    # Access additional dataset fields
    category = sample.metadata.get("category", "unknown")
    is_correct = response.lower() == target.lower()
    return {
        "correct": is_correct,
        f"correct_{category}": is_correct,
    }
```
Tip
Return per-category breakdowns by including dynamic keys in the return dict. The aggregation layer will automatically compute means for every unique key across all samples, giving you fine-grained metric slices at no extra cost.
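The aggregation behavior can be modeled with a simple per-key mean. This is a sketch of the behavior described above, not the framework's implementation; how BYOB handles samples that lack a given key is an assumption here (they are simply skipped).

```python
from collections import defaultdict

def aggregate_sketch(per_sample_metrics):
    """Mean of each metric key across samples (illustrative model of
    the aggregation layer; samples missing a key are skipped)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for metrics in per_sample_metrics:
        for key, value in metrics.items():
            sums[key] += float(value)  # bools count as 0/1
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Two samples from different categories each contribute to "correct"
# plus their own per-category slice:
rows = [{"correct": True, "correct_math": True},
        {"correct": False, "correct_science": False}]
# aggregate_sketch(rows)
# -> {"correct": 0.5, "correct_math": 1.0, "correct_science": 0.0}
```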
Combining built-in and custom logic#
You can call built-in scorers inside a custom scorer and merge the results:
```python
from nemo_evaluator.contrib.byob import scorer, ScorerInput
from nemo_evaluator.contrib.byob.scorers import f1_token, exact_match

@scorer
def combined(sample: ScorerInput) -> dict:
    em = exact_match(sample)
    f1 = f1_token(sample)
    return {**em, **f1}
```
See Also#
Bring Your Own Benchmark (BYOB) – BYOB overview and quickstart
LLM-as-Judge – LLM-as-Judge evaluation for subjective criteria