Scorers#
Scorers evaluate model responses against ground truth. BYOB provides built-in scorers for common patterns and supports custom scorer functions.
ScorerInput#
Every scorer receives a single `ScorerInput` dataclass, importable from `nemo_evaluator.contrib.byob`:
```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

@dataclass
class ScorerInput:
    response: str        # Model output
    target: Any          # Ground truth from dataset
    metadata: dict       # Full dataset row as a dict
    model_call_fn: Optional[Callable] = None
    config: Dict[str, Any] = field(default_factory=dict)
    conversation: Optional[List[dict]] = None
    turn_index: Optional[int] = None
```
| Field | Description |
|---|---|
| `response` | The model output text for the current sample. |
| `target` | The ground-truth value read from the field specified by `target_field`. |
| `metadata` | The entire dataset row as a dictionary, useful for accessing additional fields beyond the target. |
| `model_call_fn` | Reserved for multi-turn evaluation (not yet implemented). |
| `config` | Extra configuration passed through to the scorer. |
| `conversation` | Reserved for multi-turn benchmarks (not yet implemented). |
| `turn_index` | Reserved for multi-turn benchmarks (not yet implemented). |
The @scorer Decorator#
The `@scorer` decorator marks a function as a BYOB scorer. It validates the function signature at decoration time and sets an internal `_is_scorer` flag used by the framework.
A scorer must return a dict with string keys and `bool` or `float` values. These key-value pairs become the reported metrics.
```python
from nemo_evaluator.contrib.byob import scorer, ScorerInput

@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return {"correct": sample.response.strip().lower() == str(sample.target).strip().lower()}
```
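To see the return contract in action, a scorer can be exercised directly on a hand-built sample. The `FakeScorerInput` stand-in below is hypothetical, defined only so the snippet runs without `nemo_evaluator` installed; its field names mirror the dataclass above.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical stand-in for ScorerInput, for illustration only;
# field names match the real dataclass shown above.
@dataclass
class FakeScorerInput:
    response: str
    target: Any
    metadata: dict = field(default_factory=dict)

def my_scorer(sample) -> dict:
    # Same normalization as the decorated example above.
    return {"correct": sample.response.strip().lower() == str(sample.target).strip().lower()}

sample = FakeScorerInput(response=" Paris ", target="paris", metadata={"id": 1})
# my_scorer(sample) -> {"correct": True}
```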
Built-in Scorers#
Import built-in scorers from `nemo_evaluator.contrib.byob.scorers`:
| Scorer | Description |
|---|---|
| `exact_match` | Case-insensitive, whitespace-stripped equality |
| | Case-insensitive substring match |
| `f1_token` | Token-level F1 using `Counter` intersection |
| | Regex pattern match (target is the pattern) |
| | Sentence-level BLEU-1 through BLEU-4 with add-1 smoothing |
| | ROUGE-1, ROUGE-2, ROUGE-L F1 scores |
| `retrieval_metrics` | Retrieval quality metrics |
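The token-level F1 described above can be sketched as a `Counter` (multiset) intersection. This is an illustrative re-implementation, not the library's code; the built-in `f1_token` scorer may tokenize or normalize differently.

```python
from collections import Counter

def f1_token_sketch(response: str, target: str) -> float:
    """Token-level F1 via multiset intersection (illustrative sketch)."""
    pred = response.lower().split()
    gold = target.lower().split()
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# f1_token_sketch("the cat sat", "the cat slept")
# -> precision 2/3, recall 2/3, F1 = 2/3
```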
Usage example#
```python
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
from nemo_evaluator.contrib.byob.scorers import exact_match

@benchmark(name="my-qa", dataset="data.jsonl", prompt="Q: {question}\nA:", target_field="answer")
@scorer
def check(sample: ScorerInput) -> dict:
    return exact_match(sample)
```
Note
`retrieval_metrics` expects two lists in `sample.metadata`:

- `retrieved` – ordered list of retrieved item identifiers.
- `relevant` – list of relevant (ground-truth) item identifiers.
- `k` (optional) – cut-off depth; defaults to `len(retrieved)`.
Make sure your JSONL dataset includes these fields for every row.
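Precision@k and recall@k over those two lists can be sketched as follows. This is an assumption-laden illustration of "retrieval quality metrics"; the built-in `retrieval_metrics` scorer may report additional or differently named keys.

```python
def retrieval_metrics_sketch(retrieved, relevant, k=None):
    """Illustrative precision@k / recall@k from ordered `retrieved`
    and ground-truth `relevant` lists (metric names are assumptions)."""
    k = k if k is not None else len(retrieved)
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for item in top_k if item in relevant_set)
    return {
        "precision_at_k": hits / k if k else 0.0,
        "recall_at_k": hits / len(relevant_set) if relevant_set else 0.0,
    }

# retrieval_metrics_sketch(["a", "b", "c"], ["a", "c", "d"], k=2)
# -> 1 hit in the top 2: precision 0.5, recall 1/3
```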
Writing Custom Scorers#
A custom scorer is any function decorated with `@scorer` that accepts a `ScorerInput` and returns a dict:
```python
from nemo_evaluator.contrib.byob import scorer, ScorerInput

@scorer
def my_scorer(sample: ScorerInput) -> dict:
    # Access model response
    response = sample.response.strip()
    # Access ground truth
    target = str(sample.target).strip()
    # Access additional dataset fields
    category = sample.metadata.get("category", "unknown")
    is_correct = response.lower() == target.lower()
    return {
        "correct": is_correct,
        f"correct_{category}": is_correct,
    }
```
Tip
Return per-category breakdowns by including dynamic keys in the return dict. The aggregation layer will automatically compute means for every unique key across all samples, giving you fine-grained metric slices at no extra cost.
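The aggregation behavior can be modeled with a simple per-key mean. This is a sketch of the behavior described above, not the framework's implementation; how BYOB handles samples that lack a given key is an assumption here (they are simply skipped).

```python
from collections import defaultdict

def aggregate_sketch(per_sample_metrics):
    """Mean of each metric key across samples (illustrative model of
    the aggregation layer; samples missing a key are skipped)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for metrics in per_sample_metrics:
        for key, value in metrics.items():
            sums[key] += float(value)  # bools count as 0/1
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Two samples from different categories each contribute to "correct"
# plus their own per-category slice:
rows = [{"correct": True, "correct_math": True},
        {"correct": False, "correct_science": False}]
# aggregate_sketch(rows)
# -> {"correct": 0.5, "correct_math": 1.0, "correct_science": 0.0}
```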
Combining built-in and custom logic#
You can call built-in scorers inside a custom scorer and merge the results:
```python
from nemo_evaluator.contrib.byob import scorer, ScorerInput
from nemo_evaluator.contrib.byob.scorers import f1_token, exact_match

@scorer
def combined(sample: ScorerInput) -> dict:
    em = exact_match(sample)
    f1 = f1_token(sample)
    return {**em, **f1}
```
See Also#
Bring Your Own Benchmark (BYOB) – BYOB overview and quickstart
LLM-as-Judge – LLM-as-Judge evaluation for subjective criteria