Benchmark Decorator#
The @benchmark and @scorer decorators are the user-facing API for defining BYOB benchmarks. Stack @benchmark (outer) on top of @scorer (inner) to register a scoring function with its dataset, prompt, and configuration.
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
@benchmark(name="my-qa", dataset="data.jsonl", prompt="Q: {question}\nA:")
@scorer
def check(sample: ScorerInput) -> dict:
    return {"correct": sample.target.lower() in sample.response.lower()}
Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | required | Human-readable benchmark name |
| dataset | str | required | Path to a JSONL file or a HuggingFace dataset URI |
| prompt | str | required | Format string with {field} placeholders, or path to a template file |
| | | | Dataset field containing ground truth |
| | | | Pip deps (list or path to requirements.txt) |
| field_mapping | dict | | Maps source columns to prompt field names |
| | | | Framework-specific params (judge config, etc.) |
| response_field | str | | JSONL field with pre-generated responses (eval-only mode) |
| system_prompt | str | | System prompt string or path to a template file |
Name Normalization#
The name parameter is normalized to create a valid Python identifier used in the compiled package and eval_type string. The rules are:
1. Lowercase the name
2. Replace non-alphanumeric characters with underscores
3. Collapse consecutive underscores
4. Strip leading and trailing underscores
5. Truncate to 50 characters
For example, "My QA Benchmark!" becomes "my_qa_benchmark".
Warning
If a name normalizes to an empty string (for example, "!!!") the decorator raises a ValueError. Use a name containing at least one alphanumeric character.
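The normalization rules above can be sketched as follows (a hypothetical helper for illustration, not the library's actual implementation; normalize_name is an assumed name):

```python
import re

def normalize_name(name: str) -> str:
    """Sketch of the normalization rules: lowercase, replace and collapse
    non-alphanumerics, strip underscores, truncate to 50 characters."""
    s = name.lower()                    # 1. lowercase
    s = re.sub(r"[^a-z0-9]+", "_", s)   # 2-3. replace non-alphanumerics, collapsing runs
    s = s.strip("_")                    # 4. strip leading/trailing underscores
    s = s[:50]                          # 5. truncate to 50 characters
    if not s:
        raise ValueError("name must contain at least one alphanumeric character")
    return s

normalize_name("My QA Benchmark!")  # → "my_qa_benchmark"
```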
Prompt Templates#
BYOB supports three ways to define prompts.
Inline format strings#
Pass a Python format string directly. Placeholders use {field} syntax and are filled from dataset columns (after any field_mapping is applied).
@benchmark(
    name="trivia",
    dataset="trivia.jsonl",
    prompt="Q: {question}\nA:",
)
File-based templates#
Paths ending in .txt, .md, .jinja, or .jinja2 are read from disk. Relative paths resolve from the benchmark file’s directory.
@benchmark(
    name="trivia",
    dataset="trivia.jsonl",
    prompt="prompts/trivia.txt",
)
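Relative-path resolution can be pictured roughly as follows (a sketch under the stated rules; resolve_template is a hypothetical name):

```python
from pathlib import Path

TEMPLATE_SUFFIXES = {".txt", ".md", ".jinja", ".jinja2"}

def resolve_template(prompt: str, benchmark_file: str) -> str:
    # If the prompt looks like a template path, resolve it relative to the
    # directory containing the benchmark definition file.
    p = Path(prompt)
    if p.suffix in TEMPLATE_SUFFIXES and not p.is_absolute():
        p = Path(benchmark_file).parent / p
    return str(p)
```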
Jinja2 templates#
Jinja2 rendering is activated when:
- The prompt contains {% (block tags) or {# (comments)
- The file has a .jinja or .jinja2 extension
Variable-only Jinja2 templates ({{ var }} without block tags) require a .jinja or .jinja2 file extension to be detected.
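These detection rules can be sketched as a small predicate (uses_jinja is an illustrative name, not a documented function):

```python
def uses_jinja(prompt: str) -> bool:
    # Extension-based detection for template files.
    if prompt.endswith((".jinja", ".jinja2")):
        return True
    # Content-based detection: block tags or comments activate Jinja2.
    return "{%" in prompt or "{#" in prompt
```

Note that a variable-only inline string such as "Hello {{ name }}" contains neither {% nor {#, so it is not detected; it needs a .jinja or .jinja2 file.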
Decorator Stacking#
@benchmark must be the outer (top) decorator and @scorer must be the inner (bottom) decorator. The @scorer decorator validates the function signature and marks it as a scorer. The @benchmark decorator then wraps the marked function and registers it in the benchmark registry.
@benchmark(...) # outer: registers the benchmark
@scorer # inner: validates and marks the function
def my_scorer(sample: ScorerInput) -> dict:
    ...
The @scorer decorator accepts functions with one or two parameters:
- 1 parameter (preferred): def scorer(sample: ScorerInput)
- 2 parameters: def scorer(sample, config)
Functions with 0 or 3+ parameters are rejected with a TypeError.
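The arity check can be sketched like this (a hypothetical version of the validation, not the library's actual code):

```python
import inspect

def validate_scorer(fn):
    # Accept scorers with exactly 1 parameter (sample) or 2 (sample, config);
    # reject anything else with a TypeError, mirroring the rule above.
    n = len(inspect.signature(fn).parameters)
    if n not in (1, 2):
        raise TypeError(f"scorer must accept 1 or 2 parameters, got {n}")
    return fn
```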
Eval-Only Mode#
Set response_field to skip model inference and read responses directly from the dataset. This is useful for evaluating pre-generated outputs or comparing models offline.
@benchmark(
    name="eval-only",
    dataset="responses.jsonl",
    prompt="Q: {question}\nA:",
    response_field="model_output",
)
@scorer
def check(sample: ScorerInput) -> dict:
    return {"correct": sample.target.lower() in sample.response.lower()}
The dataset must include the field specified by response_field:
{"question": "Is the sky blue?", "answer": "yes", "model_output": "Yes, the sky is blue."}
System Prompts#
Use system_prompt to prepend a system message to model calls. The value can be an inline string or a path to a template file (same resolution rules as prompt).
@benchmark(
    name="code-review",
    dataset="code.jsonl",
    prompt="{code_snippet}",
    system_prompt="You are an expert code reviewer. Be concise.",
)
@scorer
def review(sample: ScorerInput) -> dict:
    return {"has_feedback": len(sample.response.strip()) > 0}
Note
System prompts support Jinja2 templates with the same detection rules as user prompts.
See Also#
Bring Your Own Benchmark (BYOB) – BYOB overview and quickstart
Scorers – Built-in scorers and custom scoring functions
Datasets – Dataset formats, HuggingFace URIs, and field mapping