Write Your Own Benchmark (BYOB)#
Define a complete benchmark with @benchmark + @scorer in under 10 lines.
Minimal Example#
from nemo_evaluator import benchmark, scorer, ScorerInput, exact_match

@benchmark(
    name="my-bench",
    dataset="hf://my-org/my-dataset?split=test",
    prompt="Question: {question}\nAnswer:",
    target_field="answer",
)
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return exact_match(sample)
Run it:
nel eval run --bench my-bench
That is the entire workflow. The @benchmark decorator registers an environment, loads
the dataset, formats prompts, and wires scoring. No subclass boilerplate required.
How It Works#
flowchart LR
A["@benchmark"] --> B["ByobEnvironment"]
B --> C["seed(idx)"]
B --> D["verify(response, expected)"]
C --> E["Prompt + Expected Answer"]
D --> F["@scorer function"]
F --> G["Reward + Scoring Details"]
- @benchmark creates a ByobEnvironment (subclass of EvalEnvironment) and registers it by name.
- On seed(idx), the environment loads the dataset row, formats the prompt template, and returns a SeedResult.
- On verify(response, expected), it calls your @scorer function with a ScorerInput and converts the result to a VerifyResult.
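Conceptually, the generated environment behaves like the sketch below. This is a simplified illustration rather than the actual ByobEnvironment code, and the ScorerInput keyword arguments are assumed purely for the example.

from nemo_evaluator import ScorerInput, SeedResult, VerifyResult

class SketchedByobEnvironment:
    """Simplified picture of what @benchmark wires up (not the real class)."""

    def __init__(self, rows, prompt, target_field, scorer_fn):
        self._rows = rows                  # dataset rows loaded from the dataset spec
        self._prompt = prompt              # Python format string
        self._target_field = target_field  # field holding the expected answer
        self._scorer_fn = scorer_fn        # your @scorer function

    async def seed(self, idx: int) -> SeedResult:
        row = self._rows[idx]
        return SeedResult(
            prompt=self._prompt.format(**row),
            expected_answer=row[self._target_field],
        )

    async def verify(self, response: str, expected: str, **meta) -> VerifyResult:
        # ScorerInput keyword names below are assumptions for illustration only.
        result = self._scorer_fn(ScorerInput(response=response, expected=expected))
        reward = float(result.get("correct", result.get("reward", 0.0)))
        return VerifyResult(reward=reward)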
Step-by-Step#
Step 1: Create the benchmark file#
Create src/nemo_evaluator/benchmarks/my_reasoning.py:
from nemo_evaluator import benchmark, scorer, ScorerInput, answer_line

@benchmark(
    name="my_reasoning",
    dataset="hf://your-org/reasoning-2026?split=test",
    prompt=(
        "Solve the following problem step by step.\n\n"
        "{question}\n\n"
        "Put your final answer after 'Answer:'."
    ),
    target_field="answer",
)
@scorer
def my_reasoning_scorer(sample: ScorerInput) -> dict:
    return answer_line(sample)
Step 2: Register the import#
Add the import to src/nemo_evaluator/benchmarks/__init__.py:
from nemo_evaluator.benchmarks.my_reasoning import my_reasoning_scorer
Step 3: Validate#
nel validate -b my_reasoning --samples 10
Expected output:
my_reasoning: 10 samples
7/10 correct
[PASS] p0: expected='42' got='42' (1230ms 156tok)
[PASS] p1: expected='7/3' got='7/3' (980ms 134tok)
[FAIL] p2: expected='256' got='512' (1100ms 201tok)
...
Step 4: Run full evaluation#
nel eval run --bench my_reasoning --repeats 4 --output-dir ./results/my_reasoning
Step 5: Serve for Gym training#
nel serve -b my_reasoning -p 9090
Gym training connects at http://hostname:9090.
Decorator Reference#
@benchmark parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Environment name used by the CLI (--bench / -b) and in configs |
| dataset | str or callable | Yes | HuggingFace URI (hf://...), local JSONL path, or callable (see Dataset Specs) |
| prompt | str | Yes | Python format string using dataset field names |
| target_field | str | No | Dataset field containing the expected answer |
| | | No | System message prepended to the conversation |
| | | No | Rename dataset fields before prompt formatting |
| prepare_row | callable | No | Transform dataset rows before prompt formatting (see Extension Hooks) |
| seed_fn | callable | No | Fully custom seed construction (see Extension Hooks) |
| | | No | Arbitrary config passed to the scorer via ScorerInput |
| | | No | Python packages required by this benchmark |
@scorer function signature#
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    ...
The function receives a ScorerInput:
- The model's raw response text
- The expected answer from the dataset
- Row metadata (all non-target fields)
Return a dict. The key "correct" (or "reward") is converted to the numeric reward.
Optionally include "extracted" for the extracted answer string.
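As a concrete sketch of that contract, a scorer that does its own extraction can return all three keys. The sample.response and sample.expected attribute names here are assumptions made for illustration; the built-in primitives described below handle this access for you.

from nemo_evaluator import ScorerInput

def my_answer_scorer(sample: ScorerInput) -> dict:
    # Attribute names on ScorerInput are assumed for illustration only.
    predicted = (sample.response.strip().splitlines() or [""])[-1]  # naive: take the last line
    expected = sample.expected.strip()
    return {
        "correct": predicted == expected,  # converted to the numeric reward
        "extracted": predicted,            # surfaced in the scoring details
    }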
Dataset Specs#
| Format | Example |
|---|---|
| HuggingFace | hf://my-org/my-dataset?split=test |
| Local JSONL | |
| Callable | |
HuggingFace URIs support split and config query parameters.
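For illustration, here are the three shapes a dataset spec can take. Only the HuggingFace URI is taken from the examples on this page; the JSONL path and the loader function are hypothetical placeholders.

# 1. HuggingFace URI with config and split query parameters
hf_spec = "hf://Idavidrein/gpqa?config=gpqa_diamond&split=train"

# 2. Local JSONL file (hypothetical path; one JSON object per line)
jsonl_spec = "./data/my_eval.jsonl"

# 3. Callable returning dataset rows (hypothetical loader)
def load_rows() -> list[dict]:
    return [{"question": "2 + 2 = ?", "answer": "4"}]

# Any of these can be passed as the dataset= argument of @benchmark.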
Scoring Primitives#
Built-in functions you can call from your scorer:
| Function | Use case |
|---|---|
| exact_match | Normalized string equality |
| multichoice_regex | Extract A-D/A-J from an "Answer: X" pattern |
| answer_line | Extract the answer after an "Answer:" line |
| numeric_match | Last number in the response |
| | Substring containment with aliases |
| | Docker-sandboxed code execution |
| | Flag for LLM-as-judge post-processing |
All are importable from the top-level package:
from nemo_evaluator import exact_match, multichoice_regex, numeric_match
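Each primitive takes the ScorerInput and returns the same result dict your scorer would, so they can be composed. Below is a sketch that tries a strict match first and falls back to numeric extraction, assuming the returned dict exposes the "correct" key described above.

from nemo_evaluator import scorer, ScorerInput, exact_match, numeric_match

@scorer
def lenient_math_scorer(sample: ScorerInput) -> dict:
    result = exact_match(sample)        # strict normalized string equality first
    if not result.get("correct"):
        result = numeric_match(sample)  # fall back to last-number extraction
    return result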
Extension Hooks#
prepare_row: Transform dataset rows#
Use prepare_row when the raw dataset needs restructuring before prompt formatting.
def shuffle_choices(row, idx, rng):
    """Shuffle GPQA answer choices and track the new correct index."""
    choices = [row["Correct Answer"], row["Incorrect Answer 1"],
               row["Incorrect Answer 2"], row["Incorrect Answer 3"]]
    rng.shuffle(choices)
    correct_idx = choices.index(row["Correct Answer"])
    return {**row, "A": choices[0], "B": choices[1],
            "C": choices[2], "D": choices[3],
            "answer": "ABCD"[correct_idx]}

@benchmark(
    name="gpqa",
    dataset="hf://Idavidrein/gpqa?config=gpqa_diamond&split=train",
    prompt=PROMPT_TEMPLATE,
    target_field="answer",
    prepare_row=shuffle_choices,
)
@scorer
def gpqa_scorer(sample: ScorerInput) -> dict:
    return multichoice_regex(sample)
seed_fn: Fully custom seed#
When you need complete control over prompt construction:
from nemo_evaluator import SeedResult

def custom_seed(row, idx):
    messages = [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": row["problem"]},
    ]
    return SeedResult(
        prompt=row["problem"],
        expected_answer=row["answer"],
        messages=messages,
        system="You are a math tutor.",
    )

@benchmark(name="custom", dataset="hf://...", prompt="", seed_fn=custom_seed)
@scorer
def custom_scorer(sample: ScorerInput) -> dict:
    return numeric_match(sample)
Real-World Examples#
All 15 built-in benchmarks use @benchmark + @scorer. See src/nemo_evaluator/benchmarks/ for reference implementations. Key techniques they demonstrate include:

- Docker-sandboxed test execution
- LLM-as-judge post-processing
- Math answer extraction
- Last-number extraction
Advanced: Subclass EvalEnvironment Directly#
For benchmarks that cannot be expressed with decorators (e.g., multi-turn, stateful),
subclass EvalEnvironment:
from nemo_evaluator import EvalEnvironment, SeedResult, VerifyResult, register

@register("my_complex_bench")
class MyComplexBenchmark(EvalEnvironment):
    def __init__(self):
        super().__init__()
        self._dataset = [...]

    async def seed(self, idx: int) -> SeedResult:
        row = self._dataset[idx]
        return SeedResult(prompt=row["prompt"], expected_answer=row["answer"])

    async def verify(self, response: str, expected: str, **meta) -> VerifyResult:
        correct = response.strip() == expected.strip()
        return VerifyResult(reward=1.0 if correct else 0.0)
The decorator path is preferred for single-turn benchmarks. Reserve subclassing for cases that genuinely need it.
Parametrizing a Built-in Benchmark from YAML#
Built-in environments registered with @register("<name>") can be tweaked
from a config without a new Python module. BenchmarkConfig accepts a
params: mapping whose entries are forwarded to the environment
constructor, filtered to the arguments it actually declares. Unknown keys
raise a clear TypeError at resolution time.
benchmarks:
  - name: nmp_harbor
    params:
      task_names: ["workspace-basic-cli"]  # run a single task
    solver:
      type: harbor
      service: anthropic
      agent: claude-code
    sandbox:
      type: docker
Scalar values are accepted for list parameters (task_names: workspace-basic-cli)
so the CLI override flag works too: -O benchmarks[0].params.task_names=workspace-basic-cli.
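The constructor filtering can be pictured roughly as follows; this is a minimal sketch of the idea, not the actual BenchmarkConfig resolution code.

import inspect

def resolve_params(env_cls, params: dict) -> dict:
    """Sketch: forward only constructor-declared params; reject unknown keys."""
    declared = set(inspect.signature(env_cls.__init__).parameters) - {"self"}
    unknown = set(params) - declared
    if unknown:
        raise TypeError(f"{env_cls.__name__} got unknown params: {sorted(unknown)}")
    return {key: value for key, value in params.items() if key in declared}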
Wrapping a Built-in Benchmark with Extra Pre-Build Steps#
When a dataset needs a one-off base image produced from a Dockerfile in a
consumer repo, subclass the relevant environment under
src/nemo_evaluator/benchmarks/ and prepend your own ImageBuildRequest.
See src/nemo_evaluator/benchmarks/nmp_harbor.py for a working reference:
- Subclasses HarborEnvironment and registers under @register("nmp_harbor").
- Reads NMP_REPO (or the nmp_repo param) and points the dataset at $NMP_REPO/tests/agentic-use.
- Overrides image_build_requests() to prepend a single ImageBuildRequest that builds nmp-harbor:latest from $NMP_REPO/Dockerfile.harbor before any per-task image build runs.
benchmarks:
  - name: nmp_harbor  # reads $NMP_REPO; no manual docker build
    solver:
      type: harbor
      service: anthropic
      agent: claude-code
    sandbox:
      type: docker
The sandbox manager treats the prepended request identically to any other
image build, so the same config works against Docker, ECR/ECS Fargate, etc.
See examples/configs/10_nmp_harbor.yaml.
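A wrapper of your own follows the same shape. The sketch below shows the idea only; the import locations, the ImageBuildRequest constructor arguments, and the image_build_requests() return convention are assumptions, so check nmp_harbor.py for the real signatures.

import os

from nemo_evaluator import register
# Import locations for HarborEnvironment and ImageBuildRequest are assumed here;
# see src/nemo_evaluator/benchmarks/nmp_harbor.py for the actual ones.
from nemo_evaluator import HarborEnvironment, ImageBuildRequest

@register("my_harbor_wrapper")  # hypothetical benchmark name
class MyHarborWrapper(HarborEnvironment):
    def image_build_requests(self):
        repo = os.environ["MY_REPO"]  # hypothetical env var, analogous to NMP_REPO
        base_image = ImageBuildRequest(  # argument names are assumptions
            tag="my-base:latest",
            dockerfile=os.path.join(repo, "Dockerfile.base"),
        )
        # Prepend so the base image builds before any per-task image build runs.
        return [base_image, *super().image_build_requests()]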