Write Your Own Benchmark (BYOB)#
Define a complete benchmark with @benchmark + @scorer in under 10 lines.
Minimal Example#
from nemo_evaluator import benchmark, scorer, ScorerInput, exact_match

@benchmark(
    name="my-bench",
    dataset="hf://my-org/my-dataset?split=test",
    prompt="Question: {question}\nAnswer:",
    target_field="answer",
)
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return exact_match(sample)
Run it:
nel eval run --bench my-bench
That is the entire workflow. The @benchmark decorator registers an environment, loads
the dataset, formats prompts, and wires scoring. No subclass boilerplate required.
How It Works#
flowchart LR
A["@benchmark"] --> B["ByobEnvironment"]
B --> C["seed(idx)"]
B --> D["verify(response, expected)"]
C --> E["Prompt + Expected Answer"]
D --> F["@scorer function"]
F --> G["Reward + Scoring Details"]
- @benchmark creates a ByobEnvironment (subclass of EvalEnvironment) and registers it by name.
- On seed(idx), the environment loads the dataset row, formats the prompt template, and returns a SeedResult.
- On verify(response, expected), it calls your @scorer function with a ScorerInput and converts the result to a VerifyResult.
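Conceptually, the generated environment behaves like the sketch below. This is a simplified illustration rather than the actual ByobEnvironment code, and the ScorerInput keyword arguments are assumed purely for the example.

from nemo_evaluator import ScorerInput, SeedResult, VerifyResult

class SketchedByobEnvironment:
    """Simplified picture of what @benchmark wires up (not the real class)."""

    def __init__(self, rows, prompt, target_field, scorer_fn):
        self._rows = rows                  # dataset rows loaded from the dataset spec
        self._prompt = prompt              # Python format string
        self._target_field = target_field  # field holding the expected answer
        self._scorer_fn = scorer_fn        # your @scorer function

    async def seed(self, idx: int) -> SeedResult:
        row = self._rows[idx]
        return SeedResult(
            prompt=self._prompt.format(**row),
            expected_answer=row[self._target_field],
        )

    async def verify(self, response: str, expected: str, **meta) -> VerifyResult:
        # ScorerInput keyword names below are assumptions for illustration only.
        result = self._scorer_fn(ScorerInput(response=response, expected=expected))
        reward = float(result.get("correct", result.get("reward", 0.0)))
        return VerifyResult(reward=reward)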
Step-by-Step#
Step 1: Create the benchmark file#
Create src/nemo_evaluator/benchmarks/my_reasoning.py:
from nemo_evaluator import benchmark, scorer, ScorerInput, answer_line

@benchmark(
    name="my_reasoning",
    dataset="hf://your-org/reasoning-2026?split=test",
    prompt=(
        "Solve the following problem step by step.\n\n"
        "{question}\n\n"
        "Put your final answer after 'Answer:'."
    ),
    target_field="answer",
)
@scorer
def my_reasoning_scorer(sample: ScorerInput) -> dict:
    return answer_line(sample)
Step 2: Register the import#
Add the import to src/nemo_evaluator/benchmarks/__init__.py:
from nemo_evaluator.benchmarks.my_reasoning import my_reasoning_scorer
Step 3: Validate#
nel validate -b my_reasoning --samples 10
Expected output:
my_reasoning: 10 samples
7/10 correct
[PASS] p0: expected='42' got='42' (1230ms 156tok)
[PASS] p1: expected='7/3' got='7/3' (980ms 134tok)
[FAIL] p2: expected='256' got='512' (1100ms 201tok)
...
Step 4: Run full evaluation#
nel eval run --bench my_reasoning --repeats 4 --output-dir ./results/my_reasoning
Step 5: Serve for Gym training#
nel serve -b my_reasoning -p 9090
Gym training connects at http://hostname:9090.
Decorator Reference#
@benchmark parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Environment name used by the CLI (--bench / -b) and in configs |
| dataset | str or callable | Yes | HuggingFace URI (hf://...), local JSONL path, or callable (see Dataset Specs) |
| prompt | str | Yes | Python format string using dataset field names |
| target_field | str | No | Dataset field containing the expected answer |
| | | No | System message prepended to the conversation |
| | | No | Rename dataset fields before prompt formatting |
| prepare_row | callable | No | Transform dataset rows before prompt formatting (see Extension Hooks) |
| seed_fn | callable | No | Fully custom seed construction (see Extension Hooks) |
| | | No | Arbitrary config passed to the scorer via ScorerInput |
| | | No | Python packages required by this benchmark |
@scorer function signature#
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    ...
The function receives a ScorerInput:
- The model's raw response text
- The expected answer from the dataset
- Row metadata (all non-target fields)
Return a dict. The key "correct" (or "reward") is converted to the numeric reward.
Optionally include "extracted" for the extracted answer string.
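As a concrete sketch of that contract, a scorer that does its own extraction can return all three keys. The sample.response and sample.expected attribute names here are assumptions made for illustration; the built-in primitives described below handle this access for you.

from nemo_evaluator import ScorerInput

def my_answer_scorer(sample: ScorerInput) -> dict:
    # Attribute names on ScorerInput are assumed for illustration only.
    predicted = (sample.response.strip().splitlines() or [""])[-1]  # naive: take the last line
    expected = sample.expected.strip()
    return {
        "correct": predicted == expected,  # converted to the numeric reward
        "extracted": predicted,            # surfaced in the scoring details
    }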
Dataset Specs#
| Format | Example |
|---|---|
| HuggingFace | hf://my-org/my-dataset?split=test |
| Local JSONL | |
| Callable | |
HuggingFace URIs support split and config query parameters.
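For illustration, here are the three shapes a dataset spec can take. Only the HuggingFace URI is taken from the examples on this page; the JSONL path and the loader function are hypothetical placeholders.

# 1. HuggingFace URI with config and split query parameters
hf_spec = "hf://Idavidrein/gpqa?config=gpqa_diamond&split=train"

# 2. Local JSONL file (hypothetical path; one JSON object per line)
jsonl_spec = "./data/my_eval.jsonl"

# 3. Callable returning dataset rows (hypothetical loader)
def load_rows() -> list[dict]:
    return [{"question": "2 + 2 = ?", "answer": "4"}]

# Any of these can be passed as the dataset= argument of @benchmark.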
Scoring Primitives#
Built-in functions you can call from your scorer:
| Function | Use case |
|---|---|
| exact_match | Normalized string equality |
| multichoice_regex | Extract A-D/A-J from an "Answer: X" pattern |
| answer_line | Extract the answer after an "Answer:" line |
| numeric_match | Last number in the response |
| | Substring containment with aliases |
| | Docker-sandboxed code execution |
| | Flag for LLM-as-judge post-processing |
All are importable from the top-level package:
from nemo_evaluator import exact_match, multichoice_regex, numeric_match
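Each primitive takes the ScorerInput and returns the same result dict your scorer would, so they can be composed. Below is a sketch that tries a strict match first and falls back to numeric extraction, assuming the returned dict exposes the "correct" key described above.

from nemo_evaluator import scorer, ScorerInput, exact_match, numeric_match

@scorer
def lenient_math_scorer(sample: ScorerInput) -> dict:
    result = exact_match(sample)        # strict normalized string equality first
    if not result.get("correct"):
        result = numeric_match(sample)  # fall back to last-number extraction
    return result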
Extension Hooks#
prepare_row: Transform dataset rows#
Use prepare_row when the raw dataset needs restructuring before prompt formatting.
def shuffle_choices(row, idx, rng):
    """Shuffle GPQA answer choices and track the new correct index."""
    choices = [row["Correct Answer"], row["Incorrect Answer 1"],
               row["Incorrect Answer 2"], row["Incorrect Answer 3"]]
    rng.shuffle(choices)
    correct_idx = choices.index(row["Correct Answer"])
    return {**row, "A": choices[0], "B": choices[1],
            "C": choices[2], "D": choices[3],
            "answer": "ABCD"[correct_idx]}

@benchmark(
    name="gpqa",
    dataset="hf://Idavidrein/gpqa?config=gpqa_diamond&split=train",
    prompt=PROMPT_TEMPLATE,
    target_field="answer",
    prepare_row=shuffle_choices,
)
@scorer
def gpqa_scorer(sample: ScorerInput) -> dict:
    return multichoice_regex(sample)
seed_fn: Fully custom seed#
When you need complete control over prompt construction:
from nemo_evaluator import SeedResult

def custom_seed(row, idx):
    messages = [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": row["problem"]},
    ]
    return SeedResult(
        prompt=row["problem"],
        expected_answer=row["answer"],
        messages=messages,
        system="You are a math tutor.",
    )

@benchmark(name="custom", dataset="hf://...", prompt="", seed_fn=custom_seed)
@scorer
def custom_scorer(sample: ScorerInput) -> dict:
    return numeric_match(sample)
Real-World Examples#
All 15 built-in benchmarks use @benchmark + @scorer. See src/nemo_evaluator/benchmarks/ for reference implementations. Key techniques they demonstrate include:

- Docker-sandboxed test execution
- LLM-as-judge post-processing
- Math answer extraction
- Last-number extraction
Advanced: Subclass EvalEnvironment Directly#
For benchmarks that cannot be expressed with decorators (e.g., multi-turn, stateful),
subclass EvalEnvironment:
from nemo_evaluator import EvalEnvironment, SeedResult, VerifyResult, register

@register("my_complex_bench")
class MyComplexBenchmark(EvalEnvironment):
    def __init__(self):
        super().__init__()
        self._dataset = [...]

    async def seed(self, idx: int) -> SeedResult:
        row = self._dataset[idx]
        return SeedResult(prompt=row["prompt"], expected_answer=row["answer"])

    async def verify(self, response: str, expected: str, **meta) -> VerifyResult:
        correct = response.strip() == expected.strip()
        return VerifyResult(reward=1.0 if correct else 0.0)
The decorator path is preferred for single-turn benchmarks. Reserve subclassing for cases that genuinely need it.
Parametrizing a Built-in Benchmark from YAML#
Built-in environments registered with @register("<name>") can be tweaked
from a config without a new Python module. BenchmarkConfig accepts a
params: mapping whose entries are forwarded to the environment
constructor, filtered to the arguments it actually declares. Unknown keys
raise a clear TypeError at resolution time.
benchmarks:
  - name: nmp_harbor
    params:
      task_names: ["workspace-basic-cli"]  # run a single task
    solver:
      type: harbor
      service: anthropic
      agent: claude-code
    sandbox:
      type: docker
Scalar values are accepted for list parameters (task_names: workspace-basic-cli)
so the CLI override flag works too: -O benchmarks[0].params.task_names=workspace-basic-cli.
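The constructor filtering can be pictured roughly as follows; this is a minimal sketch of the idea, not the actual BenchmarkConfig resolution code.

import inspect

def resolve_params(env_cls, params: dict) -> dict:
    """Sketch: forward only constructor-declared params; reject unknown keys."""
    declared = set(inspect.signature(env_cls.__init__).parameters) - {"self"}
    unknown = set(params) - declared
    if unknown:
        raise TypeError(f"{env_cls.__name__} got unknown params: {sorted(unknown)}")
    return {key: value for key, value in params.items() if key in declared}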
Wrapping a Built-in Benchmark with Extra Pre-Build Steps#
When a dataset needs a one-off base image produced from a Dockerfile in a
consumer repo, subclass the relevant environment under
src/nemo_evaluator/benchmarks/ and prepend your own ImageBuildRequest.
See src/nemo_evaluator/benchmarks/nmp_harbor.py for a working reference:
- Subclasses HarborEnvironment and registers under @register("nmp_harbor").
- Reads NMP_REPO (or the nmp_repo param) and points the dataset at $NMP_REPO/tests/agentic-use.
- Overrides image_build_requests() to prepend a single ImageBuildRequest that builds nmp-harbor:latest from $NMP_REPO/Dockerfile.harbor before any per-task image build runs.
benchmarks:
  - name: nmp_harbor  # reads $NMP_REPO; no manual docker build
    solver:
      type: harbor
      service: anthropic
      agent: claude-code
    sandbox:
      type: docker
The sandbox manager treats the prepended request identically to any other
image build, so the same config works against Docker, ECR/ECS Fargate, etc.
See examples/configs/10_nmp_harbor.yaml.
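A wrapper of your own follows the same shape. The sketch below shows the idea only; the import locations, the ImageBuildRequest constructor arguments, and the image_build_requests() return convention are assumptions, so check nmp_harbor.py for the real signatures.

import os

from nemo_evaluator import register
# Import locations for HarborEnvironment and ImageBuildRequest are assumed here;
# see src/nemo_evaluator/benchmarks/nmp_harbor.py for the actual ones.
from nemo_evaluator import HarborEnvironment, ImageBuildRequest

@register("my_harbor_wrapper")  # hypothetical benchmark name
class MyHarborWrapper(HarborEnvironment):
    def image_build_requests(self):
        repo = os.environ["MY_REPO"]  # hypothetical env var, analogous to NMP_REPO
        base_image = ImageBuildRequest(  # argument names are assumptions
            tag="my-base:latest",
            dockerfile=os.path.join(repo, "Dockerfile.base"),
        )
        # Prepend so the base image builds before any per-task image build runs.
        return [base_image, *super().image_build_requests()]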