Bring Your Own Benchmark (BYOB)#
Create custom evaluation benchmarks in ~12 lines of Python using decorators, built-in scorers, and one-command containerization.
New to BYOB? See the quickstart below to create your first benchmark.
Prerequisites#
- Python 3.10+
- NeMo Evaluator installed (`pip install nemo-evaluator`)
- An OpenAI-compatible model endpoint
Quickstart#
Step 1 – Write your benchmark#
Create a file called `my_benchmark.py`:
```python
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput

@benchmark(
    name="my-qa",
    dataset="data.jsonl",
    prompt="Q: {question}\nA:",
    target_field="answer",
)
@scorer
def check(sample: ScorerInput) -> dict:
    return {"correct": sample.target.lower() in sample.response.lower()}
```
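The benchmark above expects `data.jsonl` to contain one JSON object per line whose fields match the prompt template (`{question}`) and `target_field` (`answer`). A minimal sketch with illustrative values, plus the case-insensitive substring check the quickstart scorer performs:

```python
import json

# Two illustrative rows for data.jsonl; field names match the
# quickstart's prompt template and target_field.
rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
]

# JSONL: one JSON object per line.
with open("data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# The quickstart scorer marks a sample correct when the target string
# appears (case-insensitively) in the model response.
response = "The capital of France is Paris."
print(rows[0]["answer"].lower() in response.lower())  # True
```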
Step 2 – Compile#
```shell
nemo-evaluator-byob my_benchmark.py
```
Step 3 – Run#
```shell
nemo-evaluator run_eval \
  --eval_type byob_my_qa.my-qa \
  --model_url http://localhost:8000 \
  --model_id my-model \
  --model_type chat \
  --output_dir ./results \
  --api_key_name API_KEY
```
Tip
Use `nemo-evaluator-byob my_benchmark.py --dry-run` to validate your benchmark without installing it.
Reference Documentation#
- Define benchmarks with the `@benchmark` decorator.
- Built-in scorers and custom scoring functions.
- Judge-based evaluation with LLMs.
- Dataset formats, HuggingFace URIs, and field mapping.
- Compile, validate, list, and containerize benchmarks.
- Package benchmarks as Docker images.
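To illustrate the custom-scoring idea above, here is a sketch of a scorer that returns several metrics at once. The `ScorerInput` stand-in below is an assumption modeled on the quickstart, which treats it as carrying the model `response` and the reference `target` as strings; in a real benchmark you would import it from `nemo_evaluator.contrib.byob` instead.

```python
from dataclasses import dataclass

# Stand-in for BYOB's ScorerInput, assuming it exposes the model
# response and the reference target as strings (as in the quickstart).
@dataclass
class ScorerInput:
    response: str
    target: str

# A scorer may return multiple metrics in one dict, keyed by metric name.
def exact_and_substring(sample: ScorerInput) -> dict:
    resp = sample.response.strip().lower()
    tgt = sample.target.strip().lower()
    return {
        "exact_match": resp == tgt,
        "contains": tgt in resp,
    }

print(exact_and_substring(ScorerInput(response="  Paris ", target="paris")))
# {'exact_match': True, 'contains': True}
```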
Examples#
Complete annotated examples are available in the source repository under `packages/nemo-evaluator/examples/byob/`:

- MedMCQA – HuggingFace dataset with field mapping and a custom letter-extraction scorer
- Global MMLU Lite – Multilingual MMLU with per-category scoring breakdowns
- TruthfulQA – LLM-as-Judge with a custom template and `**template_kwargs`
- Math Reasoning – Numeric extraction with tolerance comparison
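In the spirit of the Math Reasoning example, a numeric-extraction scorer can pull the last number from the response and compare it to the target within a tolerance. This is an illustrative sketch with hypothetical names, not the code shipped in the example:

```python
import re

# Extract the last number in the response and compare to the target
# within an absolute tolerance. Function name and signature are
# illustrative, not part of the BYOB API.
def score_numeric(response: str, target: str, tol: float = 1e-6) -> dict:
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not nums:
        return {"correct": False}
    return {"correct": abs(float(nums[-1]) - float(target)) <= tol}

print(score_numeric("The answer is 3.14", "3.14"))  # {'correct': True}
print(score_numeric("I think it's 2.71", "3.14"))   # {'correct': False}
```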