Bring Your Own Benchmark (BYOB)#
Create custom evaluation benchmarks in ~12 lines of Python using decorators, built-in scorers, and one-command containerization.
New to BYOB? See the quickstart below to create your first benchmark.
Prerequisites#
- Python 3.10+
- NeMo Evaluator installed (`pip install nemo-evaluator`)
- An OpenAI-compatible model endpoint
Quickstart#
Step 1 – Write your benchmark#
Create a file called `my_benchmark.py`:
```python
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput

@benchmark(
    name="my-qa",
    dataset="data.jsonl",
    prompt="Q: {question}\nA:",
    target_field="answer",
)
@scorer
def check(sample: ScorerInput) -> dict:
    return {"correct": sample.target.lower() in sample.response.lower()}
```
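The benchmark above expects `data.jsonl` to contain one JSON object per line whose fields match the prompt template (`{question}`) and `target_field` (`answer`). A minimal sketch with illustrative values, plus the case-insensitive substring check the quickstart scorer performs:

```python
import json

# Two illustrative rows for data.jsonl; field names match the
# quickstart's prompt template and target_field.
rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
]

# JSONL: one JSON object per line.
with open("data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# The quickstart scorer marks a sample correct when the target string
# appears (case-insensitively) in the model response.
response = "The capital of France is Paris."
print(rows[0]["answer"].lower() in response.lower())  # True
```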
Step 2 – Compile#
```shell
nemo-evaluator-byob my_benchmark.py
```
Step 3 – Run#
```shell
nemo-evaluator run_eval \
  --eval_type byob_my_qa.my-qa \
  --model_url http://localhost:8000 \
  --model_id my-model \
  --model_type chat \
  --output_dir ./results \
  --api_key_name API_KEY
```
Tip
Use `nemo-evaluator-byob my_benchmark.py --dry-run` to validate your benchmark without installing it.
Reference Documentation#
- Define benchmarks with the `@benchmark` decorator.
- Built-in scorers and custom scoring functions.
- Judge-based evaluation with LLMs.
- Dataset formats, HuggingFace URIs, and field mapping.
- Compile, validate, list, and containerize benchmarks.
- Package benchmarks as Docker images.
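To illustrate the custom-scoring idea above, here is a sketch of a scorer that returns several metrics at once. The `ScorerInput` stand-in below is an assumption modeled on the quickstart, which treats it as carrying the model `response` and the reference `target` as strings; in a real benchmark you would import it from `nemo_evaluator.contrib.byob` instead.

```python
from dataclasses import dataclass

# Stand-in for BYOB's ScorerInput, assuming it exposes the model
# response and the reference target as strings (as in the quickstart).
@dataclass
class ScorerInput:
    response: str
    target: str

# A scorer may return multiple metrics in one dict, keyed by metric name.
def exact_and_substring(sample: ScorerInput) -> dict:
    resp = sample.response.strip().lower()
    tgt = sample.target.strip().lower()
    return {
        "exact_match": resp == tgt,
        "contains": tgt in resp,
    }

print(exact_and_substring(ScorerInput(response="  Paris ", target="paris")))
# {'exact_match': True, 'contains': True}
```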
Examples#
Complete annotated examples are available in the source repository under `packages/nemo-evaluator/examples/byob/`:

- MedMCQA – HuggingFace dataset with field mapping and a custom letter-extraction scorer
- Global MMLU Lite – Multilingual MMLU with per-category scoring breakdowns
- TruthfulQA – LLM-as-Judge with a custom template and `**template_kwargs`
- Math Reasoning – Numeric extraction with tolerance comparison
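In the spirit of the Math Reasoning example, a numeric-extraction scorer can pull the last number from the response and compare it to the target within a tolerance. This is an illustrative sketch with hypothetical names, not the code shipped in the example:

```python
import re

# Extract the last number in the response and compare to the target
# within an absolute tolerance. Function name and signature are
# illustrative, not part of the BYOB API.
def score_numeric(response: str, target: str, tol: float = 1e-6) -> dict:
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not nums:
        return {"correct": False}
    return {"correct": abs(float(nums[-1]) - float(target)) <= tol}

print(score_numeric("The answer is 3.14", "3.14"))  # {'correct': True}
print(score_numeric("I think it's 2.71", "3.14"))   # {'correct': False}
```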