# Bring Your Own Benchmark (BYOB)
Create custom evaluation benchmarks for NeMo Evaluator with a few lines of Python code.
New to BYOB? See the quickstart below to create your first benchmark.
## Prerequisites
- Python 3.10+
- Activated virtual environment with NeMo Evaluator installed:

  ```bash
  pip install nemo-evaluator
  ```

- An OpenAI-compatible model endpoint
## Quickstart (5 minutes)
The following walkthrough uses the MedMCQA example – a medical multiple-choice QA benchmark sourced from HuggingFace.
### Step 1 – Write your benchmark
Create a `benchmark.py` file (see `medmcqa/benchmark.py` for the full source). It should define the dataset source, prompt template, and scoring logic for the evaluation.
MedMCQA dataset fields: `question` (exam question) · `opa`..`opd` (answer options A-D) · `cop` (correct option index, 0-3).

The benchmark below uses `field_mapping` to rename `opa`..`opd` to `a`..`d` for cleaner prompt placeholders, and `cop` as the target field. The `datasets` pip package is listed in `requirements` in order to enable access to the HuggingFace dataset.
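To make the renaming concrete, here is a minimal sketch of what `field_mapping` does to a record before the prompt template is filled in (the record values are illustrative, not an actual dataset row):

```python
# Illustrative MedMCQA-shaped record (made-up values, not a real dataset row).
record = {
    "question": "Deficiency of which vitamin causes scurvy?",
    "opa": "Vitamin A", "opb": "Vitamin B12",
    "opc": "Vitamin C", "opd": "Vitamin D",
    "cop": 2,  # index 2 -> option C
}

# field_mapping renames source fields to the prompt's placeholder names,
# so {a}..{d} in the template resolve to the four answer options.
field_mapping = {"opa": "a", "opb": "b", "opc": "c", "opd": "d"}
mapped = {field_mapping.get(key, key): value for key, value in record.items()}
# mapped == {"question": ..., "a": "Vitamin A", ..., "d": "Vitamin D", "cop": 2}
```

With that in mind, here is the full benchmark: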
```python
import re

from nemo_evaluator.contrib.byob import ScorerInput, benchmark, scorer

# Map HF integer answer codes to letters
_COP_TO_LETTER = {"0": "A", "1": "B", "2": "C", "3": "D"}


@benchmark(
    name="medmcqa",
    dataset="hf://openlifescienceai/medmcqa?split=validation",
    prompt=(
        "You are a medical expert taking a licensing examination.\n\n"
        "Question: {question}\n\n"
        "A) {a}\n"
        "B) {b}\n"
        "C) {c}\n"
        "D) {d}\n\n"
        "Answer with just the letter (A, B, C, or D):"
    ),
    target_field="cop",
    endpoint_type="chat",
    requirements=["datasets"],
    field_mapping={"opa": "a", "opb": "b", "opc": "c", "opd": "d"},
)
@scorer
def medmcqa_scorer(sample: ScorerInput) -> dict:
    response_clean = sample.response.strip()

    # Try: first character is A-D
    if response_clean and response_clean[0].upper() in "ABCD":
        predicted = response_clean[0].upper()
    else:
        # Try: find "answer is X" or standalone letter
        match = re.search(
            r"(?:answer\s+is\s+|^\s*\(?)\s*([A-Da-d])\b",
            response_clean,
            re.IGNORECASE,
        )
        if match:
            predicted = match.group(1).upper()
        else:
            # Last resort: find any standalone A-D in first 50 chars
            match = re.search(r"\b([A-Da-d])\b", response_clean[:50])
            predicted = match.group(1).upper() if match else ""

    # Convert HF integer target (0-3) to letter (A-D)
    target_str = str(sample.target).strip()
    target_letter = _COP_TO_LETTER.get(target_str, target_str.upper())

    return {
        "correct": predicted == target_letter,
        "parsed": bool(predicted),
    }
```
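Before compiling, you can sanity-check the letter-extraction logic on a few typical model responses. This is a standalone sketch that mirrors the scorer's first two parsing steps; the sample responses are invented:

```python
import re

# Same pattern as the scorer's second parsing step.
pattern = re.compile(r"(?:answer\s+is\s+|^\s*\(?)\s*([A-Da-d])\b", re.IGNORECASE)

for response in ["B) Vitamin C", "The answer is c.", "(d) because it fits"]:
    cleaned = response.strip()
    if cleaned and cleaned[0].upper() in "ABCD":
        # First parsing step: leading letter.
        predicted = cleaned[0].upper()
    else:
        # Second parsing step: "answer is X" or a parenthesized letter.
        match = pattern.search(cleaned)
        predicted = match.group(1).upper() if match else ""
    print(f"{response!r} -> {predicted}")
# 'B) Vitamin C' -> B, 'The answer is c.' -> C, '(d) because it fits' -> D
```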
### Step 2 – Compile
From the `medmcqa/` directory:

```bash
nemo-evaluator-byob benchmark.py
```

This compiles and auto-installs the package via `pip install` (no `PYTHONPATH` setup needed). Once installed, the benchmark can be run via the eval type shown in Step 3 (`byob_medmcqa.medmcqa`).
### Step 3 – Run
```bash
nemo-evaluator run_eval \
  --eval_type byob_medmcqa.medmcqa \
  --model_url http://localhost:8000 \
  --model_id my-model \
  --model_type chat \
  --output_dir ./results \
  --api_key_name API_KEY
```
> **Tip:** Use `nemo-evaluator-byob benchmark.py --dry-run` to validate your benchmark without installing it.
## Reference Documentation
- Define benchmarks with the `@benchmark` decorator.
- Built-in scorers and custom scoring functions.
- Judge-based evaluation with LLMs.
- Dataset formats, HuggingFace URIs, and field mapping.
- Compile, validate, list, and containerize benchmarks.
- Package benchmarks as Docker images.
## Examples
Complete annotated examples are available in the source repository under `packages/nemo-evaluator/examples/byob/`:
- MedMCQA – HuggingFace dataset with field mapping and a custom letter-extraction scorer
- Global MMLU Lite – Multilingual MMLU with per-category scoring breakdowns
- TruthfulQA – LLM-as-Judge with a custom template and `**template_kwargs`
- Math Reasoning – Numeric extraction with tolerance comparison (see the sketch below)
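As a taste of the Math Reasoning example, a numeric-tolerance scorer can be written in the same `@scorer` style used above. This is a minimal sketch; the extraction regex and tolerance are illustrative, not the example's actual code:

```python
import re

from nemo_evaluator.contrib.byob import ScorerInput, scorer


@scorer
def math_scorer(sample: ScorerInput) -> dict:
    # Take the last number in the response; models often restate intermediate
    # values, so the final one is usually the answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", sample.response)
    if not numbers:
        return {"correct": False, "parsed": False}

    predicted = float(numbers[-1])
    target = float(sample.target)  # assumes the target field is numeric

    # Compare with a small relative tolerance instead of exact equality,
    # so 0.3333333 matches a target of 1/3 despite rounding.
    tolerance = 1e-6 * max(1.0, abs(target))
    return {"correct": abs(predicted - target) <= tolerance, "parsed": True}
```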