# Built-in Benchmarks
All 15 built-in benchmarks are defined with `@benchmark` + `@scorer` in `src/nemo_evaluator/benchmarks/`.
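As a rough illustration of the registration pattern, a definition might look something like the following; the import path, decorator arguments, and field names here are assumptions rather than the actual API (see the BYOB tutorial linked at the end of this page for the real walkthrough):

```python
# Illustrative sketch only: the import path, decorator signatures, and helper
# names are assumptions, not the actual nemo_evaluator API.
from nemo_evaluator.benchmarks import benchmark, scorer  # assumed import path

@scorer
def exact_match(response: str, row: dict) -> float:
    """Score 1.0 when the stripped response equals the reference answer."""
    return float(response.strip() == row["answer"].strip())

@benchmark(
    name="my_benchmark",                    # hypothetical benchmark name
    dataset="hf://org/dataset?split=test",  # hypothetical dataset URI
    scorer=exact_match,
)
def prepare_row(row: dict) -> dict:
    """Map a raw dataset row into the fields the prompt template expects."""
    return {"question": row["question"], "answer": row["answer"]}
```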
## Quick Reference
| Benchmark | Command | Scoring | Type |
|---|---|---|---|
| MMLU | | `multichoice_regex` | Multichoice (4-way) |
| MMLU-Pro | | `multichoice_regex` | Multichoice (10-way) |
| MATH-500 | | `answer_line` | Math |
| GPQA Diamond | | `multichoice_regex` | Multichoice (shuffled) |
| GSM8K | | `numeric_match` | Math reasoning |
| DROP | | `fuzzy_match` | Reading comprehension |
| MGSM | | `numeric_match` | Multilingual math |
| TriviaQA | | `fuzzy_match` | Factual QA |
| HumanEval | | `code_sandbox` | Code generation (Docker) |
| SimpleQA | | `needs_judge` | Factuality (LLM judge) |
| HealthBench | | `needs_judge` | Health (LLM judge) |
| PinchBench | | | Agentic tasks (code/LLM judge) |
| XSTest | | | Safety |
| SWE-bench Verified | | | Software engineering (Docker) |
| SWE-bench Multilingual | | | Software engineering, multi-lang (Docker) |
## Extended Environments
Beyond the 15 built-in benchmarks, NEL resolves additional environment types via URI schemes and namespace prefixes:
| Syntax | Source | Example |
|---|---|---|
| | Built-in registry | |
| | lm-evaluation-harness | |
| | NeMo Skills | |
| | VLMEvalKit | |
| | Remote Gym server | |
| | Legacy container | |
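As a rough illustration of prefix-based resolution, a resolver might dispatch on the namespace like this; the prefix strings and loader names below are hypothetical placeholders, not NEL's actual schemes:

```python
# Hypothetical sketch of namespace-prefix dispatch; these prefix strings and
# loader functions are placeholders, not NEL's actual resolver.
from typing import Callable

def load_builtin(name: str): ...      # built-in registry
def load_lm_eval(name: str): ...      # lm-evaluation-harness adapter
def load_nemo_skills(name: str): ...  # NeMo Skills adapter

RESOLVERS: dict[str, Callable] = {
    "builtin": load_builtin,
    "lm-eval": load_lm_eval,
    "nemo-skills": load_nemo_skills,
}

def resolve(spec: str):
    """Split 'prefix::name' and dispatch to the matching loader."""
    prefix, _, name = spec.partition("::")
    try:
        return RESOLVERS[prefix](name)
    except KeyError:
        raise ValueError(f"Unknown environment prefix: {prefix!r}")
```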
## Benchmark Details
### MMLU

Massive Multitask Language Understanding – 14K 4-choice questions across 57 subjects.

- **Dataset**: `hf://cais/mmlu?config=all&split=test`
- **Scorer**: `multichoice_regex` – extracts the letter (A-D) from the "Answer: X" pattern
- **prepare_row**: Unpacks the `choices` list into `A`, `B`, `C`, `D` fields; maps the numeric `answer` to a letter (see the sketch below)
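A minimal sketch of that row transformation, assuming the Hugging Face `cais/mmlu` schema (`question`, `choices`, integer `answer`); the actual function signature in `src/nemo_evaluator/benchmarks/` may differ:

```python
# Sketch of the MMLU-style row preparation described above; the real
# implementation may use different field names and a different signature.
def prepare_row(row: dict) -> dict:
    letters = ["A", "B", "C", "D"]
    out = {"question": row["question"]}
    # Unpack the choices list into individual A-D fields.
    for letter, choice in zip(letters, row["choices"]):
        out[letter] = choice
    # Map the numeric answer index (0-3) to its letter.
    out["answer"] = letters[row["answer"]]
    return out
```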
### MMLU-Pro

10-choice variant of MMLU with harder distractors.

- **Dataset**: `hf://TIGER-Lab/MMLU-Pro?split=test`
- **Scorer**: `multichoice_regex` with extended pattern `[A-J]`
- **prepare_row**: Pads choices to 10 slots (see the sketch below)
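The padding step could look roughly like this; the `options` field name and the letter-valued `answer` are assumptions about the MMLU-Pro row layout:

```python
# Sketch of padding MMLU-Pro options to 10 slots; field names are assumptions.
import string

def prepare_row(row: dict) -> dict:
    letters = string.ascii_uppercase[:10]          # "ABCDEFGHIJ"
    options = list(row["options"])
    options += [""] * (10 - len(options))          # pad to 10 slots
    out = {"question": row["question"], "answer": row["answer"]}
    out.update(zip(letters, options))
    return out
```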
### MATH-500

500 competition-level math problems.

- **Dataset**: `hf://HuggingFaceH4/MATH-500?split=test`
- **Scorer**: `answer_line` – extracts the answer after "Answer:" and normalizes it (see the sketch below)
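A rough sketch of an answer-line scorer; the real `answer_line` normalization is likely more involved than the light cleanup shown here:

```python
# Sketch of an answer-line scorer as described above; normalization rules
# in the real scorer may be richer (LaTeX, fractions, units, ...).
import re

def answer_line_score(response: str, reference: str) -> float:
    matches = re.findall(r"Answer:\s*(.+)", response)
    if not matches:
        return 0.0
    # Take the last "Answer:" line and apply light normalization.
    pred = matches[-1].strip().rstrip(".").replace(" ", "").lower()
    ref = reference.strip().rstrip(".").replace(" ", "").lower()
    return float(pred == ref)
```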
### GPQA Diamond

Graduate-level science questions with shuffled answer choices.

- **Dataset**: `hf://Idavidrein/gpqa?config=gpqa_diamond&split=train`
- **Scorer**: `multichoice_regex`
- **prepare_row**: Shuffles choices with a seeded RNG to prevent position bias (see the sketch below)
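One way the seeded shuffle might be implemented; the field names and seed derivation below are assumptions, not the actual code:

```python
# Sketch of a seeded choice shuffle; field names and seed scheme are assumptions.
import hashlib
import random

def prepare_row(row: dict, seed: int = 1234) -> dict:
    choices = [row["Correct Answer"], row["Incorrect Answer 1"],
               row["Incorrect Answer 2"], row["Incorrect Answer 3"]]
    # Derive a stable per-row seed so the shuffle is reproducible across runs.
    digest = hashlib.sha256(row["Question"].encode()).hexdigest()
    rng = random.Random(seed + int(digest[:8], 16))
    rng.shuffle(choices)
    letters = ["A", "B", "C", "D"]
    return {
        "question": row["Question"],
        "answer": letters[choices.index(row["Correct Answer"])],
        **dict(zip(letters, choices)),
    }
```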
### GSM8K

1,319 grade-school math problems requiring multi-step reasoning.

- **Dataset**: `hf://openai/gsm8k?split=test`
- **Scorer**: `numeric_match` – extracts the last number from the response (see the sketch below)
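A sketch of last-number extraction along the lines described; the actual `numeric_match` regex and tolerance rules may differ:

```python
# Sketch of last-number matching as described above; not the exact scorer.
import re

def numeric_match_score(response: str, reference: str) -> float:
    number_re = r"-?\d[\d,]*(?:\.\d+)?"
    pred_numbers = re.findall(number_re, response)
    ref_numbers = re.findall(number_re, reference)
    if not pred_numbers or not ref_numbers:
        return 0.0
    # Compare the last number in each string, ignoring thousands separators.
    pred = float(pred_numbers[-1].replace(",", ""))
    ref = float(ref_numbers[-1].replace(",", ""))
    return float(pred == ref)
```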
### DROP

Reading comprehension with discrete reasoning (counting, sorting, arithmetic).

- **Dataset**: `hf://ucinlp/drop?split=validation`
- **Scorer**: `fuzzy_match` with answer aliases
### MGSM

Multilingual GSM8K – math problems in 10 languages.

- **Dataset**: `hf://juletxara/mgsm?split=test`
- **Scorer**: `numeric_match`
### TriviaQA

Trivia questions with multiple acceptable answer aliases.

- **Dataset**: `hf://trivia_qa?config=rc.nocontext&split=validation`
- **Scorer**: `fuzzy_match` – normalized substring matching against aliases (see the sketch below)
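A sketch of normalized substring matching against aliases; the actual `fuzzy_match` normalization (article stripping, Unicode handling, etc.) may be richer:

```python
# Sketch of alias-based fuzzy matching as described above; not the exact scorer.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def fuzzy_match_score(response: str, aliases: list[str]) -> float:
    norm_response = normalize(response)
    # Credit the response if any normalized alias appears as a substring.
    return float(any(normalize(alias) in norm_response for alias in aliases))
```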
### HumanEval

164 Python function-completion problems with test suites.

- **Dataset**: `hf://openai/openai_humaneval?split=test`
- **Scorer**: `code_sandbox` – extracts code from markdown fences, runs it in Docker with network isolation, memory limits, and timeouts (see the sketch below)
- **Requires**: Docker daemon
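A rough sketch of the sandboxed execution flow; the Docker image, resource limits, and mount layout below are illustrative assumptions, not the actual `code_sandbox` configuration:

```python
# Sketch of fenced-code extraction plus isolated Docker execution; the image,
# limits, and mount layout are assumptions, not the real code_sandbox config.
import re
import subprocess
import tempfile

def run_in_sandbox(response: str, test_code: str, timeout: int = 30) -> bool:
    # Pull the first fenced code block out of the model response.
    fence = "`" * 3
    match = re.search(rf"{fence}(?:python)?\n(.*?){fence}", response, re.DOTALL)
    program = (match.group(1) if match else response) + "\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    # Run inside Docker with no network and a memory cap; fail on timeout.
    try:
        result = subprocess.run(
            ["docker", "run", "--rm", "--network", "none", "--memory", "512m",
             "-v", f"{path}:/work/solution.py:ro", "python:3.11-slim",
             "python", "/work/solution.py"],
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```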
### SimpleQA

Short-form factuality questions requiring LLM-as-judge scoring.

- **Dataset**: `hf://basicv8vc/SimpleQA?split=test`
- **Scorer**: `needs_judge` – flags samples for post-processing by the judge pipeline (see the sketch below)
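A sketch of what the judge hand-off might look like; the flag name, field names, and record shape are assumptions about the judge pipeline's input:

```python
# Sketch of a needs_judge-style hand-off; field and flag names are assumptions.
def needs_judge_score(response: str, row: dict) -> dict:
    # No score is computed here; the sample is deferred to the LLM judge stage.
    return {
        "needs_judge": True,
        "question": row["problem"],
        "reference": row["answer"],
        "response": response,
    }
```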
### HealthBench

Medical accuracy questions requiring LLM-as-judge scoring.

- **Dataset**: `hf://openai/HealthBench?split=test`
- **Scorer**: `needs_judge`
## Adding a New Benchmark
See the Write Your Own Benchmark (BYOB) tutorial for a complete walkthrough of the `@benchmark` + `@scorer` API.