# Built-in Benchmarks

All 15 built-in benchmarks are defined with the `@benchmark` and `@scorer` decorators in `src/nemo_evaluator/benchmarks/`.
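
To give a feel for the shape of a definition, here is a minimal sketch. The decorator names come from this page, but the import path, signatures, and field names are assumptions; see the BYOB tutorial at the end of this page for the actual API.

```python
# Hypothetical sketch only -- import path and signatures are assumed,
# not taken from the library. See the BYOB tutorial for the real API.
from nemo_evaluator import benchmark, scorer

@scorer
def exact_match(response: str, target: str) -> bool:
    # Trivial illustrative scorer: normalized string equality.
    return response.strip().lower() == target.strip().lower()

@benchmark(
    name="my_bench",
    dataset="hf://org/dataset?split=test",  # dataset URI style used below
    scorer=exact_match,
)
def prepare_row(row: dict) -> dict:
    # Map raw dataset fields to the prompt/target the runner expects.
    return {"prompt": row["question"], "target": row["answer"]}
```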

## Quick Reference

| Benchmark | Command | Scoring | Type |
| --- | --- | --- | --- |
| MMLU | `nel eval run --bench mmlu` | `multichoice_regex` | Multichoice (4-way) |
| MMLU-Pro | `nel eval run --bench mmlu_pro` | `multichoice_regex` | Multichoice (10-way) |
| MATH-500 | `nel eval run --bench math500` | `answer_line` | Math |
| GPQA Diamond | `nel eval run --bench gpqa` | `multichoice_regex` | Multichoice (shuffled) |
| GSM8K | `nel eval run --bench gsm8k` | `numeric_match` | Math reasoning |
| DROP | `nel eval run --bench drop` | `fuzzy_match` | Reading comprehension |
| MGSM | `nel eval run --bench mgsm` | `numeric_match` | Multilingual math |
| TriviaQA | `nel eval run --bench triviaqa` | `fuzzy_match` | Factual QA |
| HumanEval | `nel eval run --bench humaneval` | `code_sandbox` | Code generation (Docker) |
| SimpleQA | `nel eval run --bench simpleqa` | `needs_judge` | Factuality (LLM judge) |
| HealthBench | `nel eval run --bench healthbench` | `needs_judge` | Health (LLM judge) |
| PinchBench | `nel eval run --bench pinchbench` | `code_sandbox` / `needs_judge` | Agentic tasks (code/LLM judge) |
| XSTest | `nel eval run --bench xstest` | `needs_judge` | Safety |
| SWE-bench Verified | `nel eval run --bench swebench-verified` | `swebench_score` | Software engineering (Docker) |
| SWE-bench Multilingual | `nel eval run --bench swebench-multilingual` | `swebench_score` | Software engineering, multi-lang (Docker) |

## Extended Environments

Beyond the 15 built-in benchmarks, NEL resolves additional environment types via URI schemes and namespace prefixes:

| Syntax | Source | Example |
| --- | --- | --- |
| `nel eval run --bench <name>` | Built-in registry | `nel eval run --bench mmlu` |
| `nel eval run --bench lm-eval://<task>` | lm-evaluation-harness | `nel eval run --bench lm-eval://aime25` |
| `nel eval run --bench skills://<name>` | NeMo Skills | `nel eval run --bench skills://mmlu-pro` |
| `nel eval run --bench vlmevalkit://<dataset>` | VLMEvalKit | `nel eval run --bench vlmevalkit://MMBench_DEV_EN` |
| `nel eval run --bench gym://<host:port>` | Remote Gym server | `nel eval run --bench gym://localhost:9090` |
| `nel eval run --bench container://<image>#<task>` | Legacy container | `nel eval run --bench container://nvcr.io/image#task` |

## Benchmark Details

### MMLU

Massive Multitask Language Understanding – 14K 4-choice questions across 57 subjects.

- Dataset: `hf://cais/mmlu?config=all&split=test`
- Scorer: `multichoice_regex` – extracts the answer letter (A-D) from an “Answer: X” pattern (sketch below)
- prepare_row: Unpacks the `choices` list into A, B, C, D fields; maps the numeric answer index to its letter
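
The extraction step could look roughly like this; the pattern is an assumption based on the description above, not the scorer's actual regex:

```python
import re

# Illustrative only: the real multichoice_regex scorer lives in
# src/nemo_evaluator/benchmarks/.
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-D])\)?", re.IGNORECASE)

def score_multichoice(response: str, target_letter: str) -> bool:
    # Use the last "Answer: X" occurrence so chain-of-thought text
    # earlier in the response does not win.
    matches = ANSWER_RE.findall(response)
    return bool(matches) and matches[-1].upper() == target_letter.upper()
```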

### MMLU-Pro

10-choice variant of MMLU with harder distractors.

- Dataset: `hf://TIGER-Lab/MMLU-Pro?split=test`
- Scorer: `multichoice_regex` with extended pattern `[A-J]`
- prepare_row: Pads choices to 10 slots (sketch below)
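
A plausible padding step, assuming unused slots receive a placeholder (the filler text here is invented, not NEL's value):

```python
import string

def pad_choices(choices: list[str], width: int = 10) -> dict[str, str]:
    # Map up to 10 options onto letters A-J; "N/A" filler is an assumption.
    letters = string.ascii_uppercase[:width]
    padded = list(choices) + ["N/A"] * (width - len(choices))
    return dict(zip(letters, padded))
```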

### MATH-500

500 competition-level math problems.

- Dataset: `hf://HuggingFaceH4/MATH-500?split=test`
- Scorer: `answer_line` – extracts the answer after “Answer:” and normalizes it (sketch below)
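
A rough sketch of the extract-and-normalize flow; the actual normalization rules (e.g. LaTeX handling) are not documented here, so this is an assumption:

```python
import re

def answer_line(response: str) -> str:
    # Take the text after the last "Answer:" marker, then apply light
    # normalization. The real scorer's math normalization is likely
    # more involved than this.
    matches = re.findall(r"Answer:\s*(.+)", response)
    raw = matches[-1] if matches else ""
    return raw.strip().rstrip(".").replace(" ", "").lower()

def answer_line_match(response: str, target: str) -> bool:
    # Normalize both sides through the same path before comparing.
    return answer_line(response) == answer_line(f"Answer: {target}")
```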

### GPQA Diamond

Graduate-level science questions with shuffled answer choices.

- Dataset: `hf://Idavidrein/gpqa?config=gpqa_diamond&split=train`
- Scorer: `multichoice_regex`
- prepare_row: Shuffles choices with a seeded RNG to prevent position bias (sketch below)
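
The seeded shuffle could be implemented along these lines (a sketch; how NEL derives the seed is an assumption):

```python
import random

def shuffle_choices(choices: list[str], correct_idx: int, seed: int):
    # Deterministic per-seed permutation so reruns score identically,
    # while the correct answer no longer sits in a fixed slot.
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(correct_idx)  # new index of the answer
```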

### GSM8K

1,319 grade-school math problems requiring multi-step reasoning.

- Dataset: `hf://openai/gsm8k?split=test`
- Scorer: `numeric_match` – extracts the last number from the response (sketch below)
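
A minimal sketch of last-number extraction as described; the real scorer's number grammar may differ:

```python
import re

NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def numeric_match(response: str, target: str) -> bool:
    # Compare the last number appearing in the response against the
    # reference answer; comma separators are stripped first.
    numbers = NUMBER_RE.findall(response)
    if not numbers:
        return False
    return float(numbers[-1].replace(",", "")) == float(target.replace(",", ""))
```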

### DROP

Reading comprehension with discrete reasoning (counting, sorting, arithmetic).

- Dataset: `hf://ucinlp/drop?split=validation`
- Scorer: `fuzzy_match` with answer aliases (see the sketch under TriviaQA below)

### MGSM

Multilingual GSM8K – math problems in 10 languages.

- Dataset: `hf://juletxara/mgsm?split=test`
- Scorer: `numeric_match`

### TriviaQA

Trivia questions with multiple acceptable answer aliases.

- Dataset: `hf://trivia_qa?config=rc.nocontext&split=validation`
- Scorer: `fuzzy_match` – normalized substring matching against aliases (sketch below)
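
The alias matching might look like this; the exact normalization (articles, punctuation) is an assumption:

```python
import re
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation and English articles, collapse spaces.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def fuzzy_match(response: str, aliases: list[str]) -> bool:
    # Correct if any normalized alias occurs inside the normalized response.
    norm = normalize(response)
    return any(normalize(alias) in norm for alias in aliases if alias.strip())
```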

### HumanEval

164 Python function-completion problems with test suites.

- Dataset: `hf://openai/openai_humaneval?split=test`
- Scorer: `code_sandbox` – extracts code from markdown fences, runs it in Docker with network isolation, memory limits, and timeouts (sketch below)
- Requires: Docker daemon
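
A stripped-down sketch of that flow. The container image, memory limit, and stdin piping are assumptions; NEL's actual sandbox is certainly more careful:

```python
import re
import subprocess

FENCE_RE = re.compile(r"```(?:python)?\s*\n(.*?)```", re.DOTALL)

def run_in_sandbox(response: str, test_code: str, timeout_s: int = 30) -> bool:
    # Pull code out of the first markdown fence, append the tests,
    # and run everything in an isolated container.
    match = FENCE_RE.search(response)
    if not match:
        return False
    program = match.group(1) + "\n" + test_code
    try:
        result = subprocess.run(
            ["docker", "run", "--rm", "--network=none", "--memory=256m",
             "-i", "python:3.11-slim", "python", "-"],
            input=program.encode(),
            capture_output=True,
            timeout=timeout_s,  # kill runaway solutions
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```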

### SimpleQA

Short-form factuality questions requiring LLM-as-judge scoring.

- Dataset: `hf://basicv8vc/SimpleQA?split=test`
- Scorer: `needs_judge` – flags samples for post-processing by the judge pipeline (sketch below)
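
Conceptually, a `needs_judge` scorer defers rather than scores; the field names in this sketch are invented:

```python
def needs_judge(response: str, target: str) -> dict:
    # Sketch only: return no local score, just mark the sample so the
    # judge pipeline grades it in a later pass. Keys are illustrative.
    return {
        "score": None,
        "needs_judge": True,
        "judge_input": {"response": response, "reference": target},
    }
```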

### HealthBench

Medical accuracy questions requiring LLM-as-judge scoring.

- Dataset: `hf://openai/HealthBench?split=test`
- Scorer: `needs_judge`

## Adding a New Benchmark

See the Write Your Own Benchmark (BYOB) tutorial for a complete walkthrough of the `@benchmark` + `@scorer` API.