NeMo Evaluator Documentation

LLM evaluation framework: benchmark environments, pluggable solvers, multi-format reporting.

Get Started

Installation

Install from source and configure extras (scoring, skills, harbor, proxy).

Quickstart

Run your first evaluation in under 5 minutes.

Everything is an Environment. Built-in benchmarks, NeMo Skills, Gym remotes, lm-eval tasks, and VLMEvalKit datasets all resolve through one registry.
@benchmark + @scorer. Define a complete benchmark in under 10 lines of Python.
Pluggable solvers. simple, harbor, tool_calling, gym_delegation, openclaw: swap the inference strategy per benchmark via config.
Cluster backends. Run locally, in Docker, or on SLURM clusters with automatic model deployment.
Resilient suites. Per-benchmark checkpointing with failure isolation, so one failing benchmark does not abort the rest of the suite. Resume partially completed suites with --resume.
Statistical regression. Compare runs with McNemar’s exact test, paired flip analysis, and confidence intervals. Gate releases across benchmark suites with per-benchmark policy thresholds.
15 built-in benchmarks. MMLU, MMLU-Pro, MATH-500, GPQA, GSM8K, DROP, MGSM, TriviaQA, HumanEval, SimpleQA, HealthBench, PinchBench, XSTest, SWE-bench Verified, SWE-bench Multilingual.
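
To make the statistical regression item above concrete, here is a minimal sketch of a two-sided exact McNemar test on paired per-sample correctness, together with the flip counts it is built on. It is a standalone illustration using only the Python standard library. The function name `mcnemar_exact` and the boolean-list input format are assumptions made for this example, not NeMo Evaluator's actual API.

```python
from math import comb


def mcnemar_exact(baseline, candidate):
    """Two-sided exact McNemar test on paired boolean outcomes.

    Hypothetical helper for illustration only. `baseline` and `candidate`
    hold per-sample correctness for the same samples in the same order.
    Returns (regressions, improvements, p_value).
    """
    if len(baseline) != len(candidate):
        raise ValueError("runs must score the same samples")

    # Paired flips: only samples whose outcome changed carry information.
    regressions = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    improvements = sum(1 for b, c in zip(baseline, candidate) if not b and c)

    n = regressions + improvements
    if n == 0:
        # No discordant pairs: no evidence of any difference.
        return regressions, improvements, 1.0

    # Under the null hypothesis each flip is equally likely in either
    # direction, so the smaller flip count follows Binomial(n, 0.5).
    k = min(regressions, improvements)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    p_value = min(1.0, 2 * tail)
    return regressions, improvements, p_value


# Example: 14 shared samples, 8 regressions, 1 improvement, 5 unchanged.
baseline_run = [True] * 8 + [False] * 1 + [True] * 5
candidate_run = [False] * 8 + [True] * 1 + [True] * 5
print(mcnemar_exact(baseline_run, candidate_run))  # (8, 1, 0.0390625)
```

The test ignores samples that both runs got right or both got wrong, which is why it suits comparing two runs on the same benchmark. A release gate could compare the returned p-value and the regression count against a per-benchmark threshold.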