NeMo Evaluator Documentation#
LLM evaluation framework: benchmark environments, pluggable solvers, multi-format reporting.
Get Started#
Install from source and configure extras (scoring, skills, harbor, proxy).
Run your first evaluation in under 5 minutes.
Features#
Everything is an Environment. Built-in benchmarks, NeMo Skills, Gym remotes, lm-eval tasks, and VLMEvalKit datasets all resolve through one registry (a conceptual sketch follows this list).
@benchmark + @scorer. Define a complete benchmark in under 10 lines of Python (see the sketch below).
Pluggable solvers. simple, harbor, tool_calling, gym_delegation, openclaw: swap the inference strategy per benchmark via config (illustrated below).
Cluster backends. Run locally, in Docker, or on SLURM clusters with automatic model deployment.
Resilient suites. Per-benchmark checkpointing with failure isolation. Resume partially completed suites with --resume (see the checkpointing sketch below).
Statistical regression. Compare runs with McNemar’s exact test, paired flip analysis, and confidence intervals. Gate releases across benchmark suites with per-benchmark policy thresholds.
15 built-in benchmarks. MMLU, MMLU-Pro, MATH-500, GPQA, GSM8K, DROP, MGSM, TriviaQA, HumanEval, SimpleQA, HealthBench, PinchBench, XSTest, SWE-bench Verified, SWE-bench Multilingual.
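To make the single-registry idea concrete, here is a minimal conceptual sketch. Nothing below is the library's actual code; the names register and resolve are illustrative assumptions.

```python
from typing import Callable, Dict

# Conceptual sketch of a single environment registry (not the actual
# internals): every source registers a factory under a name, and every
# lookup, whatever the backend, goes through one resolve() call.
_REGISTRY: Dict[str, Callable[[], object]] = {}

def register(name: str, factory: Callable[[], object]) -> None:
    _REGISTRY[name] = factory

def resolve(name: str) -> object:
    # Built-in, NeMo Skills, Gym-remote, lm-eval, and VLMEvalKit
    # environments all come back through this one lookup.
    return _REGISTRY[name]()
```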
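The @benchmark and @scorer decorators are named in the feature list above; the import path, decorator arguments, and data shape in this sketch are assumptions for illustration, not the published API.

```python
from nemo_evaluator import benchmark, scorer  # hypothetical import path

@scorer
def exact_match(response: str, target: str) -> float:
    # Score 1.0 when the model's answer matches the reference exactly.
    return float(response.strip() == target.strip())

@benchmark(name="capitals", scorer=exact_match)
def capitals():
    # Yield (prompt, target) pairs; the framework runs inference and scoring.
    yield "What is the capital of France?", "Paris"
    yield "What is the capital of Japan?", "Tokyo"
```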
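Swapping solvers per benchmark might look like the hypothetical config below, shown as a Python dict. The solver names come from the feature list above; the schema and benchmark keys are assumptions (real configs may well be YAML).

```python
# Hypothetical suite config: one solver choice per benchmark.
suite_config = {
    "benchmarks": {
        "gsm8k": {"solver": "simple"},
        "swe_bench_verified": {"solver": "harbor"},
        "healthbench": {"solver": "tool_calling"},
    },
}
```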
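And a conceptual sketch of per-benchmark checkpointing with failure isolation, the behavior --resume relies on. The persistence format and skip logic here are assumptions, not the framework's implementation.

```python
import json
from pathlib import Path

def run_suite(benchmarks, run_fn, out_dir="results"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for name in benchmarks:
        marker = out / f"{name}.json"
        if marker.exists():          # already checkpointed: resume skips it
            continue
        try:
            result = run_fn(name)    # one benchmark failing...
        except Exception as err:     # ...does not abort the rest of the suite
            print(f"{name} failed: {err}")
            continue
        marker.write_text(json.dumps(result))
```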
Tutorials#
Guided tour of all main features through real config examples — start here.
Define a complete benchmark with @benchmark + @scorer in under 10 lines.
Serve benchmarks for NeMo Gym training and consume remote Gym environments.
Use NeMo Skills benchmarks with full per-request observability.
Scale to thousands of problems with SLURM, Kubernetes, Ray, or manual sharding (sketched after this list).
Diagnose what changed between two runs of the same benchmark with nel compare; the test behind it is sketched below.
Turn benchmark thresholds into a suite-level GO / NO-GO / INCONCLUSIVE decision with nel gate (see the gate sketch below).
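For the manual-sharding option, a minimal sketch: a strided split keeps shards balanced to within one problem. The shard helper is hypothetical, not part of the CLI.

```python
def shard(problems, num_shards, shard_id):
    # Worker shard_id takes every num_shards-th problem starting at its index.
    return problems[shard_id::num_shards]

# e.g. worker 3 of 8:
# my_problems = shard(all_problems, num_shards=8, shard_id=3)
```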
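The statistics behind nel compare can be reproduced directly: only discordant items (the paired flips) carry information, and McNemar's exact test is a two-sided binomial test on those flips. The sketch below uses scipy.stats.binomtest and toy data; it is the named test computed from scratch, not the tool's own code.

```python
from scipy.stats import binomtest

run_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 1 = correct on run A
run_b = [1, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # same items, run B

a_only = sum(a and not b for a, b in zip(run_a, run_b))  # A-correct, B-wrong
b_only = sum(b and not a for a, b in zip(run_a, run_b))  # B-correct, A-wrong

# Under H0 (no real change), each discordant item flips either way with
# p = 0.5, so the exact test is a two-sided binomial test on the flips.
result = binomtest(a_only, a_only + b_only, p=0.5)
print(f"{a_only} vs {b_only} flips, p = {result.pvalue:.3f}")
```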
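Finally, a hypothetical sketch of the suite-level rollup nel gate describes: per-benchmark thresholds become a single GO / NO-GO / INCONCLUSIVE verdict. The margin, thresholds, and rollup rule here are illustrative assumptions, not the actual policy schema.

```python
def gate(scores: dict, thresholds: dict, margin: float = 0.01) -> str:
    verdicts = []
    for name, threshold in thresholds.items():
        if name not in scores:
            return "INCONCLUSIVE"            # missing result: cannot decide
        diff = scores[name] - threshold
        if diff < -margin:
            return "NO-GO"                   # any clear regression blocks release
        verdicts.append(diff >= margin)
    # Scores inside the margin are too close to call.
    return "GO" if all(verdicts) else "INCONCLUSIVE"

print(gate({"mmlu": 0.71, "gsm8k": 0.88}, {"mmlu": 0.70, "gsm8k": 0.85}))  # GO
```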
Architecture & Deployment#
How the system works: environments, solvers, execution modes, and observability.
Deploy on SLURM, Docker, Kubernetes, Ray, and CI/CD pipelines.
All 15 built-in benchmarks with scoring details.
Python API and CLI reference.