NeMo Evaluator Documentation

LLM evaluation framework: benchmark environments, pluggable solvers, multi-format reporting.


Get Started

Installation

Install from source and configure extras (scoring, skills, harbor, proxy).

Quickstart

Run your first evaluation in under 5 minutes.


Features

  • Everything is an Environment. Built-in benchmarks, NeMo Skills, Gym remotes, lm-eval tasks, and VLMEvalKit datasets all resolve through one registry.

  • @benchmark + @scorer. Define a complete benchmark in under 10 lines of Python.

  • Pluggable solvers. Swap the inference strategy per benchmark via config: simple, harbor, tool_calling, gym_delegation, or openclaw.

  • Cluster backends. Run locally, in Docker, or on SLURM clusters with automatic model deployment.

  • Resilient suites. Per-benchmark checkpointing with failure isolation. Resume partially completed suites with --resume.

  • Statistical regression. Compare runs with McNemar’s exact test, paired flip analysis, and confidence intervals. Gate releases across benchmark suites with per-benchmark policy thresholds.

  • 15 built-in benchmarks. MMLU, MMLU-Pro, MATH-500, GPQA, GSM8K, DROP, MGSM, TriviaQA, HumanEval, SimpleQA, HealthBench, PinchBench, XSTest, SWE-bench Verified, SWE-bench Multilingual.
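
The statistical regression feature above compares two runs item by item. As a rough illustration of the idea only (not NeMo Evaluator's implementation or API), the sketch below counts paired flips and computes a two-sided exact McNemar p-value from two lists of per-item correctness flags. The function name mcnemar_exact and the input format are assumptions made for this example.

```python
from math import comb


def mcnemar_exact(baseline, candidate):
    """Two-sided exact McNemar test on paired per-item correctness flags.

    Hypothetical helper for illustration; inputs are equal-length sequences
    of truthy/falsy values, one entry per benchmark item, in the same order.
    Returns (regressions, improvements, p_value).
    """
    if len(baseline) != len(candidate):
        raise ValueError("both runs must score the same items in the same order")

    # Discordant pairs ("flips"): items where exactly one run was correct.
    regressions = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    improvements = sum(1 for b, c in zip(baseline, candidate) if c and not b)
    n = regressions + improvements
    if n == 0:
        # No flips at all: the runs are indistinguishable on these items.
        return regressions, improvements, 1.0

    # Under the null hypothesis each flip is equally likely to go either way,
    # so the smaller flip count follows Binomial(n, 0.5).
    k = min(regressions, improvements)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)  # double the one-sided tail, capped at 1
    return regressions, improvements, p_value


# Ten paired items: 1 = correct, 0 = incorrect.
baseline = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
candidate = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(mcnemar_exact(baseline, candidate))  # (1, 5, 0.21875)
```

Only the items that flipped between runs enter the test. Items that both runs got right, or both got wrong, carry no information about which run is better, which is why a large accuracy gap on a handful of flips can still be statistically inconclusive.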

Tutorials

Interactive Walkthrough

Guided tour of the main features through real config examples. Start here.

Write Your Own Benchmark (BYOB)

Define a complete benchmark with @benchmark + @scorer in under 10 lines.

Gym Integration

Serve benchmarks for NeMo Gym training and consume remote Gym environments.

NeMo Skills Integration

Use NeMo Skills benchmarks with full per-request observability.

Distributed Evaluation

Scale to thousands of problems with SLURM, Kubernetes, Ray, or manual sharding.

Compare Runs

Diagnose what changed between two runs of the same benchmark with nel compare.

Quality Gates

Turn benchmark thresholds into a suite-level GO / NO-GO / INCONCLUSIVE decision with nel gate.


Architecture & Deployment

Architecture

How the system works: environments, solvers, execution modes, and observability.

Deployment Guide

Deploy on SLURM, Docker, Kubernetes, Ray, and CI/CD pipelines.

Benchmarks

All 15 built-in benchmarks with scoring details.

API Reference

Python API and CLI reference.
