About Evaluation#
Evaluate LLMs, VLMs, agentic systems, and retrieval models across 100+ benchmarks using unified workflows.
Before You Start#
Before you run evaluations, ensure you have:
- Chosen your approach: See Get Started for installation and setup guidance.
- Deployed your model: See Serve and Deploy Models for deployment options.
- OpenAI-compatible endpoint: Your model must expose an OpenAI-compatible API (see Testing Endpoint Compatibility; a quick smoke test follows this list).
- API credentials: Access tokens for your model endpoint and the Hugging Face Hub.
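If you are unsure whether an endpoint is OpenAI-compatible, a quick smoke test is to send a single chat completion request and check for a well-formed response. The sketch below uses the `requests` library with the endpoint URL and model ID from the Quick Start example further down this page; substitute your own values and credential variable.

```python
import os

import requests

# Quick compatibility check: send one chat completion request to the endpoint.
# The URL, model ID, and NGC_API_KEY variable mirror the Quick Start example;
# replace them with your own deployment's values.
url = "https://integrate.api.nvidia.com/v1/chat/completions"
payload = {
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
    "max_tokens": 8,
}
headers = {"Authorization": f"Bearer {os.environ['NGC_API_KEY']}"}

response = requests.post(url, json=payload, headers=headers, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```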
Quick Start: Academic Benchmarks#
Fastest path to evaluate academic benchmarks
For researchers and data scientists: Evaluate your model on standard academic benchmarks in 3 steps.
Step 1: Choose Your Approach
- Launcher CLI (Recommended):

  ```bash
  nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_llama_3_1_8b_instruct.yaml
  ```

- Python API: Direct programmatic control with the `evaluate()` function (see the sketch after this list).
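For the Python API path, the snippet below is a minimal sketch of a programmatic run. The import paths, dataclass names, and keyword arguments are assumptions about the `nemo-evaluator` package rather than verified signatures; check the Python API reference for the exact interface in your installed version.

```python
# Minimal sketch of a programmatic run (module paths, class names, and
# keyword arguments are assumptions; see the Python API reference for the
# authoritative signatures).
from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    EvaluationConfig,
    EvaluationTarget,
)

target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type="chat",  # assumed endpoint type identifier
    )
)
config = EvaluationConfig(type="mmlu_pro", output_dir="results")

# Run the benchmark against the endpoint and write results to output_dir.
evaluate(target_cfg=target, eval_cfg=config)
```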
Step 2: Select Benchmarks
Common academic suites:
- General Knowledge: `mmlu_pro`, `gpqa_diamond`
- Mathematical Reasoning: `AIME_2025`, `mgsm`
- Instruction Following: `ifbench`, `mtbench`
Discover all available tasks:

```bash
nemo-evaluator-launcher ls tasks
```
Step 3: Run Evaluation
Create `config.yml`:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: mmlu_pro
    - name: ifbench
```
Launch the job:

```bash
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
  --config ./config.yml \
  -o execution.output_dir=results \
  -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
  -o +target.api_endpoint.api_key_name=NGC_API_KEY
```
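After the job completes, the configured `execution.output_dir` (here `results`) holds the run artifacts. The exact layout depends on the task and harness, so the sketch below simply walks the directory to show what was produced rather than assuming a particular results schema.

```python
from pathlib import Path

# Enumerate whatever the launcher wrote under the configured output_dir
# ("results" in the example above); the layout varies by task and harness.
output_dir = Path("results")
for path in sorted(output_dir.rglob("*")):
    if path.is_file():
        size_kb = path.stat().st_size / 1024
        print(f"{path.relative_to(output_dir)}  ({size_kb:.1f} KiB)")
```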
Evaluation Workflows#
Select a workflow based on your environment and desired level of control.
- Unified CLI for running evaluations across local, Slurm, and cloud backends with built-in result export.
- Programmatic evaluation using the Python API for integration into ML pipelines and custom workflows.
- Direct container access for specialized use cases and custom evaluation environments.
Configuration and Customization#
Configure your evaluations, create custom tasks, explore benchmarks, and extend the framework with these guides.
- Explore 100+ available benchmarks across 18 evaluation harnesses and their specific use cases.
- Add custom evaluation frameworks using Framework Definition Files for specialized benchmarks.
Advanced Features#
Scale your evaluations, export results, customize adapters, and resolve issues with these advanced features.
- Run evaluations on local machines, HPC clusters, or cloud platforms with unified configuration.
- Export evaluation results to MLflow, Weights & Biases, Google Sheets, and other platforms.
- Configure request/response processing, logging, caching, and custom interceptors.
Core Evaluation Concepts#
For architectural details and core concepts, refer to Evaluation Model.
For container specifications, refer to NeMo Evaluator Containers.