About Evaluation#

Evaluate LLMs, VLMs, agentic systems, and retrieval models across 100+ benchmarks using unified workflows.

Before You Start#

Before you run evaluations, ensure you have:

  1. Chosen your approach: See Get Started for installation and setup guidance

  2. Deployed your model: See Serve and Deploy Models for deployment options

  3. OpenAI-compatible endpoint: Your model must expose a compatible API (a quick compatibility check is sketched after this list)

  4. API credentials: Access tokens for your model endpoint

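To confirm items 3 and 4 before launching a full run, the sketch below sends a single request using only the Python standard library. The URL, model ID, and the YOUR_API_KEY environment variable mirror the placeholders used in the examples later on this page; substitute the values for your own deployment.

# Sanity check: does the endpoint speak the OpenAI chat completions protocol,
# and is the API key accepted? All identifiers here are placeholders.
import json
import os
import urllib.request

url = "https://integrate.api.nvidia.com/v1/chat/completions"
payload = {
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 16,
}
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ['YOUR_API_KEY']}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(request) as response:
    print(json.load(response)["choices"][0]["message"]["content"])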

Quick Start: Academic Benchmarks#

For researchers and data scientists: the fastest path to evaluate your model on standard academic benchmarks, in three steps.

Step 1: Choose Your Approach

  • Launcher CLI (Recommended): nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct

  • Python API: Direct programmatic control with the evaluate() function

Step 2: Select Benchmarks

Common academic suites:

  • Language Understanding: mmlu_pro, arc_challenge, hellaswag, truthfulqa

  • Mathematical Reasoning: gsm8k, math

  • Instruction Following: ifeval

  • Scientific Reasoning: gpqa_diamond

Discover all available tasks:

nv-eval ls tasks

Step 3: Run Evaluation

Using Launcher CLI:

nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.api_key=${YOUR_API_KEY}

Using Python API:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)

# Configure the evaluation task and its generation parameters
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,      # Start with subset
        temperature=0.01,       # Near-deterministic
        max_new_tokens=512,
        parallelism=4
    )
)

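# Describe the OpenAI-compatible endpoint and model under evaluation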
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.CHAT,
        api_key="YOUR_API_KEY"
    )
)

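# Run the evaluation; results are written to the output_dir configured above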
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
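
The snippet above runs a single benchmark. To mirror the three-task launcher command, one approach (a sketch that reuses the target_config and parameters defined above, not a prescribed pattern) is to loop over task names and call evaluate() once per benchmark, giving each its own output directory:

for task in ["mmlu_pro", "gsm8k", "arc_challenge"]:
    cfg = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}",  # keep per-benchmark results separate
        params=ConfigParams(
            limit_samples=100,
            temperature=0.01,
            max_new_tokens=512,
            parallelism=4,
        ),
    )
    evaluate(eval_cfg=cfg, target_cfg=target_config)

Each run then leaves its artifacts under its own output_dir, which keeps per-benchmark results easy to compare or export later.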

Next Steps: Continue with the evaluation workflows, configuration guides, and advanced features described below.


Evaluation Workflows#

Select a workflow based on your environment and desired level of control.

  • Run Evaluations: Step-by-step guides for different evaluation scenarios using launcher, core API, and container workflows.

  • Launcher Workflows: Unified CLI for running evaluations across local, Slurm, and cloud backends with built-in result export. See NeMo Evaluator Launcher Quickstart.

  • Core API Workflows: Programmatic evaluation using the Python API for integration into ML pipelines and custom workflows. See Python API.

  • Container Workflows: Direct container access for specialized use cases and custom evaluation environments.

Configuration and Customization#

Configure your evaluations, create custom tasks, explore benchmarks, and extend the framework with these guides.

  • Configuration Parameters: Comprehensive reference for evaluation configuration parameters, optimization patterns, and framework-specific settings. See Evaluation Configuration Parameters.

  • Custom Task Configuration: Learn how to configure evaluations for tasks without pre-defined configurations using custom benchmark definitions. See Custom Task Evaluation.

  • Benchmark Catalog: Explore 100+ available benchmarks across 18 evaluation harnesses and their specific use cases.

  • Extend Framework: Add custom evaluation frameworks using Framework Definition Files for specialized benchmarks. See Framework Definition File (FDF).

Advanced Features#

Scale your evaluations, export results, customize adapters, and resolve issues with these advanced features.

  • Multi-Backend Execution: Run evaluations on local machines, HPC clusters, or cloud platforms with unified configuration. See Executors.

  • Result Export: Export evaluation results to MLflow, Weights & Biases, Google Sheets, and other platforms. See Exporters.

  • Adapter System: Configure request/response processing, logging, caching, and custom interceptors. See Interceptors.

  • Troubleshooting: Resolve common evaluation issues, debug configuration problems, and optimize evaluation performance.

Core Evaluation Concepts#