About Evaluation#
Evaluate LLMs, VLMs, agentic systems, and retrieval models across 100+ benchmarks using unified workflows.
Before You Start#
Before you run evaluations, ensure you have:
Chosen your approach: See Get Started for installation and setup guidance
Deployed your model: See Serve and Deploy Models for deployment options
OpenAI-compatible endpoint: Your model must expose an OpenAI-compatible API (a quick connectivity check is sketched after this list)
API credentials: Access tokens for your model endpoint
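If you want to confirm that your endpoint is reachable before launching an evaluation, a single chat completion request is usually enough. The sketch below uses the requests library against the example endpoint and model shown later on this page; the MY_API_KEY environment variable name and the test prompt are placeholders to replace with your own values.

```python
# Minimal sanity check for an OpenAI-compatible chat endpoint (illustrative sketch).
# Replace the URL, model ID, and API key source with the values for your deployment.
import os

import requests

ENDPOINT_URL = "https://integrate.api.nvidia.com/v1/chat/completions"  # example endpoint from this page
MODEL_ID = "meta/llama-3.1-8b-instruct"                                # example model from this page
API_KEY = os.environ["MY_API_KEY"]                                     # placeholder env var name

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "max_tokens": 8,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```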
Quick Start: Academic Benchmarks#
Fastest path to evaluating academic benchmarks
For researchers and data scientists: Evaluate your model on standard academic benchmarks in 3 steps.
Step 1: Choose Your Approach
Launcher CLI (Recommended):
nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name local_llama_3_1_8b_instruct
Python API: Direct programmatic control with the evaluate() function
Step 2: Select Benchmarks
Common academic suites:
Language Understanding: mmlu_pro, arc_challenge, hellaswag, truthfulqa
Mathematical Reasoning: gsm8k, math
Instruction Following and Expert Reasoning: ifeval, gpqa_diamond
Discover all available tasks:
nemo-evaluator-launcher ls tasks
Step 3: Run Evaluation
Using Launcher CLI:
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.api_key=${YOUR_API_KEY}
Using Python API:
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig, EvaluationTarget, ApiEndpoint, ConfigParams, EndpointType
)

# Configure and run
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,   # Start with subset
        temperature=0.01,    # Near-deterministic
        max_new_tokens=512,
        parallelism=4
    )
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.CHAT,
        api_key="YOUR_API_KEY"
    )
)

result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
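To mirror the multi-benchmark CLI example from Python, one option is to reuse target_config and loop over task names, giving each run its own output directory. This is an illustrative sketch built only from the classes shown above; the task list and per-task output layout are assumptions you can adjust.

```python
# Illustrative sketch: run several benchmarks against the same endpoint by
# reusing target_config from the example above and varying the task type.
tasks = ["mmlu_pro", "gsm8k", "arc_challenge"]  # assumed task selection

results = {}
for task in tasks:
    cfg = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}",  # one sub-directory per benchmark (assumed layout)
        params=ConfigParams(
            limit_samples=100,
            temperature=0.01,
            max_new_tokens=512,
            parallelism=4,
        ),
    )
    results[task] = evaluate(eval_cfg=cfg, target_cfg=target_config)
```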
Next Steps:
Text Generation Evaluation - Complete text generation guide
Evaluation Configuration Parameters - Optimize configuration parameters
Benchmark Catalog - Explore all available benchmarks
Evaluation Workflows#
Select a workflow based on your environment and desired level of control.
Step-by-step guides cover the launcher, core API, and container workflows:
Launcher CLI: Unified CLI for running evaluations across local, Slurm, and cloud backends with built-in result export.
Python API: Programmatic evaluation for integration into ML pipelines and custom workflows.
Containers: Direct container access for specialized use cases and custom evaluation environments.
Configuration and Customization#
Configure your evaluations, create custom tasks, explore benchmarks, and extend the framework with these guides.
Evaluation configuration parameters: Comprehensive reference for configuration parameters, optimization patterns, and framework-specific settings.
Custom task configuration: Learn how to configure evaluations for tasks without pre-defined configurations using custom benchmark definitions.
Benchmark catalog: Explore 100+ available benchmarks across 18 evaluation harnesses and their specific use cases.
Framework extensions: Add custom evaluation frameworks using Framework Definition Files for specialized benchmarks.
Advanced Features#
Scale your evaluations, export results, customize adapters, and resolve issues with these advanced features.
Scale evaluations: Run evaluations on local machines, HPC clusters, or cloud platforms with unified configuration.
Export results: Export evaluation results to MLflow, Weights & Biases, Google Sheets, and other platforms.
Customize adapters: Configure request/response processing, logging, caching, and custom interceptors.
Troubleshoot: Resolve common evaluation issues, debug configuration problems, and optimize evaluation performance.
Core Evaluation Concepts#
For architectural details and core concepts, refer to Evaluation Model.
For container specifications, refer to NeMo Evaluator Containers.