Run Evaluations#
Follow step-by-step guides for different evaluation scenarios and methodologies in NeMo Evaluation.
Before You Start#
Ensure you have:
Completed the getting started guides: Installation Guide and Quickstart.
Chosen a Model Deployment option:
Launcher-Orchestrated Deployment (recommended)
Reviewed the evaluation parameters available for optimization.
# Core evaluation framework (pre-installed in NeMo container)
pip install nvidia-lm-eval==25.7.1
# Optional harnesses (install as needed)
pip install "nvidia-simple-evals>=25.6" # Baseline/simple evaluations
pip install "nvidia-bigcode-eval>=25.6" # Advanced code evaluation
pip install "nvidia-safety-harness>=25.6" # Safety evaluation
pip install "nvidia-bfcl>=25.6" # Function calling
pip install "nvidia-eval-factory-garak>=25.6" # Security scanning
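To see which of the packages listed above are already present in the current environment, a check along the following lines may help. This is a minimal sketch that uses only the Python standard library; the helper name `installed_versions` is hypothetical and is not part of NeMo Evaluation, and the distribution names are copied from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the pip commands above.
HARNESS_PACKAGES = [
    "nvidia-lm-eval",
    "nvidia-simple-evals",
    "nvidia-bigcode-eval",
    "nvidia-safety-harness",
    "nvidia-bfcl",
    "nvidia-eval-factory-garak",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(HARNESS_PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

The sketch only reports what is installed; it does not compare versions against the minimums given in the install commands.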
Some evaluations require additional authentication:
# Hugging Face token for gated datasets
export HF_TOKEN="your_hf_token"
# NVIDIA Build API key for judge models (safety evaluation)
export JUDGE_API_KEY="your_nvidia_api_key"
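A missing token may only surface after a run has started, so it can be worth checking the environment first. The following is a minimal sketch under stated assumptions: the helper name `missing_credentials` is hypothetical and not part of NeMo Evaluation, and the variable names are the two exported above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("HF_TOKEN", "JUDGE_API_KEY")


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials: " + ", ".join(missing))
    else:
        print("All credentials are set.")
```

Since the judge key is described above as needed for safety evaluation, a run that only reads gated datasets could call `missing_credentials(required=("HF_TOKEN",))` instead.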
Evaluations#
Select an evaluation type to measure capabilities such as text generation, log-probability scoring, code generation, safety and security, and function calling.
Text generation: Measure model performance through natural language generation for academic benchmarks, reasoning tasks, and general knowledge assessment.
Log-probability scoring: Assess model confidence and uncertainty using log-probabilities for multiple-choice scenarios without text generation.
Code generation: Measure programming capabilities through code generation, completion, and algorithmic problem solving.
Safety and security: Test AI safety, alignment, and security vulnerabilities using specialized safety harnesses and probing techniques.
Function calling: Assess tool use capabilities, API calling accuracy, and structured output generation for agent-like behaviors.
Selection Guide#
Use this section to choose recommended evaluations by model type or by use case.
Model Type | Recommended Evaluations |
---|---|
Base Models (Pre-trained) | |
Instruction-Tuned Models | |
Chat Models | |
Use Case | Recommended Evaluations |
---|---|
Academic Research | |
Production Deployment | |
Model Development | |