Run Evaluations#

Follow step-by-step guides for different evaluation scenarios and methodologies in NeMo Evaluation.

Before You Start#

Ensure you have:

  1. Completed the Installation Guide and Quickstart.

  2. Chosen a Model Deployment option.

  3. Reviewed the evaluation parameters available for tuning your runs.

Install the core framework and any optional harnesses you need:

# Core evaluation framework (pre-installed in NeMo container)
pip install nvidia-lm-eval==25.7.1

# Optional harnesses (install as needed)
pip install nvidia-simple-evals>=25.6      # Baseline/simple evaluations
pip install nvidia-bigcode-eval>=25.6      # Advanced code evaluation  
pip install nvidia-safety-harness>=25.6    # Safety evaluation
pip install nvidia-bfcl>=25.6              # Function calling
pip install nvidia-eval-factory-garak>=25.6  # Security scanning
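Once a harness is installed, a typical run points it at an OpenAI-compatible model endpoint. The sketch below assumes the eval-factory CLI that ships with these packages; the eval_type, model identifier, endpoint URL, and key name are illustrative placeholders, and flag names can vary between versions, so confirm them with eval-factory --help:

# Minimal sketch: run one benchmark against an OpenAI-compatible endpoint
# (all values below are placeholders for your own deployment)
eval-factory run_eval \
  --eval_type mmlu_pro \
  --model_id meta/llama-3.1-8b-instruct \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --model_type chat \
  --api_key_name NGC_API_KEY \
  --output_dir ./results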

Some evaluations require additional authentication:

# Hugging Face token for gated datasets
export HF_TOKEN="your_hf_token"

# NVIDIA Build API key for judge models (safety evaluation)
export JUDGE_API_KEY="your_nvidia_api_key"
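Before launching a long run against gated datasets, you can sanity-check the Hugging Face token with the standard whoami-v2 API route; a valid token returns your account details instead of an authorization error:

# Quick token check: expect account JSON, not an auth error
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/api/whoami-v2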

Evaluations#

Select an evaluation type to measure capabilities such as text generation, log-probability scoring, code generation, safety and security, and function calling.

Text Generation

Measure model performance through natural language generation for academic benchmarks, reasoning tasks, and general knowledge assessment.

Text Generation Evaluation
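During these benchmarks, the harness sends generation requests to your deployed endpoint and scores the returned text. A minimal sketch of one such request, assuming a hypothetical OpenAI-compatible chat endpoint on localhost:8000 and a placeholder model name:

# Hypothetical endpoint and model name; substitute your deployment's values
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Question: What is 2 + 2? Answer:"}],
    "max_tokens": 32,
    "temperature": 0
  }'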
Log-Probability

Assess model confidence and uncertainty using log-probabilities for multiple-choice scenarios without text generation.

Log-Probability Evaluation
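Instead of generating an answer, log-probability evaluation scores each multiple-choice option by the likelihood the model assigns to it, then selects the highest-scoring choice. A minimal sketch of scoring one candidate answer, assuming a hypothetical completions endpoint that supports echoing the prompt with per-token logprobs:

# Score a fixed continuation without generating new tokens
# (endpoint and model are placeholders; the server must support
# "echo" together with "logprobs" on the completions route)
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "Q: Is the sky blue? A: yes",
    "max_tokens": 0,
    "echo": true,
    "logprobs": 1
  }'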
Code Generation

Measure programming capabilities through code generation, completion, and algorithmic problem solving.

Code Generation Evaluation

Safety & Security

Test model safety and alignment, and scan for security vulnerabilities, using specialized safety harnesses and probing techniques.

Safety and Security Evaluation

Function Calling

Assess tool use capabilities, API calling accuracy, and structured output generation for agent-like behaviors.

Function Calling Evaluation
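These benchmarks declare one or more tool schemas in the request and check whether the model returns a well-formed call with the right function name and arguments. A minimal sketch, assuming a hypothetical OpenAI-compatible chat endpoint; the get_weather tool and model name are illustrative only:

# The harness compares the returned tool_calls against the expected
# function name and arguments (all names here are placeholders)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'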

Selection Guide#

Use this section to choose recommended evaluations by model type or by use case.

Model Type

Recommended Evaluations

Base Models (Pre-trained)

  • Log-Probability Evaluation for multiple-choice benchmarks that do not require instruction following

  • Text Generation Evaluation with few-shot prompting

Instruction-Tuned Models

  • Text Generation Evaluation with instruction-style prompts

  • Code Generation and Function Calling evaluations for task-specific capabilities

Chat Models

  • All evaluation types with appropriate chat formatting

  • Conversational benchmarks and multi-turn evaluations

Use Case

Recommended Evaluations

Academic Research

  • Standard benchmarks through Text Generation and Log-Probability evaluations for reproducible, comparable results

Production Deployment

  • Safety and Security Evaluation before release

  • Evaluations that match the deployed task, such as Code Generation or Function Calling

Model Development

  • Text Generation Evaluation for general capability assessment

  • Multiple evaluation types for comprehensive analysis

  • Custom benchmarks for specific improvements