About Evaluating#

NVIDIA NeMo Evaluator is part of the NVIDIA NeMo software suite for managing the AI agent lifecycle. Use NVIDIA NeMo Evaluator to run academic benchmarks, custom automated evaluations, and LLM-as-a-Judge evaluations on your large language models. You can also assess retriever and retrieval-augmented generation (RAG) pipelines.

NeMo Evaluator Workflow#

At a high level, the evaluation workflow consists of the following steps:

  1. Determine if your evaluation requires a custom dataset.

    Tip

    Refer to the manage entities tutorials for step-by-step instructions.

  2. Run an evaluation job by submitting a request to NeMo Evaluator.

  3. Get your results. (A minimal API sketch of steps 2 and 3 follows this list.)
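In practice, steps 2 and 3 map to HTTP calls against the Evaluator API. The following Python sketch is illustrative only: the base URL, the target and config names, and the exact endpoint paths and response fields are assumptions, so confirm them against the Evaluator API reference and the Job JSON Schema Reference before use.

```python
import os
import time

import requests

# Assumed base URL for the NeMo Evaluator microservice; replace with your deployment's URL.
EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")

# Step 2: submit an evaluation job. The target and config names below are
# placeholders; create them first (see the Targets and Configurations task guides).
job = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs",
    json={
        "namespace": "my-organization",
        "target": "my-organization/my-target",
        "config": "my-organization/my-config",
    },
).json()
job_id = job["id"]  # field name assumed; check the job response schema

# Step 3: poll until the job finishes, then fetch its results.
while True:
    status = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job_id}/status").json()
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)

results = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job_id}/results").json()
print(results)
```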


Installation Options#

You can choose one of the following deployment options or try out the minikube demo in the Get Started section.

Docker Compose

Deploy the NeMo Evaluator microservice using Docker Compose. This is the easiest option for local testing.

NeMo Evaluator Quickstart Using Docker Compose
Parent Helm Chart

Deploy the NeMo Evaluator microservice using the parent Helm chart.

Deploy NeMo Evaluator Using Parent Helm Chart
Full Platform Helm Chart

Install the full NeMo microservices platform.

Install NeMo Microservices as a Platform

Task Guides#

The following guides provide detailed information on how to perform common NeMo Evaluator tasks.

Targets

Create targets for evaluations.

Create and Manage Evaluation Targets
Configurations

Create configurations for evaluations.

References
Jobs

Create and run evaluation jobs.

Run and Manage Evaluation Jobs
Results

Get the results of your evaluation jobs.

Use the Results of Your Job
API Key Authentication

Secure authentication for external services in RAG and retriever evaluations.

API Key Authentication for RAG and Retriever Evaluations

Tutorials#

The following tutorials provide step-by-step instructions to complete specific evaluation goals.

Run an Academic LM Harness Eval

Run an academic LM Harness evaluation flow.

Run an Academic LM Harness Eval
Run an LLM Judge Eval

Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.

Run an LLM Judge Eval
Run an Eval as Part of a Fine-Tuning Workflow

Run evaluations before and after fine-tuning a model within a larger workflow.

Customize and Evaluate Large Language Models

References#

Review API specifications, compatibility guides, and troubleshooting resources to help you effectively use NeMo Evaluator.

Evaluation Flows#

Review the configurations, data formats, and example results for the typical options that each evaluation flow provides.

Academic Benchmarks

Standard benchmarks for code generation, safety, reasoning, and tool-calling.

Academic Benchmarks
Retrieval

Evaluate document retrieval pipelines on standard or custom datasets, reporting metrics such as recall@k and ndcg@k.

Retrieval Evaluation Flow
RAG

Evaluate retrieval-augmented generation (RAG) pipelines (retrieval plus generation), reporting metrics such as recall@k, faithfulness, and answer relevancy.

RAG Evaluation Flow
Agentic

Assess agent-based and multi-step reasoning models with metrics such as topic adherence, tool-call accuracy, and goal accuracy.

Agentic Evaluation Flow
LLM-as-a-Judge

Use another LLM to evaluate outputs with flexible scoring criteria.

LLM-as-a-Judge Evaluation Flow
Template

Create custom prompts, tasks, and metrics using Jinja2 templating.

Template Evaluation Flow
Prompt Optimization

Iteratively improve judge prompts using programmatic search over instructions and examples, with optimizers such as MIPROv2 (Bayesian optimization).

Prompt Optimization Task

Targets#

Set up and manage the different types of evaluation targets. A minimal target-creation sketch follows the cards in this section.

Data Source Targets

Configure evaluation targets using direct data input through rows or datasets for quick evaluations and testing.

Data Source Targets
LLM Model Targets

Set up evaluation targets for LLM models, including NIM endpoints, chat endpoints, and offline pre-generated outputs.

LLM Model Targets
Retriever Pipeline Targets

Configure retriever pipeline targets using embedding models and optional reranking for document retrieval.

Retriever Pipeline Targets
RAG Pipeline Targets

Set up RAG pipeline targets combining retrieval and generation capabilities for comprehensive evaluations.

RAG Pipeline Targets
Target Schema

Reference documentation for the JSON schema used to define evaluation targets.

Target JSON Schema Reference
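As noted above, the following is a minimal sketch of creating an LLM model target over HTTP. The endpoint path, field names (`type`, `model`, `api_endpoint`, `url`, `model_id`), and the example URLs and model ID are assumptions for illustration; confirm the exact schema in the Target JSON Schema Reference.

```python
import requests

EVALUATOR_BASE_URL = "http://nemo.test"  # assumed deployment URL

# Illustrative LLM model target that points at a chat-completions endpoint.
# Field names follow the general shape described in the Target JSON Schema
# Reference; verify them there before use.
target = {
    "type": "model",
    "namespace": "my-organization",
    "name": "my-llm-target",
    "model": {
        "api_endpoint": {
            "url": "http://my-nim.test/v1/chat/completions",  # placeholder endpoint
            "model_id": "meta/llama-3.1-8b-instruct",          # placeholder model
        }
    },
}

response = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/targets", json=target)
response.raise_for_status()
print(response.json())
```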

Configs#

Learn how to create and customize evaluation configurations for various evaluation types. A minimal configuration sketch follows the cards in this section.

Config Schema

Reference documentation for the JSON schema used to define evaluation configurations.

Evaluation Config Schema
Templating Reference

Guide for using Jinja2 templates in custom evaluation tasks and prompts.

Template Evaluation Flow
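As noted above, the following is a minimal sketch of creating an evaluation configuration. The evaluation type name, parameter names, and endpoint path are assumptions for illustration; the Evaluation Config Schema reference documents the supported types and their parameters.

```python
import requests

EVALUATOR_BASE_URL = "http://nemo.test"  # assumed deployment URL

# Illustrative academic-benchmark configuration. The type and params shown
# here are placeholders; check the Evaluation Config Schema reference for the
# supported evaluation types and their parameters.
config = {
    "type": "gsm8k",
    "namespace": "my-organization",
    "name": "my-gsm8k-config",
    "params": {
        "limit_samples": 20,  # evaluate on a small subset while testing
    },
}

response = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/configs", json=config)
response.raise_for_status()
print(response.json())
```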

Jobs#

Understand how to run evaluation jobs, including how to combine targets and configs.

Job Target and Config Matrix

Learn which evaluation targets and configurations can be combined for different evaluation types.

Job Target and Configuration Matrix
Job Durations

View expected evaluation times for different model, hardware, and dataset combinations.

Expected Evaluation Duration
Job Schema

Reference for the JSON structure and fields used when creating evaluation jobs.

Job JSON Schema Reference
API Reference

View the NeMo Evaluator API reference.

Evaluator API
Troubleshooting

Troubleshoot issues that arise when you work with NeMo Evaluator.

Troubleshooting NeMo Evaluator