About Evaluating#

Evaluation is powered by NVIDIA NeMo Evaluator, a cloud-native microservice for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The service provides automated workflows for over 100 academic benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.

NeMo Evaluator enables evaluations of your LLM application through APIs, helping you refine and optimize models for better performance and real-world applicability. The NeMo Evaluator APIs can be automated within development pipelines, enabling faster iteration without the need for live data. This makes the service cost-effective and well suited for pre-deployment checks and regression testing.

NeMo Evaluator is part of the NVIDIA NeMo™ software suite.

See also

For a comprehensive overview of evaluation concepts, capabilities, and how NeMo Evaluator fits into the NeMo ecosystem, refer to Evaluation Concepts.


Get Started#

To begin using NeMo Evaluator, you need to deploy the microservice. Choose the deployment option that best fits your environment and use case.

Installation Options#

Select one of the following deployment methods based on your requirements.

- Quickstart: Use Docker Compose for local experimentation, development, testing, or lightweight environments. Refer to Deploy NeMo Evaluator with Docker.
- Minikube: Install the full NeMo microservices platform on a minikube Kubernetes cluster on your local machine. Refer to Demo Cluster Setup on Minikube.
- Helm Chart: Deploy the NeMo Evaluator microservice using the Helm chart for production environments. Refer to Deploy NeMo Evaluator Using Helm Chart.

Tutorials#

After deploying NeMo Evaluator, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different flows and techniques.

Tip

The tutorials reference an EVALUATOR_BASE_URL whose value depends on the ingress configuration of your cluster. If you are using the minikube demo installation, it is http://nemo.test, and the demo installation's NIM_PROXY_BASE_URL is http://nim.test. Otherwise, consult your cluster administrator for the ingress values.
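
For example, if you follow the tutorials from Python, you might resolve these values as shown in this minimal sketch. The defaults match the minikube demo installation described above; everything else is your own environment's configuration.

```python
import os

# Base URLs used throughout the tutorials. The defaults below match the
# minikube demo installation; replace them with your cluster's ingress values.
EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")
NIM_PROXY_BASE_URL = os.environ.get("NIM_PROXY_BASE_URL", "http://nim.test")
```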

- Run an Academic LM Harness Eval: Run an academic LM Harness evaluation flow.
- Run an LLM Judge Eval: Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.
- Run an Eval as Part of a Fine-Tuning Workflow: Run evaluations before and after fine-tuning a model within a larger workflow. Refer to Customize and Evaluate Large Language Models.

Understanding the Evaluation Workflow#

Before diving into specific evaluation flows, understand the general workflow for evaluating models with NeMo Evaluator. The evaluation process involves creating targets (what to evaluate), configs (how to evaluate), and jobs (running the evaluation).

High-Level Evaluation Process#

At a high level, the evaluation process consists of the following steps; a minimal end-to-end API sketch follows the list:

  1. (Optional) Prepare Custom Data: Determine if your evaluation requires a custom dataset.

    Tip

    Refer to the manage entities tutorials for step-by-step instructions on dataset management.

  2. Create Evaluation Targets and Configs: Set up evaluation targets (the models or pipelines to evaluate) and evaluation configs (the metrics and evaluation settings).

  3. Run an Evaluation Job: Submit an evaluation job by combining targets and configs in a request to NeMo Evaluator.

    Tip

    v2 API Available: The Evaluator API is available in both v1 (current) and v2 (preview). v2 introduces enhanced features like consolidated status information, real-time log access, and improved job structure. For production workloads, continue using v1 until v2 is fully supported. Refer to the v2 Migration Guide for upgrade guidance.

  4. Retrieve Results: Get your evaluation results to analyze model performance.
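
The sketch below strings these four steps together against the Evaluator REST API from Python. It is a minimal illustration under assumptions, not a definitive recipe: the endpoint paths, payload fields, benchmark name, and response shapes are placeholders, so consult the Evaluator API reference and the target, config, and job schema pages for the exact request structures.

```python
import os
import time

import requests

EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")
NIM_PROXY_BASE_URL = os.environ.get("NIM_PROXY_BASE_URL", "http://nim.test")

# Step 2a. Create an evaluation target that points at a model endpoint.
# Field names here are illustrative; see the Target JSON Schema Reference.
target = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/targets",  # assumed endpoint path
    json={
        "type": "model",
        "name": "my-model-target",                  # hypothetical name
        "namespace": "default",
        "model": {
            "api_endpoint": {
                "url": f"{NIM_PROXY_BASE_URL}/v1/completions",
                "model_id": "my-namespace/my-model", # hypothetical model ID
            }
        },
    },
).json()

# Step 2b. Create an evaluation config that selects a flow and its settings.
# The benchmark name and parameters are placeholders; see the config schema.
config = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/configs",  # assumed endpoint path
    json={
        "type": "gsm8k",                            # hypothetical benchmark
        "name": "my-eval-config",
        "namespace": "default",
        "params": {"limit_samples": 100},           # illustrative parameter
    },
).json()

# Step 3. Submit a job that combines the target and the config.
# Assumes the create responses contain an "id"; jobs may also reference
# targets and configs by name, depending on the API version.
job = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs",     # assumed endpoint path
    json={"target": target["id"], "config": config["id"]},
).json()

# Step 4. Poll until the job finishes, then retrieve the results.
while True:
    status = requests.get(
        f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job['id']}/status"
    ).json()
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)

results = requests.get(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job['id']}/results"
).json()
print(results)
```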


Evaluation Flows#

NeMo Evaluator supports multiple evaluation flows, each designed for specific evaluation tasks. An evaluation flow defines the type of evaluation (academic benchmarks, RAG, agentic, etc.) and determines which metrics and processing steps are applied.

Choose an evaluation flow based on what you are evaluating (LLMs, RAG pipelines, agents) and the metrics you need. Each flow includes pre-configured benchmarks and metrics tailored to specific use cases.

For detailed guidance on selecting the right flow, refer to Evaluation Flows.

Available Evaluation Flows#

Review configurations, data formats, and result examples for each evaluation flow.

- Academic Benchmarks: Standard benchmarks for code generation, safety, reasoning, and tool-calling.
- Retrieval: Evaluate document retrieval pipelines on standard or custom datasets. Refer to Retrieval Evaluation Flow.
- RAG: Evaluate Retrieval-Augmented Generation pipelines (retrieval plus generation). Refer to RAG Evaluation Flow.
- Agentic: Assess agent-based and multi-step reasoning models, including topic adherence and tool use. Refer to Agentic Evaluation Flow.
- LLM-as-a-Judge: Use another LLM to evaluate outputs with flexible scoring criteria. Refer to LLM-as-a-Judge Evaluation Flow.
- Template: Create custom prompts, tasks, and metrics using Jinja2 templating. Refer to Template Evaluation Flow.
- Prompt Optimization: Iteratively improve judge prompts using programmatic search over instructions and examples. Refer to Prompt Optimization Task.

Work with Evaluation Targets#

Evaluation targets define what you want to evaluate. Targets can be LLM models, retriever pipelines, RAG pipelines, or direct data sources. Each target type supports different evaluation flows and metrics.

To reuse targets across multiple evaluation jobs, create at least one target that points to your model, pipeline, or data source.
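
To make the distinction between target types concrete, the sketch below builds two hypothetical target payloads, an inline rows data source and a retriever pipeline, and registers them with the service. Every field name, model ID, and endpoint path in it is an illustrative assumption; the Target JSON Schema Reference and the target pages below define the real structure.

```python
import requests

EVALUATOR_BASE_URL = "http://nemo.test"  # minikube demo ingress; adjust for your cluster

# A direct data source target: rows supplied inline, handy for quick tests.
# All field names are illustrative assumptions.
rows_target = {
    "type": "rows",                        # assumed target type name
    "name": "inline-qa-rows",              # hypothetical target name
    "namespace": "default",
    "rows": [
        {"question": "What does NeMo Evaluator do?",
         "answer": "It runs evaluations of LLMs, RAG pipelines, and agents."},
    ],
}

# A retriever pipeline target: an embedding model plus optional reranking.
retriever_target = {
    "type": "retriever",                   # assumed target type name
    "name": "demo-retriever",              # hypothetical target name
    "namespace": "default",
    "retriever": {
        "query_embedding_model": {         # hypothetical field layout
            "api_endpoint": {
                "url": "http://nim.test/v1/embeddings",
                "model_id": "my-embedding-model",
            }
        }
    },
}

for payload in (rows_target, retriever_target):
    resp = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/targets", json=payload)
    resp.raise_for_status()
    print("created target:", resp.json().get("name"))
```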

Manage Targets#

Set up and manage the different types of evaluation targets.

- Target Operations: Create targets for evaluations. Refer to Create and Manage Evaluation Targets.
- Data Source Targets: Configure evaluation targets using direct data input through rows or datasets for quick evaluations and testing.
- LLM Model Targets: Set up evaluation targets for LLM models, including NIM endpoints, chat endpoints, and offline pre-generated outputs.
- Retriever Pipeline Targets: Configure retriever pipeline targets using embedding models and optional reranking for document retrieval.
- RAG Pipeline Targets: Set up RAG pipeline targets combining retrieval and generation capabilities for comprehensive evaluations.
- Target Schema: Reference documentation for the JSON schema used to define evaluation targets. Refer to Target JSON Schema Reference.

Work with Evaluation Configs#

Evaluation configs specify how to evaluate your targets. A config defines the evaluation flow, metrics, datasets, and any additional parameters needed to run the evaluation. Different evaluation flows require different config structures.

Configs are separate from targets, allowing you to reuse the same target with multiple evaluation strategies or compare different configs against the same target.
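
Because configs are decoupled from targets, you can register several of them and run each against the same target. The sketch below creates two hypothetical configs, an academic benchmark with a sample limit and an LLM-as-a-judge config with a custom prompt. The type names, parameters, and endpoint path are assumptions for illustration; check the Evaluation Config Schema for the actual fields.

```python
import requests

EVALUATOR_BASE_URL = "http://nemo.test"      # adjust for your cluster's ingress

# Two hypothetical configs to compare against the same target.
# Field and type names are illustrative; see the Evaluation Config Schema.
configs = [
    {
        "type": "mmlu",                      # placeholder academic benchmark
        "name": "mmlu-smoke-test",
        "namespace": "default",
        "params": {"limit_samples": 50},     # illustrative parameter
    },
    {
        "type": "llm-as-a-judge",            # placeholder judge flow name
        "name": "judge-helpfulness",
        "namespace": "default",
        "params": {
            "judge_prompt": "Rate the helpfulness of the response from 1 to 5.",
        },
    },
]

for cfg in configs:
    resp = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/configs", json=cfg)
    resp.raise_for_status()
    print("created config:", resp.json().get("name"))
```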

Manage Configs#

Learn how to create and customize evaluation configurations for various evaluation types.

- Config Operations: Create configurations for evaluations. Refer to References.
- Config Schema: Reference documentation for the JSON schema used to define evaluation configurations. Refer to Evaluation Config Schema.
- Templating Reference: Guide for using Jinja2 templates in custom evaluation tasks and prompts. Refer to Template Evaluation Flow.

Run Evaluation Jobs#

Evaluation jobs execute the actual evaluation by combining targets and configs. When you submit a job, NeMo Evaluator orchestrates the evaluation workflow, runs the specified metrics, and stores the results.

Jobs can be monitored in real time and support various authentication methods for accessing external services. After a job completes, you can retrieve detailed results, including metrics, logs, and performance data.
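
As a rough sketch of that lifecycle, the snippet below polls a job until it reaches a terminal state and then fetches its results. The job ID, status values, endpoint paths, and response shape are all illustrative assumptions; refer to Run and Manage Evaluation Jobs and Use the Results of Your Job for the actual API.

```python
import time

import requests

EVALUATOR_BASE_URL = "http://nemo.test"   # adjust for your cluster's ingress
job_id = "eval-abc123"                    # hypothetical ID returned at job creation

# Poll the job until it reaches a terminal state.
# Status names and endpoint paths are illustrative assumptions.
while True:
    status = requests.get(
        f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job_id}/status"
    ).json()
    print("job status:", status.get("status"))
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(60)

# Fetch the metrics once the job is done; the response shape is assumed.
results = requests.get(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job_id}/results"
).json()
for task, metrics in results.get("tasks", {}).items():
    print(task, metrics)
```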

Manage Jobs#

Understand how to run evaluation jobs, including how to combine targets and configs.

- Job Operations: Create and run evaluation jobs. Refer to Run and Manage Evaluation Jobs.
- API Key Authentication: Secure authentication for external services in RAG and retriever evaluations. Refer to API Key Authentication for RAG and Retriever Evaluations.
- Job Results: Get the results of your evaluation jobs. Refer to Use the Results of Your Job.
- Job Target and Config Matrix: Learn which evaluation targets and configurations can be combined for different evaluation types. Refer to Job Target and Configuration Matrix.
- Job Durations: View expected evaluation times for different model, hardware, and dataset combinations. Refer to Expected Evaluation Duration.
- Job Schema: Reference for the JSON structure and fields used when creating evaluation jobs. Refer to Job JSON Schema Reference.
- API Reference: View the NeMo Evaluator API reference. Refer to Evaluator API.
- Troubleshooting: Troubleshoot issues that arise when you work with NeMo Evaluator. Refer to Troubleshooting NeMo Evaluator.