About Evaluating#
Evaluation is powered by NVIDIA NeMo Evaluator, a cloud-native microservice for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The service provides automated workflows for over 100 academic benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.
NeMo Evaluator enables real-time evaluation of your LLM applications through APIs, guiding you in refining and optimizing LLMs for better performance and real-world applicability. You can automate the NeMo Evaluator APIs within development pipelines, enabling faster iteration without the need for live data. The service is cost-effective and well suited to pre-deployment checks and regression testing.
NeMo Evaluator is part of the NVIDIA NeMo™ software suite.
See also
For a comprehensive overview of evaluation concepts, capabilities, and how NeMo Evaluator fits into the NeMo ecosystem, refer to Evaluation Concepts.
Get Started#
To begin using NeMo Evaluator, you need to deploy the microservice. Choose the deployment option that best fits your environment and use case.
Installation Options#
Select one of the following deployment methods based on your requirements.
Use Docker Compose for local experimentation, development, testing, or lightweight environments.
Install the full NeMo microservices platform to a minikube Kubernetes cluster on your local machine.
Deploy the NeMo Evaluator microservice using the Helm chart for production environments.
Tutorials#
After deploying NeMo Evaluator, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different flows and techniques.
Tip
The tutorials reference an EVALUATOR_BASE_URL whose value depends on the ingress configuration of your cluster. If you are using the minikube demo installation, it is http://nemo.test, and the demo installation's value for NIM_PROXY_BASE_URL is http://nim.test. Otherwise, consult your cluster administrator for the ingress values.
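If you are following along in Python, a minimal sketch such as the following can capture these values once so that later snippets can reuse them. The defaults shown apply only to the minikube demo installation.

```python
import os

# Base URLs for the Evaluator and NIM proxy services. The defaults below
# match the minikube demo installation; replace them with the ingress
# values provided by your cluster administrator.
EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")
NIM_PROXY_BASE_URL = os.environ.get("NIM_PROXY_BASE_URL", "http://nim.test")
```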
Run an academic LM Harness evaluation flow.
Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.
Run evaluations before and after fine-tuning a model within a larger workflow.
Understanding the Evaluation Workflow#
Before diving into specific evaluation flows, understand the general workflow for evaluating models with NeMo Evaluator. The evaluation process involves creating targets (what to evaluate), configs (how to evaluate), and jobs (running the evaluation).
High-Level Evaluation Process#
At a high level, the evaluation process consists of the following steps:
(Optional) Prepare Custom Data: Determine if your evaluation requires a custom dataset.
Upload your dataset files to NeMo Data Store using the Hugging Face CLI or SDK.
Register the dataset in NeMo Entity Store using the Dataset APIs (a sketch of both steps follows this list).
Tip
Refer to the manage entities tutorials for step-by-step instructions on dataset management.
Create Evaluation Targets and Configs: Set up evaluation targets (the models or pipelines to evaluate) and evaluation configs (the metrics and evaluation settings).
Run an Evaluation Job: Submit an evaluation job by combining targets and configs in a request to NeMo Evaluator.
Tip
v2 API Available: The Evaluator API is available in both v1 (current) and v2 (preview). v2 introduces enhanced features like consolidated status information, real-time log access, and improved job structure. For production workloads, continue using v1 until v2 is fully supported. Refer to the v2 Migration Guide for upgrade guidance.
Retrieve Results: Get your evaluation results to analyze model performance.
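As a rough illustration of the data-preparation step, the sketch below uploads a dataset file to NeMo Data Store through its Hugging Face-compatible API and then registers it with NeMo Entity Store. The host names, the /v1/hf path, the token handling, and the request fields are assumptions based on the demo installation and the Dataset APIs; treat the manage entities tutorials and the API reference as the source of truth.

```python
import os

import requests
from huggingface_hub import HfApi

# Assumed base URLs -- replace with the ingress values for your cluster.
DATA_STORE_BASE_URL = "http://data-store.test"   # NeMo Data Store (demo install)
ENTITY_STORE_BASE_URL = "http://nemo.test"       # NeMo Entity Store (demo install)

# 1. Upload the dataset file through the Data Store's Hugging Face-compatible
#    API. The /v1/hf path and token handling are assumptions; check the
#    manage entities tutorials for your deployment.
hf_api = HfApi(
    endpoint=f"{DATA_STORE_BASE_URL}/v1/hf",
    token=os.environ.get("HF_TOKEN", "dummy"),
)
repo_id = "default/my-eval-dataset"
hf_api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
hf_api.upload_file(
    path_or_fileobj="question_answer_pairs.jsonl",
    path_in_repo="question_answer_pairs.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

# 2. Register the dataset in NeMo Entity Store so evaluation configs can
#    reference it by name (field names are illustrative; the Dataset APIs
#    are the authoritative reference).
resp = requests.post(
    f"{ENTITY_STORE_BASE_URL}/v1/datasets",
    json={
        "name": "my-eval-dataset",
        "namespace": "default",
        "files_url": f"hf://datasets/{repo_id}",
    },
)
resp.raise_for_status()
print(resp.json())
```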
Evaluation Flows#
NeMo Evaluator supports multiple evaluation flows, each designed for specific evaluation tasks. An evaluation flow defines the type of evaluation (academic benchmarks, RAG, agentic, etc.) and determines which metrics and processing steps are applied.
Choose an evaluation flow based on what you are evaluating (LLMs, RAG pipelines, agents) and the metrics you need. Each flow includes pre-configured benchmarks and metrics tailored to specific use cases.
For detailed guidance on selecting the right flow, refer to Evaluation Flows.
Available Evaluation Flows#
Review configurations, data formats, and result examples for each evaluation flow.
Standard benchmarks for code generation, safety, reasoning, and tool-calling.
Evaluate document retrieval pipelines on standard or custom datasets.
Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation).
Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
Use another LLM to evaluate outputs with flexible scoring criteria.
Create custom prompts, tasks, and metrics using Jinja2 templating.
Iteratively improve judge prompts using programmatic search over instructions and examples.
Work with Evaluation Targets#
Evaluation targets define what you want to evaluate. Targets can be LLM models, retriever pipelines, RAG pipelines, or direct data sources. Each target type supports different evaluation flows and metrics.
Create a target once that points to your model, pipeline, or data source, and then reuse it across multiple evaluation jobs.
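As an illustrative sketch only, creating a model target that points at a NIM chat endpoint could look like the following. The endpoint path and payload fields are assumptions; the evaluation target JSON schema reference is authoritative.

```python
import requests

EVALUATOR_BASE_URL = "http://nemo.test"   # demo install value; adjust for your cluster
NIM_PROXY_BASE_URL = "http://nim.test"    # demo install value; adjust for your cluster

# Illustrative model target that points at a chat completions endpoint served
# through the NIM proxy. Field names should be verified against the
# evaluation target JSON schema reference for your Evaluator version.
target = {
    "type": "model",
    "name": "llama-chat-target",
    "namespace": "default",
    "model": {
        "api_endpoint": {
            "url": f"{NIM_PROXY_BASE_URL}/v1/chat/completions",
            "model_id": "meta/llama-3.1-8b-instruct",
        }
    },
}

resp = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/targets", json=target)
resp.raise_for_status()
print(resp.json())
```

Keeping the target separate from any particular benchmark is what lets you reuse the same endpoint across the different evaluation strategies described in the configs section below.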
Manage Targets#
Set up and manage the different types of evaluation targets.
Create targets for evaluations.
Configure evaluation targets using direct data input through rows or datasets for quick evaluations and testing.
Set up evaluation targets for LLM models, including NIM endpoints, chat endpoints, and offline pre-generated outputs.
Configure retriever pipeline targets using embedding models and optional reranking for document retrieval.
Set up RAG pipeline targets combining retrieval and generation capabilities for comprehensive evaluations.
Reference documentation for the JSON schema used to define evaluation targets.
Work with Evaluation Configs#
Evaluation configs specify how to evaluate your targets. A config defines the evaluation flow, metrics, datasets, and any additional parameters needed to run the evaluation. Different evaluation flows require different config structures.
Configs are separate from targets, allowing you to reuse the same target with multiple evaluation strategies or compare different configs against the same target.
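For example, a minimal academic-benchmark config might resemble the sketch below. The task identifier and params fields are illustrative assumptions; refer to the evaluation config JSON schema reference for the exact structure supported by your Evaluator version.

```python
import requests

EVALUATOR_BASE_URL = "http://nemo.test"   # demo install value; adjust for your cluster

# Illustrative academic-benchmark config. Verify flow names and parameters
# against the evaluation config JSON schema reference.
config = {
    "type": "gsm8k",               # evaluation flow / task identifier (assumed)
    "name": "gsm8k-smoke-test",
    "namespace": "default",
    "params": {
        "limit_samples": 20,       # keep the run small for a quick check (assumed field)
        "temperature": 0.0,
    },
}

resp = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/configs", json=config)
resp.raise_for_status()
print(resp.json())
```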
Manage Configs#
Learn how to create and customize evaluation configurations for various evaluation types.
Create configurations for evaluations.
Reference documentation for the JSON schema used to define evaluation configurations.
Guide for using Jinja2 templates in custom evaluation tasks and prompts.
Run Evaluation Jobs#
Evaluation jobs execute the actual evaluation by combining targets and configs. When you submit a job, NeMo Evaluator orchestrates the evaluation workflow, runs the specified metrics, and stores the results.
Jobs can be monitored in real time and support various authentication methods for accessing external services. After a job completes, you can retrieve detailed results, including metrics, logs, and performance data.
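A minimal sketch of that lifecycle, assuming the v1 job endpoints and the illustrative target and config names from the earlier sketches, might look like this:

```python
import time

import requests

EVALUATOR_BASE_URL = "http://nemo.test"   # demo install value; adjust for your cluster

# Submit a job that pairs an existing target with an existing config
# (the names refer to the illustrative objects created in earlier sketches).
resp = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs",
    json={
        "namespace": "default",
        "target": "default/llama-chat-target",
        "config": "default/gsm8k-smoke-test",
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]   # the response is assumed to carry the job id in "id"

# Poll the job until it reaches a terminal state. The exact status strings
# can differ between Evaluator versions, so treat these as assumptions.
while True:
    job = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job_id}").json()
    status = job.get("status")
    print(f"job {job_id}: {status}")
    if status in ("completed", "failed", "cancelled", "error"):
        break
    time.sleep(30)

# Retrieve the results once the job has finished.
results = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job_id}/results")
results.raise_for_status()
print(results.json())
```

In practice you would also pass whatever authentication headers your deployment requires for external services; see the authentication topic under Manage Jobs.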
Manage Jobs#
Understand how to run evaluation jobs, including how to combine targets and configs.
Create and run evaluation jobs.
Secure authentication for external services in RAG and retriever evaluations.
Get the results of your evaluation jobs.
Learn which evaluation targets and configurations can be combined for different evaluation types.
View expected evaluation times for different model, hardware, and dataset combinations.
Reference for the JSON structure and fields used when creating evaluation jobs.
View the NeMo Evaluator API reference.
Troubleshoot issues that arise when you work with NeMo Evaluator.