About Evaluating#

NVIDIA NeMo Evaluator supports evaluation of LLMs through academic benchmarks, custom automated evaluations, and LLM-as-a-Judge. Beyond LLM evaluation, NeMo Evaluator also supports evaluation of Retriever and RAG pipelines.

Typical NeMo Evaluator Workflow#

A typical NeMo Evaluator workflow consists of the following steps (an illustrative API sketch follows the list):

Note

NeMo Evaluator depends on NVIDIA NIM for LLMs and NeMo Data Store.

  1. (Optional) If you are using a custom dataset for evaluation, upload it to NeMo Data Store before you run an evaluation.

  2. Create an evaluation target in NeMo Evaluator.

  3. Create an evaluation configuration in NeMo Evaluator.

  4. Run an evaluation job by submitting a request to NeMo Evaluator.

    1. NeMo Evaluator downloads custom data, if any, from NeMo Data Store.

    2. NeMo Evaluator runs inference with NIM for LLMs, Embeddings, or Reranking, depending on the target being evaluated.

    3. NeMo Evaluator writes the results, including generations, logs, and metrics, to NeMo Data Store.

    4. NeMo Evaluator returns the results.

  5. Get your results.

For more information, see Run and Manage Evaluation Jobs.
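
The steps above map onto calls to the Evaluator REST API. The following is a minimal sketch, not a definitive implementation: the base URL, the endpoint paths (/v1/evaluation/targets, /v1/evaluation/configs, /v1/evaluation/jobs), and every payload field shown are assumptions made for illustration. See the Evaluator API reference and the JSON schema pages later on this page for the actual contract.

```python
"""Minimal sketch of the workflow above against a NeMo Evaluator REST API.

All endpoint paths, payload fields, and response keys are illustrative
placeholders (assumptions for this example), not the authoritative API.
"""
import time

import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # placeholder base URL (assumption)

# Step 2: create an evaluation target (here, a hypothetical LLM model target).
target = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "model": {
            "api_endpoint": {
                "url": "http://nim-llm:8000/v1/completions",  # NIM for LLMs endpoint (placeholder)
                "model_id": "my-model",
            }
        },
    },
).json()

# Step 3: create an evaluation configuration (here, a hypothetical LM Harness run).
config = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/configs",
    json={"type": "lm_harness", "tasks": ["mmlu"]},
).json()

# Step 4: submit a job that pairs the target with the configuration.
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={"target": target["id"], "config": config["id"]},
).json()

# Step 5: poll until the job finishes, then fetch results
# (generations, logs, and metrics are also written to NeMo Data Store).
while True:
    status = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job['id']}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(30)

results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job['id']}/results").json()
print(results)
```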


Task Guides#

The following guides provide detailed information on how to perform common NeMo Evaluator tasks.

Targets

Create targets for evaluations.

Create and Manage Evaluation Targets

Configurations

Create configurations for evaluations.

Create and Manage Evaluation Configurations

Jobs

Create and run evaluation jobs.

Run and Manage Evaluation Jobs

Results

Get the results of your evaluation jobs.

Use the Results of Your Job

Tutorials#

The following tutorials provide step-by-step instructions to complete specific evaluation goals.

Run a Simple Evaluation

Learn how to run an evaluation.

Run a Simple Evaluation

Evaluate a Fine-tuned Model

Learn how to evaluate a fine-tuned model.

Customize the Evaluation Loop

References#

Review API specifications, compatibility guides, and troubleshooting resources to help you effectively use NeMo Evaluator.

Evaluations#

Review configurations, data formats, and result examples for typical options provided by each evaluation type.

Agentic

Assess agent-based and multi-step reasoning models, including topic adherence and tool use.

Agentic Evaluation Types

BFCL

Evaluate tool-calling capabilities with the Berkeley Function Calling Leaderboard or your own dataset.

BFCL Evaluation Type

BigCode

Evaluate code generation models using the BigCode Evaluation Harness.

BigCode Evaluation Type

Custom

Flexible evaluation for custom tasks, metrics, and datasets.

Custom Evaluation Types

LM Harness

Run academic benchmarks for general language understanding and reasoning.

LM Harness Evaluation Type

RAG

Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation).

RAG Evaluation Type

Retriever

Evaluate document retrieval pipelines on standard or custom datasets.

Retriever Evaluation Type

Similarity Metrics

Compare model outputs to ground truth using BLEU, ROUGE, and other metrics.

Similarity Metrics Evaluation Type
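
Which of these evaluation types runs is determined by the evaluation configuration you create. The fragment below is a hypothetical illustration only; the type values and other keys are assumptions, and the Evaluation Config Schema reference under Configs defines the real options.

```python
# Hypothetical configuration payloads; all keys and values are illustrative
# assumptions, not the authoritative schema.
example_configs = {
    "academic_benchmarks": {"type": "lm_harness", "tasks": ["mmlu", "hellaswag"]},
    "code_generation": {"type": "bigcode", "tasks": ["humaneval"]},
    "tool_calling": {"type": "bfcl"},
    "similarity": {"type": "similarity_metrics", "metrics": ["bleu", "rouge"]},
}
```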

Targets#

Set up and manage the different types of evaluation targets.

Data Source Targets

Configure evaluation targets using direct data input through rows or datasets for quick evaluations and testing.

Data Source Targets

LLM Model Targets

Set up evaluation targets for LLM models, including NIM endpoints, chat endpoints, and offline pre-generated outputs.

LLM Model Targets

Retriever Pipeline Targets

Configure retriever pipeline targets using embedding models and optional reranking for document retrieval.

Retriever Pipeline Targets

RAG Pipeline Targets

Set up RAG pipeline targets combining retrieval and generation capabilities for comprehensive evaluations.

RAG Pipeline Targets

Target Schema

Reference documentation for the JSON schema used to define evaluation targets.

Target JSON Schema Reference
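
To give a rough sense of how these target kinds differ, the following is a hypothetical sketch of two target payloads. Every field name is an assumption chosen for illustration; the Target JSON Schema Reference is the authoritative source.

```python
# Hypothetical target payloads; field names are illustrative assumptions only.
llm_model_target = {
    "type": "model",
    "model": {
        "api_endpoint": {
            "url": "http://nim-llm:8000/v1/completions",  # NIM endpoint (placeholder)
            "model_id": "my-model",
        }
    },
}

retriever_pipeline_target = {
    "type": "retriever",
    "retriever": {
        "pipeline": {
            "embedding_model": "my-embedding-nim",  # assumed field
            "reranker_model": "my-reranking-nim",   # optional reranking (assumed field)
        }
    },
}
```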

Configs#

Learn how to create and customize evaluation configurations for various evaluation types.

Passing System Prompts

Learn how to pass system prompts to models for advanced reasoning during evaluation.

Passing System Prompts for Advanced Reasoning

Config Schema

Reference documentation for the JSON schema used to define evaluation configurations.

Evaluation Config Schema
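
For orientation, an evaluation configuration is a JSON document that can also carry a system prompt, as described above. The sketch below is hypothetical: the field names, including where the system prompt goes, are assumptions, and the Evaluation Config Schema documents the real structure.

```python
# Hypothetical evaluation configuration; all field names are illustrative assumptions.
custom_config = {
    "type": "custom",
    "params": {
        "system_prompt": "Think step by step before answering.",  # see Passing System Prompts
        "temperature": 0.0,
    },
    "tasks": {
        "my-task": {"dataset": "my-namespace/my-dataset"},  # dataset in NeMo Data Store (assumed)
    },
}
```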

Jobs#

Understand how to run evaluation jobs, including how to combine targets and configs.

Job Target and Config Matrix

Learn which evaluation targets and configurations can be combined for different evaluation types.

Job Target and Configuration Matrix

Job Durations

View expected evaluation times for different model, hardware, and dataset combinations.

Expected Evaluation Duration

Job Schema

Reference for the JSON structure and fields used when creating evaluation jobs.

Job JSON Schema Reference

API Reference

View the NeMo Evaluator API reference.

Evaluator API

Troubleshooting

Troubleshoot issues that arise when you work with NeMo Evaluator.

Troubleshooting NeMo Evaluator