Evaluation Flows#

Learn about the supported evaluation flows and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.

Choosing an Evaluation Flow#

The evaluation flow to select depends on what you are evaluating and how you want to measure your model's capability for that task:

| What model task are you evaluating? | Evaluation Flow | Use Case |
|---|---|---|
| General Model Generation | LLM-as-a-Judge | Use another LLM to evaluate outputs with flexible scoring criteria |
| Custom Tasks & Templates | Template | Create custom prompts, tasks, and metrics using Jinja2 templating |
| Agent-based Reasoning | Agentic | Assess multi-step reasoning, tool use, and agent goal completion |
| Information Retrieval | Retrieval | Evaluate document retrieval pipelines |
| Retrieval + Generation | RAG | Evaluate complete RAG pipelines (retrieval + generation) |
| Academic Benchmarks | Academic Benchmarks | Standard benchmarks for code, safety, reasoning, and tool-calling |


Evaluation Flows#

Each of the following references includes an evaluation configuration, data format, and result example for the typical options provided by that evaluation flow.

- Academic Benchmarks: Standard benchmarks for code generation, safety, reasoning, and tool-calling. See Academic Benchmarks.
- Retrieval: Evaluate document retrieval pipelines on standard or custom datasets. See Retrieval Evaluation Flow.
- RAG: Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation). See RAG Evaluation Flow.
- Agentic: Assess agent-based and multi-step reasoning models, including tool use. See Agentic Evaluation Flow.
- LLM-as-a-Judge: Use another LLM to evaluate outputs with flexible scoring criteria. See LLM-as-a-Judge Evaluation Flow.
- Prompt Optimization: Iteratively improve judge prompts using programmatic search over instructions and examples. See Prompt Optimization Task.
- Template: Create custom prompts, tasks, and metrics using Jinja2 templating. See Template Evaluation Flow.

Using Custom Data#

You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as JSON, JSONL, or CSV. If a file is missing required fields or is not valid, an error is raised during validation. All string fields must be non-null, but optional fields (such as category or source) may be empty strings.
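For illustration, the sketch below writes a small JSONL dataset and checks that no string field is null. The field names (`prompt`, `ideal_response`) are assumptions for this example; the required fields depend on the evaluation flow you run, so consult that flow's data-format reference.

```python
import json

# Illustrative records only: field names here are assumptions, not the exact
# schema required by any particular evaluation flow.
records = [
    {
        "prompt": "What is the capital of France?",
        "ideal_response": "Paris",
        "category": "",  # optional string fields may be empty, but not null
        "source": "",
    },
]

# Write the records as JSONL: one JSON object per line.
with open("custom_eval_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # Basic sanity check mirroring the validation rule above:
        # every string field must be non-null.
        assert all(value is not None for value in record.values())
        f.write(json.dumps(record) + "\n")
```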

The following evaluation flows support custom datasets:

| Evaluation Use Case | Evaluation Flow |
|---|---|
| Agentic | `agentic` |
| LLM-as-a-Judge | `llm-as-a-judge` |
| Retrieval | `retrieval` |
| RAG | `rag` |
| Template | `template` |
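As a rough illustration of where the lowercase flow name is used, the snippet below sketches an evaluation configuration that selects one of these flows and points it at a custom dataset. The field names (`type`, `tasks`, `files_url`) and the dataset URL are assumptions rather than the definitive NeMo Evaluator schema; see each flow's reference page for its exact configuration format.

```python
# Illustrative only: field names are assumptions, not the exact config schema.
# The lowercase flow identifier from the table above selects the evaluation
# flow, and the custom dataset is referenced by an hf://datasets/... URL.
evaluation_config = {
    "type": "llm-as-a-judge",  # or: agentic, retrieval, rag, template
    "tasks": {
        "my-custom-task": {
            "dataset": {
                "files_url": "hf://datasets/my-namespace/my-custom-dataset",
            },
        },
    },
}
```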

To use custom datasets with these evaluation flows (a sketch of these steps follows the list):

  1. Upload your dataset to NeMo Data Store using the Hugging Face CLI or SDK.

  2. Register your dataset in NeMo Entity Store using the Dataset APIs.

  3. Reference the dataset in your evaluation config using the hf://datasets/{namespace}/{dataset-name} format.
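The sketch below walks through these steps using the Hugging Face Hub SDK pointed at NeMo Data Store. The endpoint URLs, token, Entity Store request payload, and namespace/dataset names are assumptions for illustration; check the dataset management tutorials for the exact APIs and URLs in your deployment.

```python
import requests
from huggingface_hub import HfApi

# Assumed service URLs and names for illustration; substitute your deployment's values.
DATA_STORE_URL = "http://nemo-data-store:3000/v1/hf"
ENTITY_STORE_URL = "http://nemo-entity-store:8000"
NAMESPACE = "my-namespace"
DATASET_NAME = "my-custom-dataset"
repo_id = f"{NAMESPACE}/{DATASET_NAME}"

# 1. Upload the dataset file to NeMo Data Store through its Hugging Face-compatible API.
api = HfApi(endpoint=DATA_STORE_URL, token="dummy")  # placeholder token
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="custom_eval_data.jsonl",
    path_in_repo="custom_eval_data.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

# 2. Register the dataset in NeMo Entity Store. The endpoint path and payload
#    fields below are illustrative; consult the Dataset API reference for the
#    exact schema.
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NAMESPACE,
        "files_url": f"hf://datasets/{repo_id}",
    },
)
response.raise_for_status()

# 3. Reference the dataset in an evaluation config using the hf://datasets/ format.
dataset_reference = f"hf://datasets/{NAMESPACE}/{DATASET_NAME}"
```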

For a complete walkthrough, see the dataset management tutorials or the end-to-end evaluation example.