Evaluation Flows#

Learn about the supported evaluation flows and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.

Choosing an Evaluation Flow#

The evaluation flow to select depends on what you are evaluating and how you want to measure your model's capability for that task:

| What model task are you evaluating? | Evaluation Flow | Use Case |
|---|---|---|
| General Model Generation | LLM-as-a-Judge | Use another LLM to evaluate outputs with flexible scoring criteria |
| Custom Tasks & Templates | Template | Create custom prompts, tasks, and metrics using Jinja2 templating |
| Agent-based Reasoning | Agentic | Assess multi-step reasoning, tool use, and agent goal completion |
| Information Retrieval | Retrieval | Evaluate document retrieval pipelines |
| Retrieval + Generation | RAG | Evaluate complete RAG pipelines (retrieval + generation) |
| Academic Benchmarks | Academic Benchmarks | Standard benchmarks for code, safety, reasoning, and tool-calling |


Evaluation Flows#

Each of the following references includes an evaluation configuration, data format, and result example for the typical options provided by that evaluation flow.

- Academic Benchmarks: Standard benchmarks for code generation, safety, reasoning, and tool-calling. See Academic Benchmarks.
- Retrieval: Evaluate document retrieval pipelines on standard or custom datasets. See Retrieval Evaluation Flow.
- RAG: Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation). See RAG Evaluation Flow.
- Agentic: Assess agent-based and multi-step reasoning models, including tool use. See Agentic Evaluation Flow.
- LLM-as-a-Judge: Use another LLM to evaluate outputs with flexible scoring criteria. See LLM-as-a-Judge Evaluation Flow.
- Prompt Optimization: Iteratively improve judge prompts using programmatic search over instructions and examples. See Prompt Optimization Task.
- Template: Create custom prompts, tasks, and metrics using Jinja2 templating. See Template Evaluation Flow.

Using Custom Data#

You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as JSON, JSONL, or CSV. If a file is missing required fields or is not valid, an error is raised during validation. All string fields must be non-null, but optional fields (such as category or source) may be empty strings.
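For illustration, the sketch below writes a small JSONL dataset and checks that no string field is null. The field names (`prompt`, `ideal_response`) are assumptions for this example; the required fields depend on the evaluation flow you run, so consult that flow's data-format reference.

```python
import json

# Illustrative records only: field names here are assumptions, not the exact
# schema required by any particular evaluation flow.
records = [
    {
        "prompt": "What is the capital of France?",
        "ideal_response": "Paris",
        "category": "",  # optional string fields may be empty, but not null
        "source": "",
    },
]

# Write the records as JSONL: one JSON object per line.
with open("custom_eval_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # Basic sanity check mirroring the validation rule above:
        # every string field must be non-null.
        assert all(value is not None for value in record.values())
        f.write(json.dumps(record) + "\n")
```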

The following evaluation flows support custom datasets:

| Evaluation Use Case | Evaluation Flow |
|---|---|
| Agentic | `agentic` |
| LLM-as-a-Judge | `llm-as-a-judge` |
| Retrieval | `retrieval` |
| RAG | `rag` |
| Template | `template` |
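As a rough illustration of where the lowercase flow name is used, the snippet below sketches an evaluation configuration that selects one of these flows and points it at a custom dataset. The field names (`type`, `tasks`, `files_url`) and the dataset URL are assumptions rather than the definitive NeMo Evaluator schema; see each flow's reference page for its exact configuration format.

```python
# Illustrative only: field names are assumptions, not the exact config schema.
# The lowercase flow identifier from the table above selects the evaluation
# flow, and the custom dataset is referenced by an hf://datasets/... URL.
evaluation_config = {
    "type": "llm-as-a-judge",  # or: agentic, retrieval, rag, template
    "tasks": {
        "my-custom-task": {
            "dataset": {
                "files_url": "hf://datasets/my-namespace/my-custom-dataset",
            },
        },
    },
}
```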

To use custom datasets with these evaluation flows (a sketch of these steps follows the list):

  1. Upload your dataset to NeMo Data Store using the Hugging Face CLI or SDK.

  2. Register your dataset in NeMo Entity Store using the Dataset APIs.

  3. Reference the dataset in your evaluation config using the hf://datasets/{namespace}/{dataset-name} format.
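The sketch below walks through these steps using the Hugging Face Hub SDK pointed at NeMo Data Store. The endpoint URLs, token, Entity Store request payload, and namespace/dataset names are assumptions for illustration; check the dataset management tutorials for the exact APIs and URLs in your deployment.

```python
import requests
from huggingface_hub import HfApi

# Assumed service URLs and names for illustration; substitute your deployment's values.
DATA_STORE_URL = "http://nemo-data-store:3000/v1/hf"
ENTITY_STORE_URL = "http://nemo-entity-store:8000"
NAMESPACE = "my-namespace"
DATASET_NAME = "my-custom-dataset"
repo_id = f"{NAMESPACE}/{DATASET_NAME}"

# 1. Upload the dataset file to NeMo Data Store through its Hugging Face-compatible API.
api = HfApi(endpoint=DATA_STORE_URL, token="dummy")  # placeholder token
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="custom_eval_data.jsonl",
    path_in_repo="custom_eval_data.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

# 2. Register the dataset in NeMo Entity Store. The endpoint path and payload
#    fields below are illustrative; consult the Dataset API reference for the
#    exact schema.
response = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NAMESPACE,
        "files_url": f"hf://datasets/{repo_id}",
    },
)
response.raise_for_status()

# 3. Reference the dataset in an evaluation config using the hf://datasets/ format.
dataset_reference = f"hf://datasets/{NAMESPACE}/{DATASET_NAME}"
```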

For a complete walkthrough, see the dataset management tutorials or the end-to-end evaluation example.