Evaluation Flows#
Learn about the supported evaluation flows and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.
Prerequisites#
- Set up or select an existing evaluation target. 
- If you are using custom data, upload your datasets to NeMo Data Store.
Choosing an Evaluation Flow#
The right evaluation flow for a job depends on what you are evaluating and how you want to measure the model's capability on that task:
| What model task are you evaluating? | Evaluation Flow | Use Case |
|---|---|---|
| General Model Generation | LLM-as-a-Judge | Use another LLM to evaluate outputs with flexible scoring criteria |
| Custom Tasks & Templates | Template | Create custom prompts, tasks, and metrics using Jinja2 templating |
| Agent-based Reasoning | Agentic | Assess multi-step reasoning, tool use, and agent goal completion |
| Information Retrieval | Retrieval | Evaluate document retrieval pipelines |
| Retrieval + Generation | RAG | Evaluate complete RAG pipelines (retrieval + generation) |
| Academic Benchmarks | Academic Benchmarks | Standard benchmarks for code, safety, reasoning, and tool-calling |
Evaluation Flows#
Each of the following references includes an evaluation configuration, a data format, and a result example for the typical options that the evaluation flow provides.
Tip
The tutorials reference an EVALUATOR_BASE_URL whose value depends on the ingress in your cluster. If you are using the minikube demo installation, it is http://nemo.test. The demo installation's value for NIM_PROXY_BASE_URL is http://nim.test. Otherwise, consult your cluster administrator for the ingress values.
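One way to keep the tutorials portable across clusters is to read both base URLs from the environment and fall back to the demo values from the tip. The following is a minimal sketch; the helper name `resolve_base_urls` is hypothetical and not part of NeMo Evaluator.

```python
import os
from typing import Mapping

# Demo values quoted in the tip above (minikube demo installation only).
DEMO_DEFAULTS = {
    "EVALUATOR_BASE_URL": "http://nemo.test",
    "NIM_PROXY_BASE_URL": "http://nim.test",
}


def resolve_base_urls(env: Mapping[str, str] = os.environ) -> dict:
    """Return both base URLs, preferring values in `env` over the demo defaults.

    Trailing slashes are stripped so that paths can be appended consistently.
    """
    return {
        name: env.get(name, default).rstrip("/")
        for name, default in DEMO_DEFAULTS.items()
    }


if __name__ == "__main__":
    # Prints the URLs that would be used in the current shell environment.
    print(resolve_base_urls())
```

On any cluster other than the demo installation, export `EVALUATOR_BASE_URL` and `NIM_PROXY_BASE_URL` with the ingress values from your administrator before running the tutorials.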
- Standard benchmarks for code generation, safety, reasoning, and tool-calling.
- Evaluate document retrieval pipelines on standard or custom datasets.
- Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation).
- Assess agent-based and multi-step reasoning models, including tool use.
- Use another LLM to evaluate outputs with flexible scoring criteria.
- Iteratively improve judge prompts using programmatic search over instructions and examples.
- Create custom prompts, tasks, and metrics using Jinja2 templating.
Using Custom Data#
You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as JSON, JSONL, or CSV. If a file is invalid or is missing required fields, validation raises an error. All string fields must be non-null; optional fields (such as category or source) may be empty strings.
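A local pre-check can catch these problems before you upload a dataset. The sketch below is an illustration and not the validator NeMo Evaluator runs: the field names in `REQUIRED_FIELDS` and `OPTIONAL_FIELDS` are hypothetical placeholders, and treating an empty required field as an error is an assumption about how the rule above is applied.

```python
import csv
import json
from pathlib import Path

# Hypothetical schema for illustration only; the real required and optional
# fields depend on the evaluation flow you run.
REQUIRED_FIELDS = ("question", "answer")
OPTIONAL_FIELDS = ("category", "source")


def load_records(path):
    """Load a .json (list of objects), .jsonl, or .csv file into a list of dicts."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of records")
        return data
    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))
    raise ValueError(f"unsupported dataset format: {suffix}")


def validate_records(records):
    """Return a list of error messages; an empty list means the records pass."""
    errors = []
    for i, row in enumerate(records):
        for field in REQUIRED_FIELDS:
            if field not in row:
                errors.append(f"row {i}: missing required field '{field}'")
            elif row[field] is None:
                errors.append(f"row {i}: field '{field}' must not be null")
            elif row[field] == "":
                # Assumption: only optional fields may be empty.
                errors.append(f"row {i}: required field '{field}' is empty")
        for field in OPTIONAL_FIELDS:
            # Optional fields may be absent or empty, but never null.
            if field in row and row[field] is None:
                errors.append(f"row {i}: field '{field}' must not be null")
    return errors
```

Calling `validate_records(load_records("my_dataset.jsonl"))` on a hypothetical file returns every problem at once, which is usually more convenient than failing on the first bad row.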
The following evaluation flows support custom datasets:
- Agentic
- LLM-as-a-Judge
- Retrieval
- RAG
- Template
To use custom datasets with these evaluation flows:
- Upload your dataset to NeMo Data Store using the Hugging Face CLI or SDK 
- Register your dataset in NeMo Entity Store using the Dataset APIs 
- Reference it in evaluation configs using the hf://datasets/{namespace}/{dataset-name} format
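The first and last steps above can be sketched with the `huggingface_hub` SDK. This is a rough outline under stated assumptions: the function names are hypothetical, the Data Store endpoint URL must come from your own deployment (it is passed in, not hard-coded), and whether your Data Store needs a token is deployment-specific. Registering the dataset in NeMo Entity Store (the second step) is not shown.

```python
from typing import Optional


def dataset_uri(namespace: str, dataset_name: str) -> str:
    """Build the hf://datasets/{namespace}/{dataset-name} reference for evaluation configs."""
    for label, value in (("namespace", namespace), ("dataset_name", dataset_name)):
        # A slash or an empty value would change the shape of the URI.
        if not value or "/" in value:
            raise ValueError(f"{label} must be non-empty and must not contain '/'")
    return f"hf://datasets/{namespace}/{dataset_name}"


def upload_dataset_file(
    data_store_hf_endpoint: str,
    namespace: str,
    dataset_name: str,
    local_path: str,
    path_in_repo: str,
    token: Optional[str] = None,
) -> str:
    """Upload one file to a dataset repo and return the URI to use in configs.

    `data_store_hf_endpoint` is assumed to be the Hugging Face-compatible
    endpoint exposed by your NeMo Data Store; check your deployment for the
    actual value.
    """
    # Imported lazily so dataset_uri() works without the third-party package
    # (pip install huggingface_hub).
    from huggingface_hub import HfApi

    api = HfApi(endpoint=data_store_hf_endpoint, token=token)
    repo_id = f"{namespace}/{dataset_name}"
    # Create the dataset repo if it does not exist yet, then upload the file.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
    return dataset_uri(namespace, dataset_name)
```

For example, `dataset_uri("my-org", "my-dataset")` yields `hf://datasets/my-org/my-dataset`, where both names are placeholders for your own namespace and dataset.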
For a complete walkthrough, see the dataset management tutorials or the end-to-end evaluation example.