Evaluation Flows#
Learn about the supported evaluation flows and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.
Prerequisites#
- Set up or select an existing evaluation target.
- If you use custom data, upload your custom datasets to NeMo Data Store.
Choosing an Evaluation Flow#
The evaluation flow to select for your job depends on what you are evaluating and how you want to measure your model's capability for a task:
| What model task are you evaluating? | Evaluation Flow | Use Case |
|---|---|---|
| General Model Generation | LLM-as-a-Judge | Use another LLM to evaluate outputs with flexible scoring criteria |
| Custom Tasks & Templates | Template | Create custom prompts, tasks, and metrics using Jinja2 templating |
| Agent-based Reasoning | Agentic | Assess multi-step reasoning, tool use, and agent goal completion |
| Information Retrieval | Retrieval | Evaluate document retrieval pipelines |
| Retrieval + Generation | RAG | Evaluate complete RAG pipelines (retrieval + generation) |
| Academic Benchmarks | Academic Benchmarks | Standard benchmarks for code, safety, reasoning, and tool-calling |
Evaluation Flows#
Each of the following references includes an evaluation configuration, data format, and result example for the typical options provided by each evaluation flow:
- Standard benchmarks for code generation, safety, reasoning, and tool-calling.
- Evaluate document retrieval pipelines on standard or custom datasets.
- Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation).
- Assess agent-based and multi-step reasoning models, including tool use.
- Use another LLM to evaluate outputs with flexible scoring criteria.
- Iteratively improve judge prompts using programmatic search over instructions and examples.
- Create custom prompts, tasks, and metrics using Jinja2 templating.
Using Custom Data#
You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as `json`, `jsonl`, or `csv`. If a file is missing required fields or is otherwise invalid, an error is raised during validation. All string fields must be non-null; optional fields (such as `category` or `source`) may be empty strings.
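As a minimal illustration of the `jsonl` format, the sketch below writes a two-record dataset file. The field names `question` and `answer` are hypothetical placeholders; the fields each flow actually requires are listed in that flow's reference.

```python
import json

# Hypothetical records: the required fields depend on the evaluation flow.
# "category" and "source" are optional and may be empty strings, but no
# string field may be null.
records = [
    {"question": "What is the capital of France?", "answer": "Paris",
     "category": "geography", "source": "example"},
    {"question": "What is 2 + 2?", "answer": "4",
     "category": "", "source": ""},  # optional fields left empty
]

# One JSON object per line is what makes the file valid jsonl.
with open("custom_eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```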
The following evaluation flows support custom datasets:

- Agentic
- LLM-as-a-Judge
- Retrieval
- RAG
- Template
To use custom datasets with these evaluation flows:

1. Upload your dataset to NeMo Data Store using the Hugging Face CLI or SDK.
2. Register your dataset in NeMo Entity Store using the Dataset APIs.
3. Reference the dataset in your evaluation config using the `hf://datasets/{namespace}/{dataset-name}` format, as shown in the sketch after this list.
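The following is a minimal sketch of these three steps using the Hugging Face SDK (`huggingface_hub`) against NeMo Data Store's Hugging-Face-compatible endpoint. The service URLs, the namespace, and the exact request body for the Entity Store Dataset API are assumptions here; consult the Dataset API reference for the authoritative schema.

```python
import requests
from huggingface_hub import HfApi

# Assumed service endpoints; replace with your deployment's URLs.
DATA_STORE_URL = "http://nemo-data-store:3000/v1/hf"
ENTITY_STORE_URL = "http://nemo-entity-store:8000"
NAMESPACE, DATASET = "my-namespace", "my-eval-dataset"

# 1. Upload the dataset file to NeMo Data Store through its
#    Hugging-Face-compatible API.
api = HfApi(endpoint=DATA_STORE_URL, token="")
api.create_repo(repo_id=f"{NAMESPACE}/{DATASET}", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="custom_eval_dataset.jsonl",
    path_in_repo="custom_eval_dataset.jsonl",
    repo_id=f"{NAMESPACE}/{DATASET}",
    repo_type="dataset",
)

# 2. Register the dataset in NeMo Entity Store (assumed request body;
#    see the Dataset API reference for the exact schema).
files_url = f"hf://datasets/{NAMESPACE}/{DATASET}"
resp = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={"name": DATASET, "namespace": NAMESPACE, "files_url": files_url},
)
resp.raise_for_status()

# 3. Reference the dataset in an evaluation config by its hf:// URL
#    (hypothetical task name and config fragment shown for illustration).
eval_config_fragment = {"tasks": {"my-task": {"dataset": {"files_url": files_url}}}}
print(eval_config_fragment)
```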
For a complete walkthrough, see the dataset management tutorials or the end-to-end evaluation example.