# Create and Manage Evaluation Targets
When you run an evaluation in NVIDIA NeMo Evaluator, you create two separate resources: a target that describes what to evaluate and a configuration that describes how to evaluate it.
**Tip:** Because NeMo Evaluator separates the target and the configuration, you can create a target once and reuse it multiple times with different configurations (for example, to make a model scorecard). To see which targets and configurations are supported together, refer to Job Target and Configuration Matrix.
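For example, the following sketch creates one model target and then submits two jobs that reuse it with different configurations. This is a minimal sketch, not an authoritative API reference: it assumes the `/v1/evaluation/targets` and `/v1/evaluation/jobs` endpoints used in the task guides, and the namespace `my-org`, the target name, the endpoint URL, the model ID, and the two configuration names are placeholder values to replace for your deployment.

```python
import os
import requests

# Evaluator base URL; see the tip in the Task Guides section below.
EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")

# 1. Create the target once (illustrative model-type payload; adjust the
#    namespace, name, endpoint URL, and model ID to your deployment).
resp = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "namespace": "my-org",
        "name": "llama-chat-target",
        "model": {
            "api_endpoint": {
                "url": "http://nim.test/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct",
            }
        },
    },
)
resp.raise_for_status()

# 2. Reuse the same target across jobs with different, pre-created
#    configurations, for example to build a model scorecard.
for config in ("my-org/gsm8k-config", "my-org/ifeval-config"):  # hypothetical configs
    job = requests.post(
        f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs",
        json={
            "namespace": "my-org",
            "target": "my-org/llama-chat-target",
            "config": config,
        },
    )
    job.raise_for_status()
    print(config, job.json().get("id"))
```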
NeMo Evaluator provides evaluation capabilities for the following target types:

- **Data Sources** - Evaluate pre-generated outputs from any system
- **LLM Models** - Evaluate language model responses in real time
- **RAG Pipelines** - Evaluate end-to-end retrieval-augmented generation systems
- **Retriever Pipelines** - Evaluate document retrieval quality
## Choosing the Right Evaluation Target
Select your evaluation target based on what you want to evaluate:
| Target Type | When to Use | Use Cases |
|---|---|---|
| Data Source | You have pre-generated outputs to evaluate | • **Agent outputs** - topic adherence, tool calls, goal accuracy (agentic flow)<br>• **LLM as judge (offline)** - evaluate pre-generated outputs (LLM as judge)<br>• **Custom metrics** - BLEU, F1, ROUGE, exact match, and more (metrics) |
| LLM Model | You want to evaluate LLM responses in real time | • **Academic benchmarks** - standard model evaluation (academic flow)<br>• **LLM as judge (online)** - real-time response evaluation (LLM as judge) |
| RAG Pipeline | You want to evaluate complete RAG systems end to end | • **RAG system testing** - evaluate retrieval + generation together<br>• **Context utilization** - measure context usage quality<br>• **Answer quality** - assess faithfulness and relevance (RAG flow) |
| Retriever Pipeline | You want to evaluate document retrieval independently | • **Search quality** - measure retrieval relevance<br>• **Retrieval tuning** - compare embeddings and re-ranking (retriever flow) |
**Important:** For agent evaluation, use the Data Source target type. Online evaluation of agents is not currently supported. Generate your agent outputs first, then evaluate them using the dataset target with the appropriate agentic evaluation metrics.
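For example, a data source target for pre-generated agent outputs might be created as in the sketch below. This is a minimal sketch under stated assumptions: the `rows` target type, the row keys, and the namespace and target names are illustrative placeholders; refer to the data source target reference for the exact schema.

```python
import os
import requests

EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")

# Pre-generated agent outputs supplied inline as rows. The row keys shown here
# are illustrative; use the fields expected by the metrics you plan to run.
rows = [
    {
        "question": "What is the refund policy?",
        "agent_response": "Refunds are available within 30 days of purchase.",
        "reference_answer": "Purchases can be refunded within 30 days.",
    },
]

# Create a data source target from the rows (payload fields are illustrative).
resp = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/targets",
    json={
        "type": "rows",                  # assumed data source target type
        "namespace": "my-org",           # placeholder namespace
        "name": "agent-outputs-target",  # placeholder target name
        "rows": rows,
    },
)
resp.raise_for_status()
print(resp.json())
```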
## Task Guides
Perform common evaluation target tasks.
**Tip:** The tutorials reference an EVALUATOR_BASE_URL whose value depends on the ingress configuration of your cluster. If you are using the minikube demo installation, the value is http://nemo.test. Otherwise, consult your cluster administrator for the ingress values.
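Before working through the guides, you might resolve the base URL and verify connectivity as in the sketch below. It assumes the `/v1/evaluation/targets` listing endpoint used in the task guides and a response envelope with a `data` array; confirm both against your deployment.

```python
import os
import requests

# http://nemo.test matches the minikube demo installation; otherwise set
# EVALUATOR_BASE_URL to the ingress value for your cluster.
EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")

# Quick connectivity check: list existing evaluation targets.
resp = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/targets")
resp.raise_for_status()
for target in resp.json().get("data", []):  # "data" envelope is an assumption
    print(target.get("namespace"), target.get("name"), target.get("type"))
```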
- Create and submit a new evaluation target
- Delete an existing evaluation target (a minimal deletion sketch follows this list)
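The sketch below illustrates deletion only, assuming targets are addressed as `namespace/name` under the `/v1/evaluation/targets` endpoint; confirm the exact path and identifiers in the task guide.

```python
import os
import requests

EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")

# Placeholder identifiers of the target to remove.
NAMESPACE, NAME = "my-org", "agent-outputs-target"

# Delete the target, addressing it as namespace/name (assumed path layout).
resp = requests.delete(f"{EVALUATOR_BASE_URL}/v1/evaluation/targets/{NAMESPACE}/{NAME}")
resp.raise_for_status()
print("Deleted target:", f"{NAMESPACE}/{NAME}")
```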
## References
Review detailed specifications for configuring evaluation targets, including data sources, LLM models, RAG integrations, and retrieval pipelines.

- Configure evaluation targets using direct data input through rows or datasets for quick evaluations and testing.
- Set up evaluation targets for LLM models, including NIM endpoints, chat endpoints, and offline pre-generated outputs.
- Configure retriever pipeline targets using embedding models and optional re-ranking for document retrieval (see the illustrative skeleton after this list).
- Set up RAG pipeline targets combining retrieval and generation capabilities for comprehensive evaluations (see the illustrative skeleton after this list).
- Complete auto-generated schema reference with all fields and nested properties.
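The skeletons below sketch how retriever and RAG pipeline target payloads might be shaped. Every field name (for example `pipeline`, `query_embedding_model`, `reranker_model`) and every model ID is an illustrative placeholder, not the authoritative schema; use the auto-generated schema reference above for the exact fields.

```python
# Illustrative payload skeletons only; every field name and model ID below is
# a placeholder, not the authoritative schema.
retriever_target = {
    "type": "retriever",
    "namespace": "my-org",
    "name": "docs-retriever-target",
    "retriever": {
        "pipeline": {
            "query_embedding_model": "nvidia/nv-embedqa-e5-v5",    # embedding model
            "reranker_model": "nvidia/nv-rerankqa-mistral-4b-v3",  # optional re-ranking
        }
    },
}

rag_target = {
    "type": "rag",
    "namespace": "my-org",
    "name": "docs-rag-target",
    "rag": {
        "pipeline": {
            # Retrieval stage reuses the retriever pipeline sketched above.
            "retriever": retriever_target["retriever"]["pipeline"],
            # Generation stage points at a chat completion endpoint.
            "model": {
                "api_endpoint": {
                    "url": "http://nim.test/v1/chat/completions",
                    "model_id": "meta/llama-3.1-8b-instruct",
                }
            },
        }
    },
}

print(retriever_target["name"], rag_target["name"])
```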