Create and Manage Evaluation Targets#

When you run an evaluation in NVIDIA NeMo Evaluator, you create two separate resources: a target, which describes what you evaluate, and a configuration, which describes how you evaluate it.

Tip

Because NeMo Evaluator separates the target and the configuration, you can create a target once, and reuse it multiple times with different configurations (for example, to make a model scorecard). To see what targets and configurations are supported together, refer to Job Target and Configuration Matrix.
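As a minimal sketch of that reuse pattern, assume a REST API at EVALUATOR_BASE_URL with /v1/evaluation/targets and /v1/evaluation/jobs routes; verify the exact paths and payload fields against the API reference, and note that all names below are hypothetical:

```python
import os

import requests

EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")

# Create the target once (assumed route: POST /v1/evaluation/targets).
requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "namespace": "my-org",         # hypothetical namespace
        "name": "llama-chat-target",   # hypothetical target name
        "model": {
            "api_endpoint": {
                "url": "http://nim.my-org.example/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct",
            }
        },
    },
).raise_for_status()

# Reuse it with different configurations, for example to build a model
# scorecard (assumed route: POST /v1/evaluation/jobs, referencing the
# target and configuration by name).
for config in ("my-org/academic-config", "my-org/judge-config"):
    requests.post(
        f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs",
        json={
            "namespace": "my-org",
            "target": "my-org/llama-chat-target",
            "config": config,
        },
    ).raise_for_status()
```

Because both jobs reference the same target by name, a correction to the target definition is picked up by every subsequent run without touching the configurations.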

NeMo Evaluator supports the following target types:

  • Data Sources - Evaluate pre-generated outputs from any system

  • LLM Models - Evaluate language model responses in real time

  • RAG Pipelines - Evaluate end-to-end retrieval-augmented generation systems

  • Retriever Pipelines - Evaluate document retrieval quality


Choosing the Right Evaluation Target#

Select your evaluation target based on what you want to evaluate. A sketch of the corresponding payload shapes follows the table.

| Target Type | When to Use | Use Cases |
|---|---|---|
| Data Source | You have pre-generated outputs to evaluate | Agent outputs - Topic adherence, tool calls, goal accuracy (agentic flow); LLM as judge (offline) - Evaluate pre-generated outputs (LLM as judge); Custom metrics - BLEU, F1, ROUGE, exact match, and so on (metrics) |
| Model | You want to evaluate LLM responses in real time | Academic benchmarks - Standard model evaluation (academic flow); LLM as judge (online) - Real-time response evaluation (LLM as judge) |
| RAG Pipeline | You want to evaluate complete RAG systems end-to-end | RAG system testing - Evaluate retrieval and generation together; Context utilization - Measure context usage quality; Answer quality - Assess faithfulness and relevance (RAG flow) |
| Retriever Pipeline | You want to evaluate document retrieval independently | Search quality - Measure retrieval relevance; Retrieval tuning - Compare embeddings and re-ranking (retriever flow) |
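To make the rows above concrete, the sketch below shows the rough shape of a target payload. The field names are assumptions; the reference pages at the bottom of this page have the authoritative schemas for each target type.

```python
# Illustrative payload shape only; field names here are assumptions.
model_target = {
    "type": "model",
    "model": {
        "api_endpoint": {
            # Hypothetical NIM chat endpoint and model ID.
            "url": "http://nim.my-org.example/v1/chat/completions",
            "model_id": "meta/llama-3.1-8b-instruct",
        }
    },
}

# RAG and retriever targets describe a pipeline (embedding model, optional
# re-ranker, generator) rather than a single endpoint; see the reference
# pages below for their nested structure. Data source targets point at
# pre-generated outputs, as shown in the next example on this page.
```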

Important

For Agent Evaluation: Use the Data Source target type. Online evaluation of agents is not currently supported. Generate your agent outputs first, then evaluate them using a data source (dataset) target with the appropriate agentic evaluation metrics.
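For example, a data source target for pre-generated agent outputs might be created like this. This is a sketch only: the route, field names, and the files_url scheme are assumptions to verify against the Data Source Targets reference.

```python
import os

import requests

EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")

# Assumed route and schema: a "dataset" target pointing at agent outputs
# that were generated ahead of time and uploaded as a dataset.
response = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/targets",
    json={
        "type": "dataset",
        "namespace": "my-org",            # hypothetical namespace
        "name": "agent-outputs-target",   # hypothetical target name
        "dataset": {"files_url": "hf://datasets/my-org/agent-outputs"},
    },
)
response.raise_for_status()
print(response.json())
```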


Task Guides#

Perform common evaluation target tasks.

Tip

The tutorials reference an EVALUATOR_BASE_URL whose value depends on the ingress configuration of your cluster. If you are using the minikube demo installation, the value is http://nemo.test. Otherwise, consult your cluster administrator for the correct ingress values.
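For instance, if you script against the API in Python, you can resolve the value once and fall back to the minikube demo default. The fallback is only valid for that demo installation:

```python
import os

# http://nemo.test is correct only for the minikube demo installation;
# on other clusters, export EVALUATOR_BASE_URL to match your ingress.
EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")
```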

  • Create Evaluation Target - Create and submit a new evaluation target

  • Delete Evaluation Target - Delete an existing evaluation target
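As a rough sketch of the second task: deletion is typically a single request. The route below is an assumption, and the Delete Evaluation Target guide has the authoritative form; the target name reuses the hypothetical target created earlier on this page.

```python
import os

import requests

EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")

# Assumed route: DELETE /v1/evaluation/targets/{namespace}/{target-name}.
response = requests.delete(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/targets/my-org/llama-chat-target"
)
response.raise_for_status()
```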

References#

Review detailed specifications for configuring evaluation targets, including data sources, LLM models, RAG pipelines, and retriever pipelines.

  • Data Source Targets - Configure evaluation targets using direct data input through rows or datasets for quick evaluations and testing

  • LLM Model Targets - Set up evaluation targets for LLM models, including NIM endpoints, chat endpoints, and offline pre-generated outputs

  • Retriever Pipeline Targets - Configure retriever pipeline targets using embedding models and optional re-ranking for document retrieval

  • RAG Pipeline Targets - Set up RAG pipeline targets combining retrieval and generation capabilities for comprehensive evaluations

  • Evaluation Target Schema Reference - Complete auto-generated schema reference with all fields and nested properties