About Evaluating#
Evaluation is powered by NVIDIA NeMo Evaluator, a cloud-native microservice for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The service provides automated workflows for over 100 academic benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.
NeMo Evaluator enables real-time evaluation of your LLM applications through APIs, guiding you in refining and optimizing LLMs for better performance and real-world applicability. You can automate the NeMo Evaluator APIs within development pipelines, enabling faster iteration without the need for live data. The service is cost-effective and well suited to pre-deployment checks and regression testing.
NeMo Evaluator is part of the NVIDIA NeMo™ software suite.
See also
For a comprehensive overview of evaluation concepts, capabilities, and how NeMo Evaluator fits into the NeMo ecosystem, refer to Evaluation Concepts.
Get Started#
To begin using NeMo Evaluator, you need to deploy the microservice. Choose the deployment option that best fits your environment and use case.
Installation Options#
Select one of the following deployment methods based on your requirements.
Use Docker Compose for local experimentation, development, testing, or lightweight environments.
Install the full NeMo microservices platform to a minikube Kubernetes cluster on your local machine.
Deploy the NeMo Evaluator microservice using the Helm chart for production environments.
Tutorials#
After deploying NeMo Evaluator, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different flows and techniques.
Tip
The tutorials reference an EVALUATOR_BASE_URL whose value depends on the ingress configuration of your cluster. If you are using the minikube demo installation, it is http://nemo.test, and the demo installation's value for NIM_PROXY_BASE_URL is http://nim.test. Otherwise, consult your cluster administrator for the ingress values.
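If you are following along in Python, a minimal sketch such as the following can capture these values once so that later snippets can reuse them. The defaults shown apply only to the minikube demo installation.

```python
import os

# Base URLs for the Evaluator and NIM proxy services. The defaults below
# match the minikube demo installation; replace them with the ingress
# values provided by your cluster administrator.
EVALUATOR_BASE_URL = os.environ.get("EVALUATOR_BASE_URL", "http://nemo.test")
NIM_PROXY_BASE_URL = os.environ.get("NIM_PROXY_BASE_URL", "http://nim.test")
```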
Run an academic LM Harness evaluation flow.
Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.
Run evaluations before and after fine-tuning a model within a larger workflow.
Understanding the Evaluation Workflow#
Before diving into specific evaluation flows, understand the general workflow for evaluating models with NeMo Evaluator. The evaluation process involves creating targets (what to evaluate), configs (how to evaluate), and jobs (running the evaluation).
High-Level Evaluation Process#
At a high level, the evaluation process consists of the following steps:
(Optional) Prepare Custom Data: Determine if your evaluation requires a custom dataset.
Upload your dataset files to NeMo Data Store using the Hugging Face CLI or SDK.
Register the dataset in NeMo Entity Store using the Dataset APIs (a sketch of both steps follows this list).
Tip
Refer to the manage entities tutorials for step-by-step instructions on dataset management.
Create Evaluation Targets and Configs: Set up evaluation targets (the models or pipelines to evaluate) and evaluation configs (the metrics and evaluation settings).
Run an Evaluation Job: Submit an evaluation job by combining targets and configs in a request to NeMo Evaluator.
Tip
v2 API Available: The Evaluator API is available in both v1 (current) and v2 (preview). v2 introduces enhanced features like consolidated status information, real-time log access, and improved job structure. For production workloads, continue using v1 until v2 is fully supported. Refer to the v2 Migration Guide for upgrade guidance.
Retrieve Results: Get your evaluation results to analyze model performance.
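As a rough illustration of the data-preparation step, the sketch below uploads a dataset file to NeMo Data Store through its Hugging Face-compatible API and then registers it with NeMo Entity Store. The host names, the /v1/hf path, the token handling, and the request fields are assumptions based on the demo installation and the Dataset APIs; treat the manage entities tutorials and the API reference as the source of truth.

```python
import os

import requests
from huggingface_hub import HfApi

# Assumed base URLs -- replace with the ingress values for your cluster.
DATA_STORE_BASE_URL = "http://data-store.test"   # NeMo Data Store (demo install)
ENTITY_STORE_BASE_URL = "http://nemo.test"       # NeMo Entity Store (demo install)

# 1. Upload the dataset file through the Data Store's Hugging Face-compatible
#    API. The /v1/hf path and token handling are assumptions; check the
#    manage entities tutorials for your deployment.
hf_api = HfApi(
    endpoint=f"{DATA_STORE_BASE_URL}/v1/hf",
    token=os.environ.get("HF_TOKEN", "dummy"),
)
repo_id = "default/my-eval-dataset"
hf_api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
hf_api.upload_file(
    path_or_fileobj="question_answer_pairs.jsonl",
    path_in_repo="question_answer_pairs.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

# 2. Register the dataset in NeMo Entity Store so evaluation configs can
#    reference it by name (field names are illustrative; the Dataset APIs
#    are the authoritative reference).
resp = requests.post(
    f"{ENTITY_STORE_BASE_URL}/v1/datasets",
    json={
        "name": "my-eval-dataset",
        "namespace": "default",
        "files_url": f"hf://datasets/{repo_id}",
    },
)
resp.raise_for_status()
print(resp.json())
```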
Evaluation Flows#
NeMo Evaluator supports multiple evaluation flows, each designed for specific evaluation tasks. An evaluation flow defines the type of evaluation (academic benchmarks, RAG, agentic, etc.) and determines which metrics and processing steps are applied.
Choose an evaluation flow based on what you are evaluating (LLMs, RAG pipelines, agents) and the metrics you need. Each flow includes pre-configured benchmarks and metrics tailored to specific use cases.
For detailed guidance on selecting the right flow, refer to Evaluation Flows.
Available Evaluation Flows#
Review configurations, data formats, and result examples for each evaluation flow.
Standard benchmarks for code generation, safety, reasoning, and tool-calling.
Evaluate document retrieval pipelines on standard or custom datasets.
Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation).
Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
Use another LLM to evaluate outputs with flexible scoring criteria.
Create custom prompts, tasks, and metrics using Jinja2 templating.
Iteratively improve judge prompts using programmatic search over instructions and examples.
Work with Evaluation Targets#
Evaluation targets define what you want to evaluate. Targets can be LLM models, retriever pipelines, RAG pipelines, or direct data sources. Each target type supports different evaluation flows and metrics.
Create a target once that points to your model, pipeline, or data source, and then reuse it across multiple evaluation jobs.
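As an illustrative sketch only, creating a model target that points at a NIM chat endpoint could look like the following. The endpoint path and payload fields are assumptions; the evaluation target JSON schema reference is authoritative.

```python
import requests

EVALUATOR_BASE_URL = "http://nemo.test"   # demo install value; adjust for your cluster
NIM_PROXY_BASE_URL = "http://nim.test"    # demo install value; adjust for your cluster

# Illustrative model target that points at a chat completions endpoint served
# through the NIM proxy. Field names should be verified against the
# evaluation target JSON schema reference for your Evaluator version.
target = {
    "type": "model",
    "name": "llama-chat-target",
    "namespace": "default",
    "model": {
        "api_endpoint": {
            "url": f"{NIM_PROXY_BASE_URL}/v1/chat/completions",
            "model_id": "meta/llama-3.1-8b-instruct",
        }
    },
}

resp = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/targets", json=target)
resp.raise_for_status()
print(resp.json())
```

Keeping the target separate from any particular benchmark is what lets you reuse the same endpoint across the different evaluation strategies described in the configs section below.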
Manage Targets#
Set up and manage the different types of evaluation targets.
Create targets for evaluations.
Configure evaluation targets using direct data input through rows or datasets for quick evaluations and testing.
Set up evaluation targets for LLM models, including NIM endpoints, chat endpoints, and offline pre-generated outputs.
Configure retriever pipeline targets using embedding models and optional reranking for document retrieval.
Set up RAG pipeline targets combining retrieval and generation capabilities for comprehensive evaluations.
Reference documentation for the JSON schema used to define evaluation targets.
Work with Evaluation Configs#
Evaluation configs specify how to evaluate your targets. A config defines the evaluation flow, metrics, datasets, and any additional parameters needed to run the evaluation. Different evaluation flows require different config structures.
Configs are separate from targets, allowing you to reuse the same target with multiple evaluation strategies or compare different configs against the same target.
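For example, a minimal academic-benchmark config might resemble the sketch below. The task identifier and params fields are illustrative assumptions; refer to the evaluation config JSON schema reference for the exact structure supported by your Evaluator version.

```python
import requests

EVALUATOR_BASE_URL = "http://nemo.test"   # demo install value; adjust for your cluster

# Illustrative academic-benchmark config. Verify flow names and parameters
# against the evaluation config JSON schema reference.
config = {
    "type": "gsm8k",               # evaluation flow / task identifier (assumed)
    "name": "gsm8k-smoke-test",
    "namespace": "default",
    "params": {
        "limit_samples": 20,       # keep the run small for a quick check (assumed field)
        "temperature": 0.0,
    },
}

resp = requests.post(f"{EVALUATOR_BASE_URL}/v1/evaluation/configs", json=config)
resp.raise_for_status()
print(resp.json())
```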
Manage Configs#
Learn how to create and customize evaluation configurations for various evaluation types.
Create configurations for evaluations.
Reference documentation for the JSON schema used to define evaluation configurations.
Guide for using Jinja2 templates in custom evaluation tasks and prompts.
Run Evaluation Jobs#
Evaluation jobs execute the actual evaluation by combining targets and configs. When you submit a job, NeMo Evaluator orchestrates the evaluation workflow, runs the specified metrics, and stores the results.
Jobs can be monitored in real time and support various authentication methods for accessing external services. After a job completes, you can retrieve detailed results, including metrics, logs, and performance data.
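A minimal sketch of that lifecycle, assuming the v1 job endpoints and the illustrative target and config names from the earlier sketches, might look like this:

```python
import time

import requests

EVALUATOR_BASE_URL = "http://nemo.test"   # demo install value; adjust for your cluster

# Submit a job that pairs an existing target with an existing config
# (the names refer to the illustrative objects created in earlier sketches).
resp = requests.post(
    f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs",
    json={
        "namespace": "default",
        "target": "default/llama-chat-target",
        "config": "default/gsm8k-smoke-test",
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]   # the response is assumed to carry the job id in "id"

# Poll the job until it reaches a terminal state. The exact status strings
# can differ between Evaluator versions, so treat these as assumptions.
while True:
    job = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job_id}").json()
    status = job.get("status")
    print(f"job {job_id}: {status}")
    if status in ("completed", "failed", "cancelled", "error"):
        break
    time.sleep(30)

# Retrieve the results once the job has finished.
results = requests.get(f"{EVALUATOR_BASE_URL}/v1/evaluation/jobs/{job_id}/results")
results.raise_for_status()
print(results.json())
```

In practice you would also pass whatever authentication headers your deployment requires for external services; see the authentication topic under Manage Jobs.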
Manage Jobs#
Understand how to run evaluation jobs, including how to combine targets and configs.
Create and run evaluation jobs.
Secure authentication for external services in RAG and retriever evaluations.
Get the results of your evaluation jobs.
Learn which evaluation targets and configurations can be combined for different evaluation types.
View expected evaluation times for different model, hardware, and dataset combinations.
Reference for the JSON structure and fields used when creating evaluation jobs.
View the NeMo Evaluator API reference.
Troubleshoot issues that arise when you work with NeMo Evaluator.