About Evaluating
NVIDIA NeMo Evaluator is part of the NVIDIA NeMo software suite for managing the AI agent lifecycle. Use NVIDIA NeMo Evaluator to run academic benchmarks, custom automated evaluations, and LLM-as-a-Judge evaluations on your large language models. You can also assess retriever and retrieval-augmented generation (RAG) pipelines.
NeMo Evaluator Workflow
At a high level, the evaluation workflow consists of the following steps; a brief sketch of the workflow follows the list.
Determine whether your evaluation requires a custom dataset. If it does:
Upload your dataset files to NeMo Data Store using the Hugging Face CLI or SDK.
Register the dataset in NeMo Entity Store using the Dataset APIs.
Tip: Refer to the manage entities tutorials for step-by-step instructions.
Run an evaluation job by submitting a request to NeMo Evaluator.
Get your results.
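The following is a minimal sketch of this workflow in Python, assuming a Hugging Face-compatible upload path on NeMo Data Store; the service URLs, endpoint paths, and payload fields are illustrative assumptions, so consult the task guides and API reference below for the authoritative request formats.

```python
import time
import requests
from huggingface_hub import HfApi

# Hypothetical service URLs; replace with your deployment's endpoints.
NDS_URL = "http://nemo-data-store:3000"              # NeMo Data Store
ENTITY_STORE_URL = "http://nemo-entity-store:8000"   # NeMo Entity Store
EVALUATOR_URL = "http://nemo-evaluator:7331"         # NeMo Evaluator

NAMESPACE = "my-org"
DATASET_NAME = "my-eval-dataset"
repo_id = f"{NAMESPACE}/{DATASET_NAME}"

# 1. Upload dataset files to NeMo Data Store through its
#    Hugging Face-compatible API (the /v1/hf path is an assumption).
hf_api = HfApi(endpoint=f"{NDS_URL}/v1/hf", token="")
hf_api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
hf_api.upload_file(
    path_or_fileobj="questions.jsonl",
    path_in_repo="questions.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

# 2. Register the dataset in NeMo Entity Store (path and payload are illustrative).
requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NAMESPACE,
        "files_url": f"hf://datasets/{repo_id}",
    },
).raise_for_status()

# 3. Submit an evaluation job to NeMo Evaluator (schema is illustrative);
#    the target and config are assumed to exist already.
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={
        "namespace": NAMESPACE,
        "target": f"{NAMESPACE}/my-target",
        "config": f"{NAMESPACE}/my-config",
    },
).json()

# 4. Poll until the job finishes, then fetch the results (field names are illustrative).
while True:
    status = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job['id']}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(30)

results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job['id']}/results").json()
print(results)
```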
Installation Options
You can choose one of the following deployment options or try out the minikube demo in the Get Started section.
Deploy the NeMo Evaluator microservice using Docker. This is the easiest option for local testing.
Deploy the NeMo Evaluator microservice using the parent Helm chart.
Install the full NeMo microservices platform.
Task Guides
The following guides provide detailed information on how to perform common NeMo Evaluator tasks; a brief sketch of creating a target and a configuration follows the list.
Create targets for evaluations.
Create configurations for evaluations.
Create and run evaluation jobs.
Get the results of your evaluation jobs.
Secure authentication for external services in RAG and retriever evaluations.
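As a rough illustration of the target and configuration tasks above, the following sketch creates both through the REST API. The endpoint paths, type names, and payload fields are assumptions for illustration; refer to the Targets and Configs guides for the exact schemas.

```python
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # hypothetical service URL
NAMESPACE = "my-org"

# Create an evaluation target that points at a chat model endpoint.
# The "type" value and field names below are illustrative, not the authoritative schema.
target = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/targets",
    json={
        "name": "my-target",
        "namespace": NAMESPACE,
        "type": "model",
        "model": {
            "api_endpoint": {
                "url": "http://nim-llm:8000/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct",
            }
        },
    },
).json()

# Create an evaluation configuration that selects a flow and its parameters.
config = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/configs",
    json={
        "name": "my-config",
        "namespace": NAMESPACE,
        "type": "gsm8k",                  # an academic benchmark, as an example
        "params": {"limit_samples": 200}, # illustrative parameter
    },
).json()

print(target, config)
```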
Tutorials
The following tutorials provide step-by-step instructions to complete specific evaluation goals.
Run an academic LM Harness evaluation flow.
Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.
Run evaluations before and after fine-tuning a model within a larger workflow.
References
Review API specifications, compatibility guides, and troubleshooting resources to help you effectively use NeMo Evaluator.
Evaluation Flows
Review configurations, data formats, and result examples for typical options provided by each evaluation flow. A sketch of an LLM-as-a-Judge configuration follows the list.
Standard benchmarks for code generation, safety, reasoning, and tool calling. Tags: code-generation, safety-evaluation, reasoning-tasks.
Evaluate document retrieval pipelines on standard or custom datasets. Metrics: recall@k, ndcg@k.
Evaluate retrieval-augmented generation (RAG) pipelines (retrieval plus generation). Metrics: recall@k, faithfulness, answer-relevancy.
Assess agent-based and multi-step reasoning models, including topic adherence and tool use. Metrics: topic-adherence, tool-call-accuracy, goal-accuracy.
Use another LLM to evaluate outputs with flexible scoring criteria. Tags: judge-scoring, flexible-metrics.
Create custom prompts, tasks, and metrics using Jinja2 templating. Tags: jinja2-templates, custom-prompts.
Iteratively improve judge prompts using programmatic search over instructions and examples. Tags: miprov2, bayesian-optimization.
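To make the LLM-as-a-Judge flow concrete, here is a minimal configuration sketch in which a judge model scores answers against a reference. The type names, metric structure, and template variables are illustrative assumptions; see the Configs section below for the authoritative format.

```python
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # hypothetical service URL

# Sketch of an LLM-as-a-Judge configuration: a judge model rates each
# candidate answer against the reference answer on a 1-5 scale.
judge_config = {
    "name": "helpfulness-judge",
    "namespace": "my-org",
    "type": "custom",                 # illustrative type name
    "tasks": {
        "qa-quality": {
            "metrics": {
                "helpfulness": {
                    "type": "llm-judge",      # illustrative metric type
                    "params": {
                        "model": "meta/llama-3.1-70b-instruct",
                        "prompt": (
                            "Rate the answer from 1 to 5 for helpfulness.\n"
                            "Question: {{item.question}}\n"
                            "Reference: {{item.reference}}\n"
                            "Answer: {{output.text}}\n"
                            "Respond with only the number."
                        ),
                    },
                }
            }
        }
    },
}

requests.post(f"{EVALUATOR_URL}/v1/evaluation/configs", json=judge_config).raise_for_status()
```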
Targets
Set up and manage the different types of evaluation targets; a sketch of a RAG pipeline target follows the list.
Configure evaluation targets using direct data input through rows or datasets for quick evaluations and testing.
Set up evaluation targets for LLM models, including NIM endpoints, chat endpoints, and offline pre-generated outputs.
Configure retriever pipeline targets using embedding models and optional reranking for document retrieval.
Set up RAG pipeline targets combining retrieval and generation capabilities for comprehensive evaluations.
Reference documentation for the JSON schema used to define evaluation targets.
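For illustration, here is a rough sketch of a RAG pipeline target that pairs an embedding-based retriever with a chat model for generation. The nesting, field names, and endpoint path are assumptions rather than the authoritative target schema; see the target schema reference above.

```python
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # hypothetical service URL

# Sketch of a RAG pipeline target: an embedding-based retriever supplies
# context to a chat model that generates the final answer.
rag_target = {
    "name": "my-rag-target",
    "namespace": "my-org",
    "type": "rag",
    "rag": {
        "pipeline": {
            "retriever": {
                "query_embedding_model": {
                    "api_endpoint": {
                        "url": "http://nim-embedding:8000/v1/embeddings",
                        "model_id": "nvidia/nv-embedqa-e5-v5",
                    }
                },
                "top_k": 5,
            },
            "model": {
                "api_endpoint": {
                    "url": "http://nim-llm:8000/v1/chat/completions",
                    "model_id": "meta/llama-3.1-8b-instruct",
                }
            },
        }
    },
}

requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", json=rag_target).raise_for_status()
```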
Configs
Learn how to create and customize evaluation configurations for various evaluation types; a short Jinja2 templating sketch follows the list.
Reference documentation for the JSON schema used to define evaluation configurations.
Guide for using Jinja2 templates in custom evaluation tasks and prompts.
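As a small illustration of the templating covered in the Jinja2 guide, the following renders a custom prompt locally. The template variables (item.context, item.question) are assumptions for illustration; the variables actually exposed to a template depend on the evaluation flow and dataset format.

```python
from jinja2 import Template

# A custom prompt template of the kind used in custom evaluation tasks.
# The variable names here are illustrative; check the flow's documentation
# for the variables that are actually available to templates.
prompt_template = Template(
    "Answer the question using only the provided context.\n"
    "Context: {{ item.context }}\n"
    "Question: {{ item.question }}\n"
    "Answer:"
)

# Render the template against one dataset row to preview the final prompt.
row = {
    "context": "NeMo Evaluator runs evaluation jobs against targets and configs.",
    "question": "What does NeMo Evaluator run jobs against?",
}
print(prompt_template.render(item=row))
```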
Jobs
Understand how to run evaluation jobs, including how to combine targets and configs.
Learn which evaluation targets and configurations can be combined for different evaluation types.
View expected evaluation times for different model, hardware, and dataset combinations.
Reference for the JSON structure and fields used when creating evaluation jobs.
View the NeMo Evaluator API reference.
Troubleshoot issues that arise when you work with NeMo Evaluator.