About Evaluating#
NVIDIA NeMo Evaluator supports evaluation of LLMs through academic benchmarks, custom automated evaluations, and LLM-as-a-Judge. Beyond LLM evaluation, NeMo Evaluator also supports evaluation of Retriever and RAG pipelines.
Typical NeMo Evaluator Workflow#
A typical NeMo Evaluator workflow looks like the following:
Note
NeMo Evaluator depends on NVIDIA NIM for LLMs and NeMo Data Store.
1. (Optional) If you are using a custom dataset for evaluation, upload it to NeMo Data Store before you run an evaluation.
2. Create an evaluation target in NeMo Evaluator.
3. Create an evaluation configuration in NeMo Evaluator.
4. Run an evaluation job by submitting a request to NeMo Evaluator. As part of the job:
   - NeMo Evaluator downloads custom data, if any, from NeMo Data Store.
   - NeMo Evaluator runs inference with NIM for LLMs, Embeddings, and Reranking, depending on the model being evaluated.
   - NeMo Evaluator writes the results, including generations, logs, and metrics, to NeMo Data Store.
   - NeMo Evaluator returns the results.
5. Get your results.
For more information, see Run and Manage Evaluation Jobs.
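In practice, the workflow above maps onto the NeMo Evaluator REST API: create a target, create a configuration, submit a job that pairs them, and poll for results. The following Python sketch illustrates that sequence under stated assumptions: the service base URL, NIM endpoint URL, model ID, endpoint paths, and payload field names are all illustrative and should be verified against the API reference for your deployment.

```python
# A minimal sketch of the workflow above, written against a hypothetical
# NeMo Evaluator deployment. Endpoint paths and payload fields are
# assumptions; verify them against the API reference before use.
import time

import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"          # hypothetical service address
NIM_URL = "http://nim-llm:8000/v1/chat/completions"   # hypothetical deployed NIM endpoint

# Step 2: create an evaluation target that points at the deployed NIM.
target = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "model": {
            "api_endpoint": {
                "url": NIM_URL,
                "model_id": "meta/llama-3.1-8b-instruct",
            }
        },
    },
).json()

# Step 3: create an evaluation configuration (an academic benchmark here).
config = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/configs",
    json={"type": "gsm8k", "params": {"limit_samples": 100}},
).json()

# Step 4: submit a job that combines the target and the configuration.
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={"target": target["id"], "config": config["id"]},
).json()

# Step 5: poll until the job finishes, then fetch the results.
while True:
    status = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job['id']}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(30)

results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job['id']}/results").json()
print(results)
```

Because the target and the configuration are separate resources, the same target can later be reused with other configurations without registering the model again.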
Task Guides#
The following guides provide detailed information on how to perform common NeMo Evaluator tasks.
Create targets for evaluations.
Create configurations for evaluations.
Create and run evaluation jobs.
Get the results of your evaluation jobs.
Tutorials#
The following tutorials provide step-by-step instructions to complete specific evaluation goals.
Learn how to run an evaluation.
Learn how to evaluate a fine-tuned model.
References#
Review API specifications, compatibility guides, and troubleshooting resources to help you effectively use NeMo Evaluator.
Evaluations#
Review configurations, data formats, and result examples for typical options provided by each evaluation type.
Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
Evaluate tool-calling capabilities with the Berkeley Function Calling Leaderboard or your own dataset.
Evaluate code generation models using the BigCode Evaluation Harness.
Flexible evaluation for custom tasks, metrics, and datasets.
Run academic benchmarks for general language understanding and reasoning.
Evaluate Retrieval-Augmented Generation (RAG) pipelines, covering both retrieval and generation.
Evaluate document retrieval pipelines on standard or custom datasets.
Compare model outputs to ground truth using BLEU, ROUGE, and other metrics.
Targets#
Set up and manage the different types of evaluation targets.
Configure evaluation targets that take data directly as rows or datasets for quick evaluations and testing.
Set up evaluation targets for LLMs, including NIM endpoints, chat endpoints, and offline pre-generated outputs.
Configure retriever pipeline targets using embedding models and optional reranking for document retrieval, as sketched after this list.
Set up RAG pipeline targets that combine retrieval and generation capabilities for comprehensive evaluations.
Reference documentation for the JSON schema used to define evaluation targets.
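To make the target types above concrete, here is a minimal sketch of a retriever pipeline target expressed as a Python dictionary. The field names, endpoint URLs, and model IDs are illustrative assumptions; the Evaluation Target JSON reference documents the authoritative schema.

```python
# Illustrative sketch of a retriever pipeline target: an embedding model
# plus an optional reranking stage. All field names, URLs, and model IDs
# are assumptions; see the Evaluation Target JSON reference for the
# authoritative schema.
retriever_target = {
    "type": "retriever",
    "retriever": {
        "pipeline": {
            "query_embedding_model": {
                "api_endpoint": {
                    "url": "http://nim-embedding:8000/v1/embeddings",  # hypothetical NIM endpoint
                    "model_id": "nvidia/nv-embedqa-e5-v5",
                }
            },
            "reranker_model": {  # optional reranking stage
                "api_endpoint": {
                    "url": "http://nim-reranking:8000/v1/ranking",     # hypothetical NIM endpoint
                    "model_id": "nvidia/nv-rerankqa-mistral-4b-v3",
                }
            },
        }
    },
}
# Creating the target would be a POST of this dictionary to the targets
# endpoint shown in the workflow sketch above.
```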
Configs#
Learn how to create and customize evaluation configurations for various evaluation types.
Reference documentation for the JSON schema used to define evaluation configurations.
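As one concrete example of the configuration shape, the following sketch defines a custom similarity-metrics evaluation against a dataset stored in NeMo Data Store. The type name, parameter names, and dataset reference are illustrative assumptions; the Evaluation Config JSON reference documents the schema for each evaluation type.

```python
# Illustrative sketch of an evaluation configuration for a custom
# similarity-metrics evaluation against a dataset in NeMo Data Store.
# The type name, parameter names, and dataset reference are assumptions;
# see the Evaluation Config JSON reference for the real schema.
similarity_config = {
    "type": "similarity_metrics",
    "params": {
        "metrics": ["bleu", "rouge"],            # compare generations to ground truth
        "dataset": "my-org/my-eval-dataset",     # hypothetical NeMo Data Store dataset reference
        "limit_samples": 200,                    # evaluate a subset while iterating
    },
}
```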
Jobs#
Understand how to run evaluation jobs, including how to combine targets and configs.
Learn which evaluation targets and configurations can be combined for different evaluation types.
View expected evaluation times for different model, hardware, and dataset combinations.
Reference for the JSON structure and fields used when creating evaluation jobs.
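Because targets and configurations are independent resources, one target can be paired with several compatible configurations. The following sketch submits one job per pairing; the base URL, endpoint path, resource IDs, and field names are illustrative assumptions, so check the compatibility reference before combining a target with a configuration and the Evaluation Job JSON reference for the exact request fields.

```python
# Illustrative sketch of combining one evaluation target with several
# compatible configurations, creating one job per combination. The base
# URL, endpoint path, IDs, and field names are assumptions; see the
# compatibility and Evaluation Job JSON references for specifics.
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"   # hypothetical service address
TARGET_ID = "eval-target-1234"                 # hypothetical existing LLM model target

# Hypothetical IDs of previously created academic-benchmark configurations.
CONFIG_IDS = ["eval-config-gsm8k", "eval-config-mmlu", "eval-config-ifeval"]

job_ids = []
for config_id in CONFIG_IDS:
    job = requests.post(
        f"{EVALUATOR_URL}/v1/evaluation/jobs",
        json={"target": TARGET_ID, "config": config_id},
    ).json()
    job_ids.append(job["id"])

print("Submitted jobs:", job_ids)
```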
View the NeMo Evaluator API reference.
Troubleshoot issues that arise when you work with NeMo Evaluator.