Evaluation Concepts#

NVIDIA NeMo Evaluator is the one-stop shop for evaluating your LLMs as part of the NeMo ecosystem. It enables real-time evaluation of your LLM application through APIs, guiding developers and researchers in refining and optimizing LLMs for better performance and real-world applicability. The NeMo Evaluator APIs can be automated within development pipelines, enabling faster iteration without the need for live data, and are cost-effective for pre-deployment checks and regression testing.
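
For example, a pipeline step can submit an evaluation job over HTTP and wait for it to finish before promoting a model. The sketch below is a minimal outline, not the exact NeMo Evaluator API: the base URL, endpoint paths, payload fields, and status values are assumptions that should be checked against the API reference for your deployment.

```python
# Minimal sketch of driving an evaluation from a CI pipeline over HTTP.
# Endpoint paths, payload fields, and status values are assumptions; check the
# NeMo Evaluator API reference for the exact schema of your deployment.
import os
import time

import requests

EVALUATOR_URL = os.environ.get("EVALUATOR_URL", "http://nemo-evaluator:7331")  # hypothetical base URL


def run_evaluation(job_spec: dict, poll_seconds: int = 30) -> dict:
    """Submit an evaluation job and block until it finishes, returning the job record."""
    resp = requests.post(f"{EVALUATOR_URL}/v1/evaluation/jobs", json=job_spec, timeout=30)
    resp.raise_for_status()
    job_id = resp.json()["id"]  # assumed field name

    while True:
        job = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}", timeout=30).json()
        if job.get("status") in ("completed", "failed"):  # assumed status values
            return job
        time.sleep(poll_seconds)


if __name__ == "__main__":
    job = run_evaluation({
        "target": {"type": "model", "model": {"api_endpoint": {"url": "http://my-model:8000/v1"}}},  # hypothetical target
        "config": "my-org/smoke-test-eval",  # hypothetical pre-created evaluation config
    })
    print(job["status"])
```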

Large Language Models (LLMs) have become pivotal in shaping intelligent applications across many domains. Enterprises today can choose from a large number of LLMs and need a rigorous, systematic evaluation framework to select the model that best suits their use case.

NVIDIA NeMo Evaluator supports evaluation of LLMs through academic benchmarks, custom automated evaluations, and LLM-as-a-Judge. Beyond LLM evaluation, NeMo Evaluator also supports evaluation of Retriever and RAG pipelines. For more information, see Evaluation Types.
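
As a rough illustration of how these evaluation types differ, the sketches below show what configurations for an academic benchmark, a custom automated evaluation, and an LLM-as-a-Judge run might look like. The type identifiers, field names, dataset location, and judge model are assumptions rather than the exact NeMo Evaluator schema; see Evaluation Types for the supported options.

```python
# Illustrative configuration sketches for the main evaluation types.
# Field names and values are assumptions, not the exact NeMo Evaluator schema.

# Academic benchmark: run standard task suites against the target model.
academic_benchmark_config = {
    "type": "lm-harness",              # hypothetical evaluation type name
    "tasks": ["mmlu", "hellaswag"],    # standard academic benchmarks
    "params": {"limit": 200},          # e.g., cap samples for a quick run
}

# Custom automated evaluation: score the model on your own prompts with string metrics.
custom_config = {
    "type": "custom",
    "dataset": {"files_url": "hf://datasets/my-org/support-qa"},  # hypothetical dataset location
    "metrics": ["exact_match", "rouge"],
}

# LLM-as-a-Judge: a stronger model grades the candidate model's answers against a rubric.
llm_judge_config = {
    "type": "llm-as-a-judge",
    "judge": {"model": "meta/llama-3.1-70b-instruct"},  # hypothetical judge model
    "rubric": "Rate the answer for correctness and helpfulness on a 1-5 scale.",
}
```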

NeMo Evaluator Use Cases#

NeMo Evaluator supports the following use cases, grouped by evaluation focus; a sketch of building a model scorecard follows the list.

Models

  • How do I evaluate my model?

  • I have many models; how do I choose the best one?

  • I want to compare models.

Evaluations

  • What are the different evaluation options available?

  • What evaluations should I run?

  • I want to make a model scorecard.

Data

  • What does the evaluation data look like?

  • I have my own data; how do I evaluate LLMs against it?
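
For the model-comparison and scorecard use cases above, a small helper can aggregate per-task scores from several completed evaluation jobs into a single view. The result structure used below (a model name plus a dictionary of task scores) is a hypothetical shape; adapt the extraction step to the payload your evaluation jobs actually return.

```python
# Sketch of turning completed evaluation results into a simple model scorecard.
# The result structure (model name plus per-task scores) is an assumption.
from typing import Dict, List


def build_scorecard(results: List[Dict]) -> str:
    """Render per-model, per-task scores as a plain-text table."""
    tasks = sorted({task for r in results for task in r["scores"]})
    header = ["model"] + tasks
    rows = [
        [r["model"]] + [f'{r["scores"].get(t, float("nan")):.3f}' for t in tasks]
        for r in results
    ]
    widths = [max(len(str(x)) for x in col) for col in zip(header, *rows)]
    lines = ["  ".join(str(x).ljust(w) for x, w in zip(row, widths)) for row in [header] + rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical scores gathered from two finished evaluation jobs.
    print(build_scorecard([
        {"model": "model-a", "scores": {"mmlu": 0.71, "hellaswag": 0.82}},
        {"model": "model-b", "scores": {"mmlu": 0.68, "hellaswag": 0.85}},
    ]))
```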

NeMo Evaluator Interactions with Other Microservices#

The following diagram gives an overview of NeMo Evaluator’s interaction with other NeMo Microservices.

Figure: NeMo Evaluator interaction with other NeMo microservices