Evaluation Concepts#

NVIDIA NeMo Evaluator is the one-stop shop for evaluating your LLMs as part of the NeMo ecosystem. It enables real-time evaluation of your LLM application through APIs, guiding developers and researchers in refining and optimizing LLMs for better performance and real-world applicability. The NeMo Evaluator APIs can be automated within development pipelines, enabling faster iteration without the need for live data, and are cost-effective for pre-deployment checks and regression testing.
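
For example, a pipeline step can submit an evaluation job over HTTP and wait for it to finish before promoting a model. The sketch below is a minimal outline, not the exact NeMo Evaluator API: the base URL, endpoint paths, payload fields, and status values are assumptions that should be checked against the API reference for your deployment.

```python
# Minimal sketch of driving an evaluation from a CI pipeline over HTTP.
# Endpoint paths, payload fields, and status values are assumptions; check the
# NeMo Evaluator API reference for the exact schema of your deployment.
import os
import time

import requests

EVALUATOR_URL = os.environ.get("EVALUATOR_URL", "http://nemo-evaluator:7331")  # hypothetical base URL


def run_evaluation(job_spec: dict, poll_seconds: int = 30) -> dict:
    """Submit an evaluation job and block until it finishes, returning the job record."""
    resp = requests.post(f"{EVALUATOR_URL}/v1/evaluation/jobs", json=job_spec, timeout=30)
    resp.raise_for_status()
    job_id = resp.json()["id"]  # assumed field name

    while True:
        job = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}", timeout=30).json()
        if job.get("status") in ("completed", "failed"):  # assumed status values
            return job
        time.sleep(poll_seconds)


if __name__ == "__main__":
    job = run_evaluation({
        "target": {"type": "model", "model": {"api_endpoint": {"url": "http://my-model:8000/v1"}}},  # hypothetical target
        "config": "my-org/smoke-test-eval",  # hypothetical pre-created evaluation config
    })
    print(job["status"])
```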

Large Language Models (LLMs) have become pivotal in shaping intelligent applications across many domains. Enterprises today can choose from a large number of LLMs and need a rigorous, systematic evaluation framework to select the model that best suits their use case.

NVIDIA NeMo Evaluator supports evaluation of LLMs through academic benchmarks, custom automated evaluations, and LLM-as-a-Judge. Beyond LLM evaluation, NeMo Evaluator also supports evaluation of Retriever and RAG pipelines. For more information, see Evaluation Types.
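
As a rough illustration of how these evaluation types differ, the sketches below show what configurations for an academic benchmark, a custom automated evaluation, and an LLM-as-a-Judge run might look like. The type identifiers, field names, dataset location, and judge model are assumptions rather than the exact NeMo Evaluator schema; see Evaluation Types for the supported options.

```python
# Illustrative configuration sketches for the main evaluation types.
# Field names and values are assumptions, not the exact NeMo Evaluator schema.

# Academic benchmark: run standard task suites against the target model.
academic_benchmark_config = {
    "type": "lm-harness",              # hypothetical evaluation type name
    "tasks": ["mmlu", "hellaswag"],    # standard academic benchmarks
    "params": {"limit": 200},          # e.g., cap samples for a quick run
}

# Custom automated evaluation: score the model on your own prompts with string metrics.
custom_config = {
    "type": "custom",
    "dataset": {"files_url": "hf://datasets/my-org/support-qa"},  # hypothetical dataset location
    "metrics": ["exact_match", "rouge"],
}

# LLM-as-a-Judge: a stronger model grades the candidate model's answers against a rubric.
llm_judge_config = {
    "type": "llm-as-a-judge",
    "judge": {"model": "meta/llama-3.1-70b-instruct"},  # hypothetical judge model
    "rubric": "Rate the answer for correctness and helpfulness on a 1-5 scale.",
}
```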

NeMo Evaluator Use Cases#

NeMo Evaluator supports the following use cases, grouped by evaluation focus; a sketch of building a model scorecard follows the list.

Models

  • How do I evaluate my model?

  • I have many models; how do I choose the best one?

  • I want to compare models.

Evaluations

  • What are the different evaluation options available?

  • What evaluations should I run?

  • I want to make a model scorecard.

Data

  • What does the evaluation data look like?

  • I have my own data; how do I evaluate LLMs against it?
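
For the model-comparison and scorecard use cases above, a small helper can aggregate per-task scores from several completed evaluation jobs into a single view. The result structure used below (a model name plus a dictionary of task scores) is a hypothetical shape; adapt the extraction step to the payload your evaluation jobs actually return.

```python
# Sketch of turning completed evaluation results into a simple model scorecard.
# The result structure (model name plus per-task scores) is an assumption.
from typing import Dict, List


def build_scorecard(results: List[Dict]) -> str:
    """Render per-model, per-task scores as a plain-text table."""
    tasks = sorted({task for r in results for task in r["scores"]})
    header = ["model"] + tasks
    rows = [
        [r["model"]] + [f'{r["scores"].get(t, float("nan")):.3f}' for t in tasks]
        for r in results
    ]
    widths = [max(len(str(x)) for x in col) for col in zip(header, *rows)]
    lines = ["  ".join(str(x).ljust(w) for x, w in zip(row, widths)) for row in [header] + rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical scores gathered from two finished evaluation jobs.
    print(build_scorecard([
        {"model": "model-a", "scores": {"mmlu": 0.71, "hellaswag": 0.82}},
        {"model": "model-b", "scores": {"mmlu": 0.68, "hellaswag": 0.85}},
    ]))
```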

NeMo Evaluator Interactions with Other Microservices#

The following diagram gives an overview of NeMo Evaluator’s interaction with other NeMo Microservices.

Figure: NeMo Evaluator interaction with other NeMo microservices