Evaluation Types#
Evaluating large language models (LLMs) is a complex and evolving field. NVIDIA NeMo Evaluator supports evaluation of LLMs through academic benchmarks, custom automated evaluations, and LLM-as-a-Judge. Beyond LLM evaluation, NeMo Evaluator also supports evaluation of retriever and RAG pipelines.
Tip
If you are ready to start creating an evaluation job, refer to Run and Manage Evaluation Jobs.
NeMo Evaluator supports several popular academic benchmarks and tasks, including the following.
You can also use the generative alternative of a task, such as mmlu_str instead of mmlu; a hypothetical job-submission sketch that selects such a variant follows the task list below.
Note
For the full list of BigCode tasks, refer to tasks. For the full list of LM Harness tasks, refer to tasks.
General Reasoning: BBH (BIG-Bench Hard), HellaSwag, TruthfulQA, Winogrande, ARC
Math Reasoning: GSM8K, MATH
Coding: HumanEval (code)
Multilingual: MGSM
Instruction-Following: IFEval
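The exact job payload depends on your NeMo Evaluator version and deployment. The following Python sketch is only a rough, hypothetical illustration of selecting a generative task variant such as mmlu_str; the endpoint path, port, and field names are assumptions, not the authoritative schema. Refer to Run and Manage Evaluation Jobs for the actual API.

# Hypothetical sketch: submit an LM Evaluation Harness job that uses the
# generative "mmlu_str" task instead of the logprobs-based "mmlu" task.
# The endpoint path, port, and payload fields below are assumptions; consult
# "Run and Manage Evaluation Jobs" for the real schema.
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # placeholder service address

payload = {
    "target": {
        "type": "model",
        "model": {"api_endpoint": {"url": "http://nim:8000/v1", "model_id": "my-model"}},
    },
    "config": {"type": "lm_harness", "tasks": ["mmlu_str"]},
}

response = requests.post(f"{EVALUATOR_URL}/v1/evaluation/jobs", json=payload, timeout=30)
response.raise_for_status()
print(response.json())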
BigCode Evaluation Harness#
BigCode Evaluation Harness is a framework for evaluating code generation models. It includes seven Python code generation tasks, such as HumanEval, HumanEval+, MBPP, and MBPP+, and also supports HumanEvalPack (HumanEval extended to six programming languages), PAL, and MultiPL-E. For the full list of tasks, refer to tasks.
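To make the code generation setting concrete, a HumanEval-style problem pairs a function signature and docstring (the prompt) with unit tests, and the evaluated model must complete the function body. The following Python sketch shows an illustrative problem record and how a completion is checked; it is a toy example, not an actual dataset entry or the harness's internal implementation.

# Illustrative HumanEval-style problem (not an actual dataset record).
# The model receives the prompt and must generate the function body;
# the generated code is then executed against the unit tests.
problem = {
    "task_id": "Example/0",
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n    assert candidate(-1, 1) == 0\n",
}

completion = "    return a + b\n"  # what the evaluated model might produce
program = problem["prompt"] + completion + "\n" + problem["test"]

namespace: dict = {}
exec(program, namespace)  # run the candidate function and its tests
namespace["check"](namespace[problem["entry_point"]])
print("candidate passed all tests")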
LM Evaluation Harness#
LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and HellaSwag.
For the full list of tasks, refer to tasks.
Note
Currently, NeMo Evaluator does not support tasks that require logprobs. If you submit a task that requires logprobs, you will see the error message: NIM doesn't support logprobs-based evaluations.
Similarity Metrics#
Similarity Metrics evaluation lets you evaluate a model on custom datasets by comparing the LLM-generated response with a ground-truth response. The comparison methodology is determined by the scorers (metrics) that you select.
Similarity Metrics evaluation is best suited for use cases where the LLM generations are not expected to be highly creative. The automated scorers used here, such as the F1 score and ROUGE scores, compare LLM generations against a ground-truth ideal response, and this comparison is a good indicator of LLM performance only when the use case has a well-defined expected answer.
For example, if the use case is story generation, many different stories could be equally valid, so comparing the generation to a single ground-truth story and penalizing the model for not reproducing it exactly is not meaningful.
Similarity Metrics evaluation can be run on foundational, aligned, or fine-tuned models.
You can either run inference on an input file through NeMo Evaluator or bring your own output file that contains the generated answers.
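As an illustration of how such scorers work, the following sketch computes a token-level F1 score between a generated answer and a ground-truth answer. It is a simplified stand-in for the scorers NeMo Evaluator provides, which follow the same idea of measuring lexical overlap rather than judging creativity.

# Minimal sketch of a token-overlap F1 scorer, the idea behind similarity
# metrics such as F1 and ROUGE: compare the model's answer to a ground truth.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France", "The capital of France is Paris"))  # 1.0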
LLM-as-a-Judge#
With LLM-as-a-Judge, an LLM is evaluated by using another LLM as a judge. MT-Bench is an LLM-as-a-Judge evaluation framework that is widely adopted by the community. MT-Bench comes with its own dataset and supports OpenAI models as judges.
NeMo Evaluator extends these capabilities in the following ways:
You can use your own custom data. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator.
You can use any NIM model as a judge model.
LLM-as-a-Judge can be used for any use case, even highly generative ones, but the choice of judge is crucial for reliable evaluations. The judge model must have sufficient domain knowledge of the use case, relative to the model being evaluated, to judge it effectively. In general, use an LLM that is widely regarded as a high-quality model as the judge.
LLM-as-a-Judge evaluation can be run on foundational, aligned, or fine-tuned models.
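Because NIM exposes an OpenAI-compatible API, a judge model served by NIM can be queried like any chat completion endpoint. The following Python sketch shows the general mechanism of sending a judge prompt to such an endpoint; the URL, model name, and prompt wording are placeholders, and in practice NeMo Evaluator issues this call for you.

# Illustrative sketch of querying a judge model behind an OpenAI-compatible
# endpoint (as exposed by NIM). NeMo Evaluator performs this kind of request
# on your behalf; the URL and model name here are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://judge-nim:8000/v1", api_key="not-used")

judge_prompt = (
    "Please act as an impartial judge and evaluate the quality of the response "
    "provided by an AI assistant to the user question displayed below. "
    "Rate the response on a scale of 1 to 10 in the format: Rating: [[rating]]."
)

completion = client.chat.completions.create(
    model="judge-model",
    messages=[{"role": "user", "content": judge_prompt}],
)
print(completion.choices[0].message.content)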
Limitations#
The following are limitations of LLM-as-a-Judge Evaluations:
Currently, NeMo Evaluator only supports single-mode evaluation, where the Judge LLM is asked to directly assign scores to the answers produced by the User LLM. NeMo Evaluator does not support pair-wise model evaluations.
Currently, only one metric is generated per judgement. To generate multiple metrics, you must run multiple evaluation jobs.
LLM-as-a-Judge is designed for chat models. For non-chat models, a default chat template is applied (see Default Chat Template below).
Default Chat Template#
For non-chat models that do not use a chat template, NeMo Evaluator uses a default chat template in the following formats:
Default prompt format for non-chat user models
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. User: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. Assistant:
Default prompt format for non-chat judge models
You are a helpful assistant. User: [Instruction] Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]". [Question] Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. [The Start of Assistant's Answer] <User LLM answer> [The End of Assistant's Answer] Assistant:
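Because the judge is instructed to report its score in the strict "[[rating]]" format shown above, the numeric score can be recovered from the judge's free-text response with a simple pattern match. The following sketch illustrates the idea; it is not the parser NeMo Evaluator uses internally.

# Sketch: extract the 1-10 score from a judge response that follows the
# 'Rating: [[5]]' convention used in the default judge prompt above.
import re

def parse_rating(judge_response: str) -> int | None:
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_response)
    return None if match is None else int(float(match.group(1)))

print(parse_rating("The answer is accurate and well structured. Rating: [[8]]"))  # 8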
Retriever Pipelines#
Retriever pipelines are used to retrieve relevant documents based on a query. NeMo Evaluator provides support for evaluation of retriever pipelines on standard academic datasets and custom datasets. For more information about the supported retriever pipelines, refer to Retriever Pipeline Targets.
Before you can run a retriever evaluation, you must set up a Milvus document store. For more information, refer to Configure Milvus.
For standard academic datasets, only the BEIR format is supported.
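A BEIR-formatted dataset consists of a corpus file, a queries file, and relevance judgments (qrels). The following Python sketch writes a minimal example of that layout; the file names and fields follow the public BEIR convention, and the records themselves are placeholders.

# Sketch of a minimal BEIR-formatted dataset: corpus.jsonl, queries.jsonl,
# and qrels/test.tsv. The records here are placeholders.
import json
import os

os.makedirs("my_beir_dataset/qrels", exist_ok=True)

with open("my_beir_dataset/corpus.jsonl", "w") as f:
    f.write(json.dumps({"_id": "doc1", "title": "Hawaii", "text": "Hawaii is a U.S. state in the Pacific Ocean."}) + "\n")

with open("my_beir_dataset/queries.jsonl", "w") as f:
    f.write(json.dumps({"_id": "q1", "text": "Where is Hawaii located?"}) + "\n")

with open("my_beir_dataset/qrels/test.tsv", "w") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.write("q1\tdoc1\t1\n")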
RAG Pipelines#
Retrieval Augmented Generation (RAG) pipelines are built by chaining NeMo Retriever with an LLM: the retriever pipeline retrieves relevant documents for a query, and the LLM generates an answer based on the query and the retrieved documents. NeMo Evaluator supports evaluation of RAG pipelines. For more information about the supported RAG pipelines, refer to RAG Pipeline Targets.
Before you can run a RAG evaluation that has a retriever step, you must set up a Milvus document store. For more information, refer to Configure Milvus.
For standard academic datasets, only the BEIR format is supported.
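Conceptually, a RAG evaluation exercises two stages: retrieve documents for the query, then prompt the LLM with the query and the retrieved context. The following toy sketch shows only that data flow; a real pipeline uses NeMo Retriever with a Milvus document store and an LLM endpoint rather than the keyword matching shown here.

# Toy sketch of the retrieve-then-generate flow that a RAG evaluation
# exercises. Real pipelines use NeMo Retriever and an LLM endpoint; the
# keyword-overlap "retriever" below only illustrates the composition.
def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    query_terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(query_terms & set(doc.lower().split())), reverse=True)
    return ranked[:top_k]

corpus = [
    "Hawaii is a U.S. state in the Pacific Ocean.",
    "Milvus is a vector database that can serve as a document store.",
    "BEIR is a common format for retrieval benchmark datasets.",
]

query = "Where is Hawaii located?"
context = "\n".join(retrieve(query, corpus))
prompt = f"Answer the question using the context below.\n\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # in a real pipeline, this prompt is sent to the LLM and the answer is scored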
Best Practices#
The following are some best practices when you work with NeMo Evaluator:
For models that do not support chat functionality, we recommend that you use Academic Benchmarks, such as LM Evaluation Harness and BigCode Evaluation Harness, and Similarity Metrics Evaluation with appropriate custom data.
For non-chat models, we recommend that you do not use LLM-as-a-Judge, because this benchmark relies on a conversational structure, which is best leveraged by chat models.
For chat models, you can use the standard MT-Bench dataset or a Similarity Metrics evaluation with appropriate custom data.
You can use LLM-as-a-Judge evaluation for chat-enabled models in NIM.
Next Steps#
To start creating an evaluation job, refer to Run and Manage Evaluation Jobs. To learn about evaluation targets, refer to Create and Manage Evaluation Targets. To learn about evaluation configurations, refer to Create and Manage Evaluation Configurations.