Evaluation Types#

Learn about the supported evaluation types and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.

Choosing an Evaluation Type#

The evaluation type you select for a job depends on what you are evaluating. Some evaluation types assess your model against academic benchmarks, while others let you bring your own proprietary data as a custom dataset. The table below outlines which types to choose based on how you want to evaluate your model's capability for a task.

| What model task are you evaluating? | Evaluate with Academic Benchmarks | Evaluate with Custom Datasets |
|---|---|---|
| General Model Generation | LM Eval Harness (gsm8k, ifeval, truthfulqa, etc.) | Custom (LLM-as-a-Judge, similarity metrics like BLEU, ROUGE, F1, string check, etc.); Agentic with LLM-as-a-Judge (topic adherence, agent goal accuracy) |
| Code | BigCode (humaneval, mbpp, etc.) | |
| Function Calling | BFCL (simple, multiple, etc.) | BFCL (simple, multiple, etc.); Agentic (tool call accuracy); Custom (tool call accuracy) |
| Information Retrieval & Answer Generation | Retriever (BEIR, FiQA, etc.); RAG (SQuAD, BEIR, RAGAS with judge models) | Retriever (BEIR); RAG (SQuAD, BEIR, RAGAS with judge models) |
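If you want to encode this decision in a script or pipeline, the sketch below captures the table as a plain lookup. The task and type names are taken directly from the table above; the helper function itself is illustrative and not part of NeMo Evaluator.

```python
# Illustrative lookup built from the table above; not part of NeMo Evaluator.
# Keys are model tasks; values list the evaluation types suggested for
# academic benchmarks versus custom datasets.
EVALUATION_CHOICES = {
    "general model generation": {
        "academic benchmarks": ["lm-eval-harness (gsm8k, ifeval, truthfulqa)"],
        "custom datasets": [
            "custom (LLM-as-a-Judge, BLEU, ROUGE, F1, string check)",
            "agentic with LLM-as-a-Judge (topic adherence, agent goal accuracy)",
        ],
    },
    "code": {
        "academic benchmarks": ["bigcode (humaneval, mbpp)"],
        "custom datasets": [],
    },
    "function calling": {
        "academic benchmarks": ["bfcl (simple, multiple)"],
        "custom datasets": [
            "bfcl (simple, multiple)",
            "agentic (tool call accuracy)",
            "custom (tool call accuracy)",
        ],
    },
    "information retrieval & answer generation": {
        "academic benchmarks": ["retriever (BEIR, FiQA)", "rag (SQuAD, BEIR, RAGAS)"],
        "custom datasets": ["retriever (BEIR)", "rag (SQuAD, BEIR, RAGAS)"],
    },
}


def suggest_evaluation_types(task: str, data_source: str) -> list[str]:
    """Return the evaluation types the table suggests for a task and data source."""
    return EVALUATION_CHOICES[task.lower()][data_source.lower()]


print(suggest_evaluation_types("Function Calling", "custom datasets"))
# ['bfcl (simple, multiple)', 'agentic (tool call accuracy)', 'custom (tool call accuracy)']
```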


Options#

Each of the following references includes an evaluation configuration, data format, and example results for the typical options provided by that evaluation type.

- Agentic: Assess agent-based and multi-step reasoning models, including topic adherence and tool use. See Agentic Evaluation Types.
- BFCL: Evaluate tool-calling capabilities with the Berkeley Function Calling Leaderboard or your own dataset. See BFCL Evaluation Type.
- BigCode: Evaluate code generation models using the BigCode Evaluation Harness. See BigCode Evaluation Type.
- Custom: Flexible evaluation for custom tasks, metrics, and datasets. See Custom Evaluation Types.
- LM Harness: Run academic benchmarks for general language understanding and reasoning. See LM Harness Evaluation Type.
- RAG: Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation). See RAG Evaluation Type.
- Retriever: Evaluate document retrieval pipelines on standard or custom datasets. See Retriever Evaluation Type.

Using Custom Data#

You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as JSON, JSONL, or CSV. If a file is missing required fields or is otherwise invalid, an error is raised during validation. All string fields must be non-null; optional fields (such as category or source) may be empty strings.
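
As a rough local check before uploading, the sketch below writes a small JSONL dataset and applies the same kind of validation. The field names (prompt, ideal_response, category, source) are illustrative assumptions; the fields your dataset actually requires depend on the evaluation type, as described in the per-type guides above.

```python
import json

# Hypothetical records for an LLM-as-a-Judge style custom dataset.
# Field names are illustrative; required fields vary by evaluation type.
records = [
    {
        "prompt": "What is the capital of France?",
        "ideal_response": "Paris",
        "category": "geography",
        "source": "",  # optional fields may be empty strings
    },
    {
        "prompt": "Summarize the water cycle in one sentence.",
        "ideal_response": "Water evaporates, condenses into clouds, and returns as precipitation.",
        "category": "",
        "source": "",
    },
]

REQUIRED_FIELDS = ["prompt", "ideal_response"]  # assumed required fields
OPTIONAL_FIELDS = ["category", "source"]


def validate(record: dict) -> None:
    """Mirror the service-side checks: required fields present, all strings non-null."""
    for field in REQUIRED_FIELDS:
        if field not in record:
            raise ValueError(f"missing required field: {field}")
    for field in REQUIRED_FIELDS + OPTIONAL_FIELDS:
        value = record.get(field)
        if value is None or not isinstance(value, str):
            raise ValueError(f"field must be a non-null string: {field}")


with open("custom_eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        validate(record)
        f.write(json.dumps(record) + "\n")
```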

The following evaluation types support custom datasets:

| Evaluation Use Case | Evaluation Type Value |
|---|---|
| BFCL | bfcl |
| LLM-as-a-Judge | custom |
| Retriever | retriever |
| RAG | rag |
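
To show where these values fit, the sketch below builds an evaluation configuration that selects the custom type and points at the JSONL file from the earlier example. The surrounding field names (tasks, dataset, files_url, metrics) are assumptions for illustration only; see the Custom Evaluation Types guide for the exact configuration schema.

```python
import json

# Illustrative only: field names other than "type" are assumptions; consult the
# Custom Evaluation Types guide for the exact configuration schema.
evaluation_config = {
    "type": "custom",  # evaluation type value from the table above
    "tasks": {
        "my-qa-task": {
            "dataset": {
                # Hypothetical reference to the JSONL file created earlier.
                "files_url": "file://custom_eval_dataset.jsonl",
            },
            # Similarity metrics of the kind listed for the custom evaluation type.
            "metrics": ["bleu", "rouge", "f1", "string-check"],
        },
    },
}

print(json.dumps(evaluation_config, indent=2))
```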