Evaluation Types#

Learn about the supported evaluation types and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.

Choosing an Evaluation Type#

The evaluation type you select for a job depends on what you are evaluating. Some evaluation types assess your model against academic benchmarks, while others let you bring your own proprietary data as a custom dataset. The table below outlines which types to choose based on how you want to evaluate your model's capability for a task.

| What model task are you evaluating? | Evaluate with Academic Benchmarks | Evaluate with Custom Datasets |
|---|---|---|
| General Model Generation | LM Eval Harness (gsm8k, ifeval, truthfulqa, etc.) | Custom (LLM-as-a-Judge, similarity metrics like BLEU, ROUGE, F1, string check, etc.); Agentic with LLM-as-a-Judge (topic adherence, agent goal accuracy) |
| Code | BigCode (humaneval, mbpp, etc.) | |
| Function Calling | BFCL (simple, multiple, etc.) | BFCL (simple, multiple, etc.); Agentic (tool call accuracy); Custom (tool call accuracy) |
| Information Retrieval & Answer Generation | Retriever (BEIR, FiQA, etc.); RAG (SQuAD, BEIR, RAGAS with judge models) | Retriever (BEIR); RAG (SQuAD, BEIR, RAGAS with judge models) |
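If you want to encode this decision in a script or pipeline, the sketch below captures the table as a plain lookup. The task and type names are taken directly from the table above; the helper function itself is illustrative and not part of NeMo Evaluator.

```python
# Illustrative lookup built from the table above; not part of NeMo Evaluator.
# Keys are model tasks; values list the evaluation types suggested for
# academic benchmarks versus custom datasets.
EVALUATION_CHOICES = {
    "general model generation": {
        "academic benchmarks": ["lm-eval-harness (gsm8k, ifeval, truthfulqa)"],
        "custom datasets": [
            "custom (LLM-as-a-Judge, BLEU, ROUGE, F1, string check)",
            "agentic with LLM-as-a-Judge (topic adherence, agent goal accuracy)",
        ],
    },
    "code": {
        "academic benchmarks": ["bigcode (humaneval, mbpp)"],
        "custom datasets": [],
    },
    "function calling": {
        "academic benchmarks": ["bfcl (simple, multiple)"],
        "custom datasets": [
            "bfcl (simple, multiple)",
            "agentic (tool call accuracy)",
            "custom (tool call accuracy)",
        ],
    },
    "information retrieval & answer generation": {
        "academic benchmarks": ["retriever (BEIR, FiQA)", "rag (SQuAD, BEIR, RAGAS)"],
        "custom datasets": ["retriever (BEIR)", "rag (SQuAD, BEIR, RAGAS)"],
    },
}


def suggest_evaluation_types(task: str, data_source: str) -> list[str]:
    """Return the evaluation types the table suggests for a task and data source."""
    return EVALUATION_CHOICES[task.lower()][data_source.lower()]


print(suggest_evaluation_types("Function Calling", "custom datasets"))
# ['bfcl (simple, multiple)', 'agentic (tool call accuracy)', 'custom (tool call accuracy)']
```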


Options#

Each of the following references includes an evaluation configuration, data format, and example results for the typical options provided by that evaluation type.

- Agentic: Assess agent-based and multi-step reasoning models, including topic adherence and tool use. See Agentic Evaluation Types.
- BFCL: Evaluate tool-calling capabilities with the Berkeley Function Calling Leaderboard or your own dataset. See BFCL Evaluation Type.
- BigCode: Evaluate code generation models using the BigCode Evaluation Harness. See BigCode Evaluation Type.
- Custom: Flexible evaluation for custom tasks, metrics, and datasets. See Custom Evaluation Types.
- LM Harness: Run academic benchmarks for general language understanding and reasoning. See LM Harness Evaluation Type.
- RAG: Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation). See RAG Evaluation Type.
- Retriever: Evaluate document retrieval pipelines on standard or custom datasets. See Retriever Evaluation Type.

Using Custom Data#

You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as JSON, JSONL, or CSV. If a file is missing required fields or is otherwise invalid, an error is raised during validation. All string fields must be non-null; optional fields (such as category or source) may be empty strings.
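
As a rough local check before uploading, the sketch below writes a small JSONL dataset and applies the same kind of validation. The field names (prompt, ideal_response, category, source) are illustrative assumptions; the fields your dataset actually requires depend on the evaluation type, as described in the per-type guides above.

```python
import json

# Hypothetical records for an LLM-as-a-Judge style custom dataset.
# Field names are illustrative; required fields vary by evaluation type.
records = [
    {
        "prompt": "What is the capital of France?",
        "ideal_response": "Paris",
        "category": "geography",
        "source": "",  # optional fields may be empty strings
    },
    {
        "prompt": "Summarize the water cycle in one sentence.",
        "ideal_response": "Water evaporates, condenses into clouds, and returns as precipitation.",
        "category": "",
        "source": "",
    },
]

REQUIRED_FIELDS = ["prompt", "ideal_response"]  # assumed required fields
OPTIONAL_FIELDS = ["category", "source"]


def validate(record: dict) -> None:
    """Mirror the service-side checks: required fields present, all strings non-null."""
    for field in REQUIRED_FIELDS:
        if field not in record:
            raise ValueError(f"missing required field: {field}")
    for field in REQUIRED_FIELDS + OPTIONAL_FIELDS:
        value = record.get(field)
        if value is None or not isinstance(value, str):
            raise ValueError(f"field must be a non-null string: {field}")


with open("custom_eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        validate(record)
        f.write(json.dumps(record) + "\n")
```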

The following evaluation types support custom datasets:

| Evaluation Use Case | Evaluation Type Value |
|---|---|
| BFCL | bfcl |
| LLM-as-a-Judge | custom |
| Retriever | retriever |
| RAG | rag |
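
To show where these values fit, the sketch below builds an evaluation configuration that selects the custom type and points at the JSONL file from the earlier example. The surrounding field names (tasks, dataset, files_url, metrics) are assumptions for illustration only; see the Custom Evaluation Types guide for the exact configuration schema.

```python
import json

# Illustrative only: field names other than "type" are assumptions; consult the
# Custom Evaluation Types guide for the exact configuration schema.
evaluation_config = {
    "type": "custom",  # evaluation type value from the table above
    "tasks": {
        "my-qa-task": {
            "dataset": {
                # Hypothetical reference to the JSONL file created earlier.
                "files_url": "file://custom_eval_dataset.jsonl",
            },
            # Similarity metrics of the kind listed for the custom evaluation type.
            "metrics": ["bleu", "rouge", "f1", "string-check"],
        },
    },
}

print(json.dumps(evaluation_config, indent=2))
```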