Evaluation Types#

Learn about the supported evaluation types and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.

Options#

Each of the following references includes an evaluation configuration, data format, and result example for the typical options provided by that evaluation type.

- Agentic: Assess agent-based and multi-step reasoning models, including topic adherence and tool use. See Agentic Evaluation Types.
- BFCL: Evaluate tool-calling capabilities with the Berkeley Function Calling Leaderboard or your own dataset. See BFCL Evaluation Type.
- BigCode: Evaluate code generation models using the BigCode Evaluation Harness. See BigCode Evaluation Type.
- Custom: Flexible evaluation for custom tasks, metrics, and datasets. See Custom Evaluation Types.
- LM Harness: Run academic benchmarks for general language understanding and reasoning. See LM Harness Evaluation Type.
- RAG: Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation). See RAG Evaluation Type.
- Retriever: Evaluate document retrieval pipelines on standard or custom datasets. See Retriever Evaluation Type.
- Similarity Metrics: Compare model outputs to ground truth using BLEU, ROUGE, and other metrics. See Similarity Metrics Evaluation Type.

Using Custom Data#

You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as JSON, JSONL, or CSV. If a file is missing required fields or is otherwise invalid, an error is raised during validation. All string fields must be non-null; optional fields (such as category or source) may be empty strings.
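
To catch format problems before you submit a job, you can sanity-check a dataset file locally. The following is a minimal sketch, not part of NeMo Evaluator: the file name `my_dataset.jsonl` and the field names `question` and `answer` are hypothetical placeholders, and the fields actually required depend on the evaluation type (see its reference page). It only mirrors the general rules above: required fields must be present, non-null, non-empty strings, while optional fields such as `category` or `source` may be empty.

```python
import csv
import json
from pathlib import Path

# Hypothetical field names used only for this example; the fields that are
# actually required depend on the evaluation type (see its reference page).
REQUIRED_FIELDS = ["question", "answer"]
OPTIONAL_FIELDS = ["category", "source"]


def load_records(path: Path) -> list[dict]:
    """Load a custom dataset stored as .json, .jsonl, or .csv."""
    if path.suffix == ".json":
        return json.loads(path.read_text())
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported dataset format: {path.suffix}")


def check_records(records: list[dict]) -> None:
    """Apply the rules described above: required fields must be present,
    non-null, non-empty strings; optional fields may be empty strings."""
    for i, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value:
                raise ValueError(f"record {i}: required field '{field}' is missing, null, or empty")
        for field in OPTIONAL_FIELDS:
            if field in record and not isinstance(record[field], str):
                raise ValueError(f"record {i}: optional field '{field}' must be a non-null string")


records = load_records(Path("my_dataset.jsonl"))
check_records(records)
print(f"{len(records)} records passed basic validation")
```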

The following evaluation types support custom datasets:

| Evaluation Use Case | Evaluation Type Value |
|---------------------|-----------------------|
| BFCL                | bfcl                  |
| Similarity Metrics  | similarity_metrics    |
| LLM-as-a-Judge      | custom                |
| Retriever           | retriever             |
| RAG                 | rag                   |
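
The evaluation type value from the table is what you set as the type in your evaluation configuration. As a rough illustration only: the sketch below shows the general shape of such a configuration built in Python, but every key other than "type" is a hypothetical placeholder rather than the authoritative NeMo Evaluator schema; see each evaluation type's reference page for its exact configuration, data format, and result schema.

```python
import json

# Illustrative configuration shape only. "type" takes a value from the table
# above; the remaining keys are hypothetical placeholders, not the
# authoritative NeMo Evaluator schema.
evaluation_config = {
    "type": "similarity_metrics",          # evaluation type value from the table
    "params": {
        "input_file": "my_dataset.jsonl",  # custom dataset (JSON, JSONL, or CSV)
        "metrics": ["bleu", "rouge"],      # placeholder metric names
    },
}

print(json.dumps(evaluation_config, indent=2))
```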