Evaluation Types#
Learn about the supported evaluation types and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.
Prerequisites#
- Set up or select an existing evaluation target.
- If you are using custom data, upload your custom datasets to NeMo Data Store first (see the upload sketch after this list).
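The following is a minimal upload sketch, assuming your NeMo Data Store deployment exposes its Hugging Face Hub-compatible endpoint; the host URL, token, repository name, and file name are placeholders for your own deployment and dataset.

```python
from huggingface_hub import HfApi

# Placeholder endpoint and token for your NeMo Data Store deployment.
api = HfApi(endpoint="http://<datastore-host>/v1/hf", token="<token>")

# Create a dataset repository (no-op if it already exists).
api.create_repo(repo_id="my-org/my-eval-dataset", repo_type="dataset", exist_ok=True)

# Upload a local dataset file into the repository.
api.upload_file(
    path_or_fileobj="custom_eval.jsonl",
    path_in_repo="custom_eval.jsonl",
    repo_id="my-org/my-eval-dataset",
    repo_type="dataset",
)
```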
Choosing an Evaluation Type#
The evaluation type to select for your job depends on what you are evaluating. Some evaluation types evaluate your model against academic benchmarks, while others let you bring your own proprietary data as custom datasets. The table below outlines which type to select based on the model capability you want to evaluate.
| What model task are you evaluating? | Evaluate with Academic Benchmarks | Evaluate with Custom Datasets |
|---|---|---|
| General Model Generation | LM Eval Harness (gsm8k, ifeval, truthfulqa, etc.) | Custom (LLM-as-a-Judge, similarity metrics such as BLEU, ROUGE, F1, string check, etc.) |
| Code | BigCode (humaneval, mbpp, etc.) | |
| Function Calling | BFCL (simple, multiple, etc.) | BFCL (simple, multiple, etc.) |
| Information Retrieval & Answer Generation | Retriever (BEIR, FiQA, etc.) | |
Options#
Each of the following references includes an evaluation configuration, data format, and result example for the typical options provided by that evaluation type.
- Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
- Evaluate tool-calling capabilities with the Berkeley Function Calling Leaderboard or your own dataset.
- Evaluate code generation models using the BigCode Evaluation Harness.
- Flexible evaluation for custom tasks, metrics, and datasets.
- Run academic benchmarks for general language understanding and reasoning.
- Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation).
- Evaluate document retrieval pipelines on standard or custom datasets.
Using Custom Data#
You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as `json`, `jsonl`, or `csv`. If a file is missing required fields or is not valid, an error is raised during validation. All string fields must be non-null, but optional fields (such as `category` or `source`) may be empty.
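As a minimal sketch, the snippet below writes a small `jsonl` dataset. The `prompt` and `ideal_response` field names are illustrative placeholders; use the fields required by the evaluation type you select (see the type-specific references above). The optional `category` and `source` fields may be empty strings, but no string field may be null.

```python
import json

# Illustrative records for a custom dataset. Field names other than the
# optional "category" and "source" are placeholders for this sketch.
records = [
    {
        "prompt": "What is the capital of France?",
        "ideal_response": "Paris",
        "category": "geography",
        "source": "",  # optional field: empty string is allowed, null is not
    },
    {
        "prompt": "Summarize the water cycle in one sentence.",
        "ideal_response": "Water evaporates, condenses into clouds, and returns as precipitation.",
        "category": "",
        "source": "internal-qa-set",
    },
]

# Write one JSON object per line to produce a jsonl file.
with open("custom_eval.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```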
The following evaluation types support custom datasets:
| Evaluation Use Case | Evaluation Type Value |
|---|---|
| BFCL | |
| LLM-as-a-Judge | |
| Retriever | |
| RAG | |
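Once uploaded, the custom dataset is referenced from your evaluation configuration. The following is a hypothetical sketch only: the endpoint path, type value, task name, and payload fields are assumptions, and the `files_url` scheme assumes the dataset was uploaded to NeMo Data Store as shown earlier. Consult the type-specific references above for the exact schema and the valid evaluation type values.

```python
import requests

# Placeholder URL for your NeMo Evaluator deployment.
EVALUATOR_URL = "http://<evaluator-host>/v1/evaluation/configs"

config = {
    "type": "custom",          # assumed type value for a custom evaluation
    "namespace": "my-org",     # placeholder namespace
    "tasks": {
        "my-task": {
            "dataset": {
                # Assumed URL scheme for a dataset stored in NeMo Data Store.
                "files_url": "hf://datasets/my-org/my-eval-dataset/custom_eval.jsonl"
            }
        }
    },
}

# Submit the configuration and surface any validation errors.
response = requests.post(EVALUATOR_URL, json=config)
response.raise_for_status()
print(response.json())
```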