Evaluation Types
Learn about the supported evaluation types and their required configurations, data formats, and result schemas. Use these guides to select, configure, and run the right evaluation for your models and pipelines.
Prerequisites

- Set up or select an existing evaluation target.
- If you use custom data, upload your datasets to NeMo Data Store (see the sketch after this list).
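As a rough sketch of the upload step, the example below assumes your NeMo Data Store deployment exposes its Hugging Face-compatible API at the endpoint shown; the endpoint URL, token, namespace, and dataset name are placeholders, so substitute the values for your deployment.

```python
# Minimal sketch of uploading a custom dataset to NeMo Data Store, assuming the
# Data Store's Hugging Face-compatible endpoint is reachable at the URL below.
# The endpoint, token, namespace, and dataset name are placeholders.
from huggingface_hub import HfApi

DATASTORE_ENDPOINT = "http://nemo-datastore.example.com/v1/hf"  # assumed endpoint
REPO_ID = "my-namespace/my-custom-dataset"                      # hypothetical dataset repo

api = HfApi(endpoint=DATASTORE_ENDPOINT, token="<datastore-api-token>")  # auth depends on your deployment

# Create the dataset repository if it does not already exist.
api.create_repo(repo_id=REPO_ID, repo_type="dataset", exist_ok=True)

# Upload a local JSONL file into the dataset repository.
api.upload_file(
    path_or_fileobj="my_dataset.jsonl",
    path_in_repo="my_dataset.jsonl",
    repo_id=REPO_ID,
    repo_type="dataset",
)
```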
Options
Each of the following references includes an evaluation configuration, data format, and example results for the typical options provided by that evaluation type.
- Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
- Evaluate tool-calling capabilities with the Berkeley Function Calling Leaderboard or your own dataset.
- Evaluate code generation models using the BigCode Evaluation Harness.
- Run flexible evaluations with custom tasks, metrics, and datasets.
- Run academic benchmarks for general language understanding and reasoning.
- Evaluate Retrieval Augmented Generation (RAG) pipelines (retrieval plus generation).
- Evaluate document retrieval pipelines on standard or custom datasets.
- Compare model outputs to ground truth using BLEU, ROUGE, and other metrics.
Using Custom Data
You can use your own custom datasets with NVIDIA NeMo Evaluator. NeMo Evaluator supports custom datasets formatted as `json`, `jsonl`, or `csv`. If a file is missing required fields or is not valid, an error is raised during validation. All string fields must be non-null, but optional fields (such as `category` or `source`) may be empty.
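For example, a `jsonl` dataset contains one JSON object per line. The sketch below writes a tiny dataset of this shape; the `question` and `answer` field names are illustrative only, since the required fields depend on the evaluation type you run, while `category` and `source` are the optional fields mentioned above.

```python
# Minimal sketch of writing a custom dataset as jsonl. The "question" and
# "answer" field names are illustrative; the required fields depend on the
# evaluation type. "category" and "source" are optional and may be empty
# strings, but no string field may be null.
import json

records = [
    {"question": "What is the capital of France?", "answer": "Paris", "category": "", "source": ""},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare", "category": "literature", "source": ""},
]

with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```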
The following evaluation types support custom datasets:
| Evaluation Use Case | Evaluation Type Value |
|---|---|
| BFCL | |
| Similarity Metrics | |
| LLM-as-a-Judge | |
| Retriever | |
| RAG | |
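Once a dataset is in the Data Store, you reference it from the evaluation configuration of one of the types above. The following is a rough sketch only: the endpoint path, the `type` value, and the config field names (such as `tasks` and `files_url`) are assumptions for illustration, so consult the reference page for your evaluation type for the exact schema.

```python
# Rough sketch of creating an evaluation config that points at a custom dataset.
# The endpoint path, "type" value, and field names below are assumptions for
# illustration; check the reference for your evaluation type for the real schema.
import requests

EVALUATOR_URL = "http://nemo-evaluator.example.com"  # assumed service URL

config = {
    "type": "similarity_metrics",  # assumed evaluation type value
    "tasks": {
        "my-custom-task": {
            "dataset": {
                # Points at the dataset uploaded to NeMo Data Store.
                "files_url": "hf://datasets/my-namespace/my-custom-dataset/my_dataset.jsonl"
            }
        }
    },
}

response = requests.post(f"{EVALUATOR_URL}/v1/evaluation/configs", json=config, timeout=30)
response.raise_for_status()
print(response.json())
```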