Academic Benchmarks#
Academic benchmarks provide standardized evaluation methods for comparing model performance across different capabilities. They are widely used in the research community and yield reliable, reproducible metrics for model assessment.
Prerequisites#
- Set up or select an existing evaluation target.
- For some benchmarks, you may need HuggingFace tokens or API keys for external services (see the example below).
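If a benchmark needs credentials, it usually reads them from environment variables. The snippet below is a minimal sketch of checking for them up front; the variable names `HF_TOKEN` and `JUDGE_API_KEY` are placeholders used for illustration, so check each benchmark's documentation for the names it actually expects.

```python
import os

# Placeholder variable names -- substitute the ones your benchmark expects.
REQUIRED_VARS = ["HF_TOKEN", "JUDGE_API_KEY"]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
print("All required credentials are set.")
```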
Using Academic Benchmarks#
Academic benchmarks use standardized datasets and evaluation protocols:
- **Standard Datasets**: Most benchmarks include predefined datasets widely used in research.
- **Reproducible Metrics**: Metrics are calculated with established methodologies (see the example below).
- **Community Standards**: You can compare results across different models and research groups.
Note: Some benchmarks (such as BFCL) also support custom datasets in the same format.
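As a concrete illustration of a reproducible metric, the sketch below computes plain exact-match accuracy over predictions and reference answers. It is not the scoring code of any particular harness; real benchmarks typically add task-specific answer parsing and normalization.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer
    after simple whitespace and case normalization."""
    assert len(predictions) == len(references), "mismatched lengths"
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Toy example: two of three answers match, so accuracy is ~0.667.
print(exact_match_accuracy(["Paris", " 42 ", "blue"], ["paris", "42", "green"]))
```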
Choosing an Academic Benchmark#
Choose a benchmark based on the capability you want to test:
Benchmark | Primary Use Case | Key Metrics | Example Tasks |
---|---|---|---|
BigCode Evaluation Harness | Code generation and programming | pass@k, code correctness | HumanEval, MBPP, MBPP+ |
BFCL | Function/tool calling | tool-call accuracy, latency | Simple, parallel, multiple calls |
Language understanding benchmarks | General language understanding | accuracy, BLEU, F1, perplexity | GSM8K, MMLU, IFEval, BBH |
Safety benchmarks | Model safety and alignment | safety scores, harm detection | Nemotron Content Safety V2, WildGuard |
Quick academic benchmarks | Advanced reasoning, language understanding, math, alignment | exact match, math accuracy | GPQA, MMLU, Math Test 500, AIME |
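For the code-generation row, pass@k is usually reported with the unbiased estimator from the HumanEval paper. The sketch below is a minimal Python version, assuming you already have, per problem, the number of generated samples n and the number c that passed the tests; the example numbers are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total samples generated for the problem
    c: samples that passed all unit tests
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical results: (n, c) per problem, 20 samples generated for each.
per_problem = [(20, 3), (20, 0), (20, 12)]
k = 5
score = sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
print(f"pass@{k} = {score:.3f}")
```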
Options#
Learn about configuration examples and dataset formats for each of the following academic benchmark evaluations.
- Run code generation benchmarks using the BigCode Evaluation Harness.
- Assess tool-calling capabilities with BFCL.
- Run academic benchmarks for general language understanding and reasoning.
- Assess model safety on standard datasets.
- Run quick academic benchmarks (GPQA, MMLU, Math Test 500, AIME) with a simple chat-based configuration (see the sketch below).
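To show what a simple chat-based configuration boils down to, here is a rough sketch of querying an OpenAI-compatible chat endpoint with one question. The endpoint URL, model name, and environment variables are placeholder assumptions, not part of any specific harness, and the real benchmark tooling handles prompting, sampling, and scoring for you.

```python
import os
import requests

# Placeholder endpoint and model name -- point these at your evaluation target.
ENDPOINT = os.environ.get("TARGET_URL", "http://localhost:8000/v1/chat/completions")
MODEL = os.environ.get("TARGET_MODEL", "my-model")

def ask(question: str) -> str:
    """Send one chat-style question and return the model's text answer."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    }
    response = requests.post(ENDPOINT, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("What is 7 * 8? Answer with the number only."))
```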