Benchmark Catalog#
Comprehensive catalog of 100+ benchmarks across 18 evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform.
Overview#
NeMo Evaluator provides access to benchmarks across multiple domains through pre-built NGC containers and the unified launcher CLI. Each container specializes in a different evaluation domain while exposing a consistent interface and producing reproducible results.
Available via Launcher#
# List all available benchmarks
nv-eval ls tasks
# Output as JSON for programmatic filtering
nv-eval ls tasks --json
# Filter for specific task types (example: academic benchmarks)
nv-eval ls tasks | grep -E "(mmlu|gsm8k|arc)"
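If you want to script against the listing instead of grepping it, the JSON output can be saved and filtered with jq. The sketch below is illustrative only: it assumes jq is installed and that task names appear as string values somewhere in the JSON, since the exact schema can differ between releases.
# Save the machine-readable listing, then pull out academic-looking task names
# (jq required; assumes task names appear as string values in the JSON)
nv-eval ls tasks --json > available_tasks.json
jq -r '.. | strings | select(test("mmlu|gsm8k|arc"))' available_tasks.json | sort -u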
Choosing Benchmarks for Academic Research#
Benchmark Selection Guide
For Language Understanding & General Knowledge: Recommended suite for comprehensive model evaluation:
mmlu_pro - Expert-level knowledge across 14 domains
arc_challenge - Complex reasoning and science questions
hellaswag - Commonsense reasoning about situations
truthfulqa - Factual accuracy vs. plausibility
nv-eval run \
--config-dir examples \
--config-name local_academic_suite \
-o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]'
For Mathematical & Quantitative Reasoning:
gsm8k - Grade school math word problems
math - Competition-level mathematics
mgsm - Multilingual math reasoning
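These math tasks can be grouped into a single launcher run in the same way as the language-understanding suite above. The command below is a sketch that reuses the local_academic_suite example config shown earlier; substitute whichever config matches your deployment.
# Run the math reasoning suite
nv-eval run \
  --config-dir examples \
  --config-name local_academic_suite \
  -o 'evaluation.tasks=["gsm8k", "math", "mgsm"]'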
For Instruction Following & Alignment:
ifeval - Precise instruction following
gpqa_diamond - Graduate-level science questions
mtbench - Multi-turn conversation quality
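Likewise, the instruction-following and alignment tasks can be run together. As before, this sketch borrows the local_academic_suite example config; adjust it to your own setup.
# Run the instruction-following suite
nv-eval run \
  --config-dir examples \
  --config-name local_academic_suite \
  -o 'evaluation.tasks=["ifeval", "gpqa_diamond", "mtbench"]'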
See benchmark details below for complete task descriptions and requirements.
Benchmark Categories#
Academic and Reasoning#
| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| simple-evals | MMLU Pro, GSM8K, ARC Challenge | Core academic benchmarks | |
| lm-evaluation-harness | MMLU, HellaSwag, TruthfulQA, PIQA | Language model evaluation suite | |
| hle | Humanity’s Last Exam | Multi-modal benchmark at the frontier of human knowledge | |
| ifbench | Instruction Following Benchmark | Precise instruction following evaluation | |
| mmath | Multilingual Mathematical Reasoning | Math reasoning across multiple languages | |
| mtbench | MT-Bench | Multi-turn conversation evaluation | |
Example Usage:
# Run academic benchmark suite
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]'
Python API Example:
# Evaluate multiple academic benchmarks with the Python API.
# Assumes EvaluationConfig, ConfigParams, and evaluate are imported from the
# NeMo Evaluator Python package, and that target_config is an evaluation
# target pointing at your model endpoint (see the Python API documentation).
academic_tasks = ["mmlu_pro", "gsm8k", "arc_challenge"]
for task in academic_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}/",
        params=ConfigParams(temperature=0.01, parallelism=4),
    )
    result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
Code Generation#
| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| bigcode-evaluation-harness | HumanEval, MBPP, APPS | Code generation and completion | |
| livecodebench | Live coding contests from LeetCode, AtCoder, CodeForces | Contamination-free coding evaluation | |
| scicode | Scientific research code generation | Scientific computing and research | |
Example Usage:
# Run code generation evaluation
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o 'evaluation.tasks=["humaneval", "mbpp"]'
Safety and Security#
| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| safety-harness | Toxicity, bias, alignment tests | Safety and bias evaluation | |
| garak | Prompt injection, jailbreaking | Security vulnerability scanning | |
Example Usage:
# Run comprehensive safety evaluation
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o 'evaluation.tasks=["aegis_v2", "garak"]'
Function Calling and Agentic AI#
Vision-Language Models#
| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| vlmevalkit | VQA, image captioning, visual reasoning | Vision-language model evaluation | |
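Vision-language task names vary by vlmevalkit release, so it is worth discovering them through the launcher first. The grep pattern below is only a heuristic for spotting vision-related tasks; adjust it as needed.
# Heuristic filter for vision-language tasks in the unified listing
nv-eval ls tasks | grep -iE "(vqa|visual|image)"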
Retrieval and RAG#
| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| rag_retriever_eval | Document retrieval, context relevance | RAG system evaluation | |
Domain-Specific#
| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| helm | Medical AI evaluation (MedHELM) | Healthcare-specific benchmarking | |
Container Details#
For detailed specifications of each container, see NeMo Evaluator Containers.
Quick Container Access#
Pull and run any evaluation container directly:
# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/simple-evals:25.08.1
# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1
# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/safety-harness:25.08.1
Available Tasks by Container#
For a complete list of available tasks in each container:
# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 eval-factory ls
# Or use the launcher for unified access
nv-eval ls tasks
Integration Patterns#
NeMo Evaluator provides multiple integration options to fit your workflow:
# Launcher CLI (recommended for most users)
nv-eval ls tasks
nv-eval run --config-dir examples --config-name local_mmlu_evaluation
# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 eval-factory ls
# Python API (for programmatic control)
# See the Python API documentation for details
Benchmark Selection Best Practices#
For Academic Publications#
Recommended Core Suite:
MMLU Pro or MMLU - Broad knowledge assessment
GSM8K - Mathematical reasoning
ARC Challenge - Scientific reasoning
HellaSwag - Commonsense reasoning
TruthfulQA - Factual accuracy
This suite provides comprehensive coverage across major evaluation dimensions.
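A single launcher run covering the full core suite might look like the following; the config name reuses the local_academic_suite example from earlier on this page and should be replaced with your own configuration.
# Run the recommended core suite in one pass
nv-eval run \
  --config-dir examples \
  --config-name local_academic_suite \
  -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge", "hellaswag", "truthfulqa"]'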
For Model Development#
Iterative Testing:
Start with limit_samples=100 for quick feedback during development
Run full evaluations before major releases
Track metrics over time to measure improvement
Configuration:
# Development testing
params = ConfigParams(
    limit_samples=100,   # Quick iteration
    temperature=0.01,    # Deterministic
    parallelism=4
)

# Production evaluation
params = ConfigParams(
    limit_samples=None,  # Full dataset
    temperature=0.01,    # Deterministic
    parallelism=8        # Higher throughput
)
For Specialized Domains#
Code Models: Focus on humaneval, mbpp, livecodebench
Instruction Models: Emphasize ifeval, mtbench, gpqa_diamond
Multilingual Models: Include arc_multilingual, hellaswag_multilingual, mgsm (see the example run below)
Safety-Critical: Prioritize safety-harness and garak evaluations
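For example, a multilingual-focused run could combine the tasks listed above. This is a sketch that again borrows the local_academic_suite example config; swap in the config for your endpoint.
# Multilingual evaluation suite
nv-eval run \
  --config-dir examples \
  --config-name local_academic_suite \
  -o 'evaluation.tasks=["arc_multilingual", "hellaswag_multilingual", "mgsm"]'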
Next Steps#
Quick Start: See About Evaluation for the fastest path to your first evaluation
Task-Specific Guides: Explore Run Evaluations for detailed evaluation workflows
Configuration: Review Evaluation Configuration Parameters for optimizing evaluation settings
Container Details: Browse NeMo Evaluator Containers for complete specifications
Custom Benchmarks: Learn about the Framework Definition File (FDF) for custom evaluations