Benchmark Catalog#

A comprehensive catalog of 100+ benchmarks across 18 evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform.

Overview#

NeMo Evaluator provides access to benchmarks across multiple domains through pre-built NGC containers and the unified launcher CLI. Each container specializes in different evaluation domains while maintaining consistent interfaces and reproducible results.

Available via Launcher#

# List all available benchmarks
nv-eval ls tasks

# Output as JSON for programmatic filtering
nv-eval ls tasks --json

# Filter for specific task types (example: academic benchmarks)
nv-eval ls tasks | grep -E "(mmlu|gsm8k|arc)"

Choosing Benchmarks for Academic Research#

Benchmark Selection Guide

For Language Understanding & General Knowledge, the recommended suite for comprehensive model evaluation is:

  • mmlu_pro - Expert-level knowledge across 14 domains

  • arc_challenge - Complex reasoning and science questions

  • hellaswag - Commonsense reasoning about situations

  • truthfulqa - Factual accuracy vs. plausibility

nv-eval run \
    --config-dir examples \
    --config-name local_academic_suite \
    -o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]'

For Mathematical & Quantitative Reasoning (see the example run after this list):

  • gsm8k - Grade school math word problems

  • math - Competition-level mathematics

  • mgsm - Multilingual math reasoning
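
A minimal launcher sketch for the math suite, reusing the local_academic_suite config name from the example above; swap in your own config and confirm exact task names with nv-eval ls tasks:

# Run the math-focused suite
nv-eval run \
    --config-dir examples \
    --config-name local_academic_suite \
    -o 'evaluation.tasks=["gsm8k", "math", "mgsm"]'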

For Instruction Following & Alignment (see the example run after this list):

  • ifeval - Precise instruction following

  • gpqa_diamond - Graduate-level science questions

  • mtbench - Multi-turn conversation quality
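
The same override pattern applies to this suite. Note that mtbench typically requires a judge model to score multi-turn responses, so check the task's requirements before running. A hedged sketch, again reusing the config name from above:

# Run the instruction-following suite (mtbench may need a judge model configured)
nv-eval run \
    --config-dir examples \
    --config-name local_academic_suite \
    -o 'evaluation.tasks=["ifeval", "gpqa_diamond", "mtbench"]'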

See benchmark details below for complete task descriptions and requirements.

Benchmark Categories#

Academic and Reasoning#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| simple-evals | MMLU Pro, GSM8K, ARC Challenge | Core academic benchmarks | Link |
| lm-evaluation-harness | MMLU, HellaSwag, TruthfulQA, PIQA | Language model evaluation suite | Link |
| hle | Humanity’s Last Exam | Multi-modal benchmark at the frontier of human knowledge | Link |
| ifbench | Instruction Following Benchmark | Precise instruction following evaluation | Link |
| mmath | Multilingual Mathematical Reasoning | Math reasoning across multiple languages | Link |
| mtbench | MT-Bench | Multi-turn conversation evaluation | Link |

Example Usage:

# Run academic benchmark suite
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]'

Python API Example:

# Evaluate multiple academic benchmarks with the Python API
# (assumes evaluate, EvaluationConfig, ConfigParams, and a target_config for your
#  endpoint are imported/defined as described in the Python API documentation)
academic_tasks = ["mmlu_pro", "gsm8k", "arc_challenge"]
for task in academic_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}/",
        params=ConfigParams(temperature=0.01, parallelism=4),
    )
    result = evaluate(eval_cfg=eval_config, target_cfg=target_config)

Code Generation#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| bigcode-evaluation-harness | HumanEval, MBPP, APPS | Code generation and completion | Link |
| livecodebench | Live coding contests from LeetCode, AtCoder, CodeForces | Contamination-free coding evaluation | Link |
| scicode | Scientific research code generation | Scientific computing and research | Link |

Example Usage:

# Run code generation evaluation
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["humaneval", "mbpp"]'

Safety and Security#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| safety-harness | Toxicity, bias, alignment tests | Safety and bias evaluation | Link |
| garak | Prompt injection, jailbreaking | Security vulnerability scanning | Link |

Example Usage:

# Run comprehensive safety evaluation
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["aegis_v2", "garak"]'

Function Calling and Agentic AI#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| bfcl | Berkeley Function Calling Leaderboard | Function calling evaluation | Link |
| agentic_eval | Tool usage, planning tasks | Agentic AI evaluation | Link |
| tooltalk | Tool interaction evaluation | Tool usage assessment | Link |
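
Example Usage:

Task names exposed by these containers vary, so confirm them with nv-eval ls tasks first; the task name below is illustrative rather than definitive:

# Run a function-calling evaluation (verify the exact task name with nv-eval ls tasks)
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["bfcl"]'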

Vision-Language Models#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| vlmevalkit | VQA, image captioning, visual reasoning | Vision-language model evaluation | Link |

Retrieval and RAG#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| rag_retriever_eval | Document retrieval, context relevance | RAG system evaluation | Link |

Domain-Specific#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| helm | Medical AI evaluation (MedHELM) | Healthcare-specific benchmarking | Link |

Container Details#

For detailed specifications of each container, see NeMo Evaluator Containers.

Quick Container Access#

Pull and run any evaluation container directly:

# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/simple-evals:25.08.1

# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1

# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/safety-harness:25.08.1
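
To keep evaluation outputs after a container exits, mount a local directory. The container-side path below is an assumption; point it at wherever your configuration writes results:

# Persist results to the host (/results inside the container is an assumed output path)
docker run --rm -it --gpus all \
    -v "$(pwd)/results:/results" \
    nvcr.io/nvidia/eval-factory/simple-evals:25.08.1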

Available Tasks by Container#

For a complete list of available tasks in each container:

# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 eval-factory ls

# Or use the launcher for unified access
nv-eval ls tasks

Integration Patterns#

NeMo Evaluator provides multiple integration options to fit your workflow:

# Launcher CLI (recommended for most users)
nv-eval ls tasks
nv-eval run --config-dir examples --config-name local_mmlu_evaluation

# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 eval-factory ls

# Python API (for programmatic control)
# See the Python API documentation for details

Benchmark Selection Best Practices#

For Academic Publications#

Recommended Core Suite:

  1. MMLU Pro or MMLU - Broad knowledge assessment

  2. GSM8K - Mathematical reasoning

  3. ARC Challenge - Scientific reasoning

  4. HellaSwag - Commonsense reasoning

  5. TruthfulQA - Factual accuracy

This suite provides comprehensive coverage across major evaluation dimensions.
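
To run the full core suite in a single launcher invocation, the task override pattern shown earlier applies directly (config name reused from the academic example above; adjust to your setup):

# Run the recommended core suite
nv-eval run \
    --config-dir examples \
    --config-name local_academic_suite \
    -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge", "hellaswag", "truthfulqa"]'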

For Model Development#

Iterative Testing:

  • Start with limit_samples=100 for quick feedback during development

  • Run full evaluations before major releases

  • Track metrics over time to measure improvement

Configuration:

# Development testing
params = ConfigParams(
    limit_samples=100,      # Quick iteration
    temperature=0.01,       # Deterministic
    parallelism=4
)

# Production evaluation
params = ConfigParams(
    limit_samples=None,     # Full dataset
    temperature=0.01,       # Deterministic
    parallelism=8          # Higher throughput
)

For Specialized Domains#

  • Code Models: Focus on humaneval, mbpp, livecodebench (see the sketch after this list)

  • Instruction Models: Emphasize ifeval, mtbench, gpqa_diamond

  • Multilingual Models: Include arc_multilingual, hellaswag_multilingual, mgsm

  • Safety-Critical: Prioritize safety-harness and garak evaluations
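
Any of these focus areas can be selected with the same task override used throughout this page; the sketch below uses the code-model recommendation (verify task names with nv-eval ls tasks):

# Focus on code-generation tasks
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["humaneval", "mbpp", "livecodebench"]'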

Next Steps#