Benchmark Catalog#

A comprehensive catalog of 100+ benchmarks across 18 evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform.

Overview#

NeMo Evaluator provides access to benchmarks across multiple domains through pre-built NGC containers and the unified launcher CLI. Each container specializes in different evaluation domains while maintaining consistent interfaces and reproducible results.

Available via Launcher#

# List all available benchmarks
nv-eval ls tasks

# Output as JSON for programmatic filtering
nv-eval ls tasks --json

# Filter for specific task types (example: academic benchmarks)
nv-eval ls tasks | grep -E "(mmlu|gsm8k|arc)"

Choosing Benchmarks for Academic Research#

Benchmark Selection Guide

For Language Understanding & General Knowledge, the recommended suite for comprehensive model evaluation is:

  • mmlu_pro - Expert-level knowledge across 14 domains

  • arc_challenge - Complex reasoning and science questions

  • hellaswag - Commonsense reasoning about situations

  • truthfulqa - Factual accuracy vs. plausibility

nv-eval run \
    --config-dir examples \
    --config-name local_academic_suite \
    -o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]'

For Mathematical & Quantitative Reasoning (see the example run after this list):

  • gsm8k - Grade school math word problems

  • math - Competition-level mathematics

  • mgsm - Multilingual math reasoning
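
A minimal launcher sketch for the math suite, reusing the local_academic_suite config name from the example above; swap in your own config and confirm exact task names with nv-eval ls tasks:

# Run the math-focused suite
nv-eval run \
    --config-dir examples \
    --config-name local_academic_suite \
    -o 'evaluation.tasks=["gsm8k", "math", "mgsm"]'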

For Instruction Following & Alignment (see the example run after this list):

  • ifeval - Precise instruction following

  • gpqa_diamond - Graduate-level science questions

  • mtbench - Multi-turn conversation quality
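
The same override pattern applies to this suite. Note that mtbench typically requires a judge model to score multi-turn responses, so check the task's requirements before running. A hedged sketch, again reusing the config name from above:

# Run the instruction-following suite (mtbench may need a judge model configured)
nv-eval run \
    --config-dir examples \
    --config-name local_academic_suite \
    -o 'evaluation.tasks=["ifeval", "gpqa_diamond", "mtbench"]'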

See benchmark details below for complete task descriptions and requirements.

Benchmark Categories#

Academic and Reasoning#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| simple-evals | MMLU Pro, GSM8K, ARC Challenge | Core academic benchmarks | Link |
| lm-evaluation-harness | MMLU, HellaSwag, TruthfulQA, PIQA | Language model evaluation suite | Link |
| hle | Humanity’s Last Exam | Multi-modal benchmark at the frontier of human knowledge | Link |
| ifbench | Instruction Following Benchmark | Precise instruction following evaluation | Link |
| mmath | Multilingual Mathematical Reasoning | Math reasoning across multiple languages | Link |
| mtbench | MT-Bench | Multi-turn conversation evaluation | Link |

Example Usage:

# Run academic benchmark suite
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]'

Python API Example:

# Evaluate multiple academic benchmarks with the Python API
# (assumes evaluate, EvaluationConfig, ConfigParams, and a target_config for your
#  endpoint are imported/defined as described in the Python API documentation)
academic_tasks = ["mmlu_pro", "gsm8k", "arc_challenge"]
for task in academic_tasks:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}/",
        params=ConfigParams(temperature=0.01, parallelism=4),
    )
    result = evaluate(eval_cfg=eval_config, target_cfg=target_config)

Code Generation#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| bigcode-evaluation-harness | HumanEval, MBPP, APPS | Code generation and completion | Link |
| livecodebench | Live coding contests from LeetCode, AtCoder, CodeForces | Contamination-free coding evaluation | Link |
| scicode | Scientific research code generation | Scientific computing and research | Link |

Example Usage:

# Run code generation evaluation
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["humaneval", "mbpp"]'

Safety and Security#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| safety-harness | Toxicity, bias, alignment tests | Safety and bias evaluation | Link |
| garak | Prompt injection, jailbreaking | Security vulnerability scanning | Link |

Example Usage:

# Run comprehensive safety evaluation
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["aegis_v2", "garak"]'

Function Calling and Agentic AI#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| bfcl | Berkeley Function Calling Leaderboard | Function calling evaluation | Link |
| agentic_eval | Tool usage, planning tasks | Agentic AI evaluation | Link |
| tooltalk | Tool interaction evaluation | Tool usage assessment | Link |
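
Example Usage:

Task names exposed by these containers vary, so confirm them with nv-eval ls tasks first; the task name below is illustrative rather than definitive:

# Run a function-calling evaluation (verify the exact task name with nv-eval ls tasks)
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["bfcl"]'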

Vision-Language Models#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| vlmevalkit | VQA, image captioning, visual reasoning | Vision-language model evaluation | Link |

Retrieval and RAG#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| rag_retriever_eval | Document retrieval, context relevance | RAG system evaluation | Link |

Domain-Specific#

| Container | Benchmarks | Description | NGC Catalog |
|---|---|---|---|
| helm | Medical AI evaluation (MedHELM) | Healthcare-specific benchmarking | Link |

Container Details#

For detailed specifications of each container, see NeMo Evaluator Containers.

Quick Container Access#

Pull and run any evaluation container directly:

# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/simple-evals:25.08.1

# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1

# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:25.08.1
docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/safety-harness:25.08.1
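
To keep evaluation outputs after a container exits, mount a local directory. The container-side path below is an assumption; point it at wherever your configuration writes results:

# Persist results to the host (/results inside the container is an assumed output path)
docker run --rm -it --gpus all \
    -v "$(pwd)/results:/results" \
    nvcr.io/nvidia/eval-factory/simple-evals:25.08.1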

Available Tasks by Container#

For a complete list of available tasks in each container:

# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 eval-factory ls

# Or use the launcher for unified access
nv-eval ls tasks

Integration Patterns#

NeMo Evaluator provides multiple integration options to fit your workflow:

# Launcher CLI (recommended for most users)
nv-eval ls tasks
nv-eval run --config-dir examples --config-name local_mmlu_evaluation

# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 eval-factory ls

# Python API (for programmatic control)
# See the Python API documentation for details

Benchmark Selection Best Practices#

For Academic Publications#

Recommended Core Suite:

  1. MMLU Pro or MMLU - Broad knowledge assessment

  2. GSM8K - Mathematical reasoning

  3. ARC Challenge - Scientific reasoning

  4. HellaSwag - Commonsense reasoning

  5. TruthfulQA - Factual accuracy

This suite provides comprehensive coverage across major evaluation dimensions.
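
To run the full core suite in a single launcher invocation, the task override pattern shown earlier applies directly (config name reused from the academic example above; adjust to your setup):

# Run the recommended core suite
nv-eval run \
    --config-dir examples \
    --config-name local_academic_suite \
    -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge", "hellaswag", "truthfulqa"]'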

For Model Development#

Iterative Testing:

  • Start with limit_samples=100 for quick feedback during development

  • Run full evaluations before major releases

  • Track metrics over time to measure improvement

Configuration:

# Development testing
params = ConfigParams(
    limit_samples=100,      # Quick iteration
    temperature=0.01,       # Deterministic
    parallelism=4
)

# Production evaluation
params = ConfigParams(
    limit_samples=None,     # Full dataset
    temperature=0.01,       # Deterministic
    parallelism=8          # Higher throughput
)

For Specialized Domains#

  • Code Models: Focus on humaneval, mbpp, livecodebench (see the sketch after this list)

  • Instruction Models: Emphasize ifeval, mtbench, gpqa_diamond

  • Multilingual Models: Include arc_multilingual, hellaswag_multilingual, mgsm

  • Safety-Critical: Prioritize safety-harness and garak evaluations
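
Any of these focus areas can be selected with the same task override used throughout this page; the sketch below uses the code-model recommendation (verify task names with nv-eval ls tasks):

# Focus on code-generation tasks
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o 'evaluation.tasks=["humaneval", "mbpp", "livecodebench"]'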

Next Steps#