About Selecting Benchmarks#

NeMo Evaluator provides a comprehensive suite of benchmarks spanning academic reasoning, code generation, safety testing, and domain-specific evaluations. Whether you’re validating a new model’s capabilities or conducting rigorous academic research, you’ll find benchmarks suited to assessing your AI system’s performance. See Available Benchmarks for the complete catalog.

Available via Launcher#

# List all available benchmarks
nemo-evaluator-launcher ls

# Output as JSON for programmatic filtering
nemo-evaluator-launcher ls --json

# Filter for specific task types (example: mmlu, gsm8k, and arc_challenge)
nemo-evaluator-launcher ls | grep -E "(mmlu|gsm8k|arc)"
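
The same filtering can be done programmatically once you have captured the listing. A minimal sketch in Python (the task names below are illustrative stand-ins for real launcher output, whose exact format may vary by version):

```python
import re

# Stand-in for captured `nemo-evaluator-launcher ls` output;
# the real listing format may differ by launcher version.
listing = [
    "mmlu_pro",
    "gsm8k_cot_instruct",
    "arc_challenge",
    "humaneval_instruct",
    "ifeval",
]

# Same pattern as the grep example above
pattern = re.compile(r"(mmlu|gsm8k|arc)")
matches = [task for task in listing if pattern.search(task)]
print(matches)  # → ['mmlu_pro', 'gsm8k_cot_instruct', 'arc_challenge']
```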

Available via Direct Container Access#

# List benchmarks available in the container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

Choosing Benchmarks for Academic Research#

Benchmark Selection Guide

For General Knowledge:

  • mmlu_pro - Expert-level knowledge across 14 domains

  • gpqa_diamond - Graduate-level science questions

For Mathematical & Quantitative Reasoning:

  • AIME_2025 - American Invitational Mathematics Examination (AIME) 2025 questions

  • mgsm - Multilingual math reasoning

For Instruction Following & Alignment:

  • ifbench - Precise instruction following

  • mtbench - Multi-turn conversation quality

See benchmark categories below and Available Benchmarks for more details.

Benchmark Categories#

Academic and Reasoning#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| simple-evals | Common evaluation tasks | Link | GPQA-D, MATH-500, AIME 24 & 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench |
| lm-evaluation-harness | Language model benchmarks | Link | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |
| hle | Academic knowledge and problem solving | Link | HLE |
| ifbench | Instruction following | Link | IFBench |
| mtbench | Multi-turn conversation evaluation | Link | MT-Bench |
| nemo-skills | Language model benchmarks (science, math, agentic) | Link | AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro |
| profbench | Professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA | Link | Report Generation, LLM Judge |

Note

BFCL tasks from the nemo-skills container require function calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ifeval
    - name: gsm8k_cot_instruct
    - name: gpqa_diamond

Run evaluation:

export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Code Generation#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| bigcode-evaluation-harness | Code generation evaluation | Link | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |
| livecodebench | Coding | Link | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |
| scicode | Coding for scientific research | Link | SciCode |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: humaneval_instruct
    - name: mbpp

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Safety and Security#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| garak | Safety and vulnerability testing | Link | Garak |
| safety-harness | Safety and bias evaluation | Link | Aegis v2, WildGuard |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: aegis_v2
    - name: garak

Run evaluation:

export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Function Calling#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| bfcl | Function calling | Link | BFCL v2 and v3 |
| tooltalk | Tool usage evaluation | Link | ToolTalk |

Note

Some tasks in this category require function calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: bfclv2_ast_prompting
    - name: tooltalk

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Vision-Language Models#

Container

Description

NGC Catalog

Benchmarks

vlmevalkit

Vision-language model evaluation

Link

AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA

Note

The tasks in this category require a VLM chat endpoint. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ocrbench
    - name: chartqa

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Domain-Specific#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| helm | Holistic evaluation framework | Link | MedHelm |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: pubmed_qa
    - name: medcalc_bench

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Container Details#

For detailed specifications of each container, see NeMo Evaluator Containers.

Quick Container Access#

Pull and run any evaluation container directly:

# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:25.10

# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10

# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:25.10

Available Tasks by Container#

For a complete list of available tasks in each container:

# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

# Or use the launcher for unified access
nemo-evaluator-launcher ls tasks

Integration Patterns#

NeMo Evaluator provides multiple integration options to fit your workflow:

# Launcher CLI (recommended for most users)
nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher run --config ./local_mmlu_evaluation.yaml

# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

# Python API (for programmatic control)
# See the Python API documentation for details

Benchmark Selection Best Practices#

For Model Development#

Iterative Testing:

  • Start with limit_samples=100 for quick feedback during development

  • Run full evaluations before major releases

  • Track metrics over time to measure improvement

Configuration:

# ConfigParams is part of the nemo-evaluator Python API;
# see the Python API documentation for the exact import path.

# Development testing
params = ConfigParams(
    limit_samples=100,      # Quick iteration on a subset
    temperature=0.01,       # Near-deterministic sampling
    parallelism=4,          # Modest concurrency
)

# Production evaluation
params = ConfigParams(
    limit_samples=None,     # Full dataset
    temperature=0.01,       # Near-deterministic sampling
    parallelism=8,          # Higher throughput
)
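
To track metrics over time, one lightweight approach is to append each run’s scores to a JSON-lines log and compare consecutive runs. A minimal sketch, independent of NeMo Evaluator itself (the file name, run IDs, and score values are illustrative):

```python
import json
from pathlib import Path

LOG = Path("eval_history.jsonl")  # illustrative file name
LOG.unlink(missing_ok=True)       # start fresh for this example

def record_run(run_id: str, scores: dict) -> None:
    """Append one evaluation run's scores to the history log."""
    with LOG.open("a") as f:
        f.write(json.dumps({"run_id": run_id, "scores": scores}) + "\n")

def diff_latest(benchmark: str):
    """Return the score change on `benchmark` between the last two runs."""
    runs = [json.loads(line) for line in LOG.read_text().splitlines()]
    if len(runs) < 2:
        return None
    return runs[-1]["scores"][benchmark] - runs[-2]["scores"][benchmark]

record_run("ckpt-1000", {"mmlu_pro": 0.41, "gsm8k": 0.62})
record_run("ckpt-2000", {"mmlu_pro": 0.44, "gsm8k": 0.66})
print(diff_latest("mmlu_pro"))  # positive value means improvement
```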

For Specialized Domains#

  • Code Models: Focus on humaneval, mbpp, livecodebench

  • Instruction Models: Emphasize ifbench, mtbench

  • Multilingual Models: Include arc_multilingual, hellaswag_multilingual, mgsm

  • Safety-Critical: Prioritize safety-harness and garak evaluations
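
The domain-to-benchmark mapping above can also be expressed as a small helper that renders the `evaluation.tasks` block of a launcher `config.yml`. A sketch (the mapping mirrors the bullets above; verify task names against `nemo-evaluator-launcher ls` before use):

```python
# Suggested benchmark tasks by model specialization
# (mirrors the bullets above; verify names with `nemo-evaluator-launcher ls`).
SUGGESTED_TASKS = {
    "code": ["humaneval", "mbpp", "livecodebench"],
    "instruction": ["ifbench", "mtbench"],
    "multilingual": ["arc_multilingual", "hellaswag_multilingual", "mgsm"],
}

def tasks_yaml(model_type: str) -> str:
    """Render the evaluation.tasks section of a launcher config.yml."""
    lines = ["evaluation:", "  tasks:"]
    lines += [f"    - name: {task}" for task in SUGGESTED_TASKS[model_type]]
    return "\n".join(lines)

print(tasks_yaml("instruction"))
```

Paste the printed block into your `config.yml` alongside the `defaults` section shown in the examples above.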

Next Steps#