About Selecting Benchmarks#

NeMo Evaluator provides a comprehensive suite of benchmarks spanning academic reasoning, code generation, safety testing, and domain-specific evaluations. Whether you’re validating a new model’s capabilities or conducting rigorous academic research, you’ll find benchmarks suited to assessing your AI system’s performance. See Available Benchmarks for the complete catalog.

Available via Launcher#

# List all available benchmarks
nemo-evaluator-launcher ls

# Output as JSON for programmatic filtering
nemo-evaluator-launcher ls --json

# Filter for specific task types (example: mmlu, gsm8k, and arc_challenge)
nemo-evaluator-launcher ls | grep -E "(mmlu|gsm8k|arc)"
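
The same filtering can be done programmatically once you have captured the listing. A minimal sketch in Python (the task names below are illustrative stand-ins for real launcher output, whose exact format may vary by version):

```python
import re

# Stand-in for captured `nemo-evaluator-launcher ls` output;
# the real listing format may differ by launcher version.
listing = [
    "mmlu_pro",
    "gsm8k_cot_instruct",
    "arc_challenge",
    "humaneval_instruct",
    "ifeval",
]

# Same pattern as the grep example above
pattern = re.compile(r"(mmlu|gsm8k|arc)")
matches = [task for task in listing if pattern.search(task)]
print(matches)  # → ['mmlu_pro', 'gsm8k_cot_instruct', 'arc_challenge']
```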

Available via Direct Container Access#

# List benchmarks available in the container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

Choosing Benchmarks for Academic Research#

Benchmark Selection Guide

For General Knowledge:

  • mmlu_pro - Expert-level knowledge across 14 domains

  • gpqa_diamond - Graduate-level science questions

For Mathematical & Quantitative Reasoning:

  • AIME_2025 - American Invitational Mathematics Examination (AIME) 2025 questions

  • mgsm - Multilingual math reasoning

For Instruction Following & Alignment:

  • ifbench - Precise instruction following

  • mtbench - Multi-turn conversation quality

See benchmark categories below and Available Benchmarks for more details.

Benchmark Categories#

Academic and Reasoning#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| simple-evals | Common evaluation tasks | Link | GPQA-D, MATH-500, AIME 24 & 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench |
| lm-evaluation-harness | Language model benchmarks | Link | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |
| hle | Academic knowledge and problem solving | Link | HLE |
| ifbench | Instruction following | Link | IFBench |
| mtbench | Multi-turn conversation evaluation | Link | MT-Bench |
| nemo-skills | Language model benchmarks (science, math, agentic) | Link | AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro |
| profbench | Professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA | Link | Report Generation, LLM Judge |

Note

BFCL tasks from the nemo-skills container require function calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ifeval
    - name: gsm8k_cot_instruct
    - name: gpqa_diamond

Run evaluation:

export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Code Generation#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| bigcode-evaluation-harness | Code generation evaluation | Link | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |
| livecodebench | Coding | Link | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |
| scicode | Coding for scientific research | Link | SciCode |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: humaneval_instruct
    - name: mbpp

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Safety and Security#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| garak | Safety and vulnerability testing | Link | Garak |
| safety-harness | Safety and bias evaluation | Link | Aegis v2, WildGuard |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: aegis_v2
    - name: garak

Run evaluation:

export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Function Calling#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| bfcl | Function calling | Link | BFCL v2 and v3 |
| tooltalk | Tool usage evaluation | Link | ToolTalk |

Note

Some tasks in this category require function calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: bfclv2_ast_prompting
    - name: tooltalk

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Vision-Language Models#

Container

Description

NGC Catalog

Benchmarks

vlmevalkit

Vision-language model evaluation

Link

AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA

Note

The tasks in this category require a VLM chat endpoint. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ocrbench
    - name: chartqa

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Domain-Specific#

| Container | Description | NGC Catalog | Benchmarks |
| --- | --- | --- | --- |
| helm | Holistic evaluation framework | Link | MedHelm |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: pubmed_qa
    - name: medcalc_bench

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Container Details#

For detailed specifications of each container, see NeMo Evaluator Containers.

Quick Container Access#

Pull and run any evaluation container directly:

# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:25.10

# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10

# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:25.10

Available Tasks by Container#

For a complete list of available tasks in each container:

# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

# Or use the launcher for unified access
nemo-evaluator-launcher ls tasks

Integration Patterns#

NeMo Evaluator provides multiple integration options to fit your workflow:

# Launcher CLI (recommended for most users)
nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher run --config ./local_mmlu_evaluation.yaml

# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

# Python API (for programmatic control)
# See the Python API documentation for details

Benchmark Selection Best Practices#

For Model Development#

Iterative Testing:

  • Start with limit_samples=100 for quick feedback during development

  • Run full evaluations before major releases

  • Track metrics over time to measure improvement

Configuration:

# ConfigParams is part of the nemo-evaluator Python API;
# see the Python API documentation for the exact import path.

# Development testing
params = ConfigParams(
    limit_samples=100,      # Quick iteration on a subset
    temperature=0.01,       # Near-deterministic sampling
    parallelism=4,          # Modest concurrency
)

# Production evaluation
params = ConfigParams(
    limit_samples=None,     # Full dataset
    temperature=0.01,       # Near-deterministic sampling
    parallelism=8,          # Higher throughput
)
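
To track metrics over time, one lightweight approach is to append each run’s scores to a JSON-lines log and compare consecutive runs. A minimal sketch, independent of NeMo Evaluator itself (the file name, run IDs, and score values are illustrative):

```python
import json
from pathlib import Path

LOG = Path("eval_history.jsonl")  # illustrative file name
LOG.unlink(missing_ok=True)       # start fresh for this example

def record_run(run_id: str, scores: dict) -> None:
    """Append one evaluation run's scores to the history log."""
    with LOG.open("a") as f:
        f.write(json.dumps({"run_id": run_id, "scores": scores}) + "\n")

def diff_latest(benchmark: str):
    """Return the score change on `benchmark` between the last two runs."""
    runs = [json.loads(line) for line in LOG.read_text().splitlines()]
    if len(runs) < 2:
        return None
    return runs[-1]["scores"][benchmark] - runs[-2]["scores"][benchmark]

record_run("ckpt-1000", {"mmlu_pro": 0.41, "gsm8k": 0.62})
record_run("ckpt-2000", {"mmlu_pro": 0.44, "gsm8k": 0.66})
print(diff_latest("mmlu_pro"))  # positive value means improvement
```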

For Specialized Domains#

  • Code Models: Focus on humaneval, mbpp, livecodebench

  • Instruction Models: Emphasize ifbench, mtbench

  • Multilingual Models: Include arc_multilingual, hellaswag_multilingual, mgsm

  • Safety-Critical: Prioritize safety-harness and garak evaluations
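
The domain-to-benchmark mapping above can also be expressed as a small helper that renders the `evaluation.tasks` block of a launcher `config.yml`. A sketch (the mapping mirrors the bullets above; verify task names against `nemo-evaluator-launcher ls` before use):

```python
# Suggested benchmark tasks by model specialization
# (mirrors the bullets above; verify names with `nemo-evaluator-launcher ls`).
SUGGESTED_TASKS = {
    "code": ["humaneval", "mbpp", "livecodebench"],
    "instruction": ["ifbench", "mtbench"],
    "multilingual": ["arc_multilingual", "hellaswag_multilingual", "mgsm"],
}

def tasks_yaml(model_type: str) -> str:
    """Render the evaluation.tasks section of a launcher config.yml."""
    lines = ["evaluation:", "  tasks:"]
    lines += [f"    - name: {task}" for task in SUGGESTED_TASKS[model_type]]
    return "\n".join(lines)

print(tasks_yaml("instruction"))
```

Paste the printed block into your `config.yml` alongside the `defaults` section shown in the examples above.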

Next Steps#