Benchmark Catalog#

This catalog covers hundreds of benchmarks across popular evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform.

Available via Launcher#

# List all available benchmarks
nemo-evaluator-launcher ls

# Output as JSON for programmatic filtering
nemo-evaluator-launcher ls --json

# Filter for specific task types (example: mmlu, gsm8k, and arc_challenge)
nemo-evaluator-launcher ls | grep -E "(mmlu|gsm8k|arc)"
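
To consume the JSON listing programmatically, the lines below are a minimal sketch: `python3 -m json.tool` only pretty-prints the output, and the `jq` filter assumes `jq` is installed and that each entry exposes a `task` field, so inspect the raw output and adjust the filter to the actual schema.

# Pretty-print the JSON listing to inspect its structure
nemo-evaluator-launcher ls --json | python3 -m json.tool | less

# Hedged example: extract task names with jq (assumes jq is installed and
# that entries expose a "task" field -- adjust to the real schema)
nemo-evaluator-launcher ls --json | jq -r '.[].task' | grep -i mmlu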

Available via Direct Container Access#

# List benchmarks available in the container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls
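
The same grep-style filtering shown for the launcher also works on the container listing; the pattern below is just an example.

# Filter the container's task list for a benchmark of interest
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls | grep -i mmlu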

Choosing Benchmarks for Academic Research#

Benchmark Selection Guide

For General Knowledge:

  • mmlu_pro - Expert-level knowledge across 14 domains

  • gpqa_diamond - Graduate-level science questions

For Mathematical & Quantitative Reasoning:

  • AIME_2025 - Questions from the 2025 American Invitational Mathematics Examination (AIME)

  • mgsm - Multilingual math reasoning

For Instruction Following & Alignment:

  • ifbench - Precise instruction following

  • mtbench - Multi-turn conversation quality

See the benchmark categories below and the Full Benchmarks List for more details; an example configuration that combines several of these picks follows.
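
The sketch below follows the same launcher pattern used in the Example Usage sections; the file name academic_config.yml is illustrative, the endpoint values mirror the examples on this page, and the task names are taken from the Full Benchmarks List.

# Write a config that combines the knowledge and math picks above
cat > academic_config.yml <<'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: mmlu_pro
    - name: gpqa_diamond
    - name: AIME_2025
    - name: mgsm
EOF

export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./academic_config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY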

Benchmark Categories#

Academic and Reasoning#

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| simple-evals | Common evaluation tasks | Link | GPQA-D, MATH-500, AIME 24 & 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench |
| lm-evaluation-harness | Language model benchmarks | Link | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |
| hle | Academic knowledge and problem solving | Link | HLE |
| ifbench | Instruction following | Link | IFBench |
| mtbench | Multi-turn conversation evaluation | Link | MT-Bench |
| nemo-skills | Language model benchmarks (science, math, agentic) | Link | AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro |
| profbench | Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA | Link | Report Generation, LLM Judge |

Note

BFCL tasks from the nemo-skills container require function-calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ifeval
    - name: gsm8k_cot_instruct
    - name: gpqa_diamond

Run evaluation:

export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Code Generation#

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| bigcode-evaluation-harness | Code generation evaluation | Link | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |
| livecodebench | Coding | Link | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |
| scicode | Coding for scientific research | Link | SciCode |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: humaneval_instruct
    - name: mbpp

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Safety and Security#

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| garak | Safety and vulnerability testing | Link | Garak |
| safety-harness | Safety and bias evaluation | Link | Aegis v2, WildGuard |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: aegis_v2
    - name: garak

Run evaluation:

export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Function Calling#

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| bfcl | Function calling | Link | BFCL v2 and v3 |
| tooltalk | Tool usage evaluation | Link | ToolTalk |

Note

Some tasks in this category require function-calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: bfclv2_ast_prompting
    - name: tooltalk

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Vision-Language Models#

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| vlmevalkit | Vision-language model evaluation | Link | AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA |

Note

The tasks in this category require a VLM chat endpoint. See Testing Endpoint Compatibility to check whether your endpoint is compatible.

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ocrbench
    - name: chartqa

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Domain-Specific#

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| helm | Holistic evaluation framework | Link | MedHelm |

Example Usage:

Create config.yml:

defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: pubmed_qa
    - name: medcalc_bench

Run evaluation:

export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY

Container Details#

For detailed specifications of each container, see NeMo Evaluator Containers.

Quick Container Access#

Pull and run any evaluation container directly:

# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:25.10

# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10

# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:25.10

Available Tasks by Container#

For a complete list of available tasks in each container:

# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

# Or use the launcher for unified access
nemo-evaluator-launcher ls tasks

Integration Patterns#

NeMo Evaluator provides multiple integration options to fit your workflow:

# Launcher CLI (recommended for most users)
nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher run --config ./local_mmlu_evaluation.yaml

# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

# Python API (for programmatic control)
# See the Python API documentation for details

Benchmark Selection Best Practices#

For Model Development#

Iterative Testing:

  • Start with limit_samples=100 for quick feedback during development

  • Run full evaluations before major releases

  • Track metrics over time to measure improvement

Configuration:

# Development testing
params = ConfigParams(
    limit_samples=100,      # Quick iteration
    temperature=0.01,       # Deterministic
    parallelism=4
)

# Production evaluation
params = ConfigParams(
    limit_samples=None,     # Full dataset
    temperature=0.01,       # Deterministic
    parallelism=8          # Higher throughput
)
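
The same two-phase approach can be sketched through the launcher configuration. In the illustration below, the file name is hypothetical, and the overrides block with the config.params.limit_samples key is an assumption about the task schema; check the launcher configuration reference for the exact override path before relying on it.

# Quick-iteration config (hypothetical file name); the overrides block and
# the config.params.limit_samples key are assumptions about the task schema
cat > dev_config.yml <<'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 100   # small sample for fast feedback
EOF

# For the release run, drop the limit_samples override (or remove the
# overrides block entirely) to evaluate on the full dataset.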

For Specialized Domains#

  • Code Models: Focus on humaneval, mbpp, livecodebench

  • Instruction Models: Emphasize ifbench, mtbench

  • Multilingual Models: Include arc_multilingual, hellaswag_multilingual, mgsm (see the sketch after this list)

  • Safety-Critical: Prioritize safety-harness and garak evaluations
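
As an example of the multilingual case, the sketch below follows the same launcher pattern as the Example Usage sections; the file name is illustrative, the endpoint values mirror the examples on this page, and the task names are taken from the Full Benchmarks List.

# Multilingual-focused config (hypothetical file name)
cat > multilingual_config.yml <<'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: arc_multilingual
    - name: hellaswag_multilingual
    - name: mgsm
EOF

export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./multilingual_config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY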

Full Benchmarks List#

| Container | Description | NGC Catalog | Latest Tag | Key Benchmarks |
|---|---|---|---|---|
| bfcl | Function calling evaluation | NGC | 25.10 | bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting |
| bigcode-evaluation-harness | Code generation evaluation | NGC | 25.10 | humaneval, humaneval_instruct, humanevalplus, mbpp, mbppplus, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts |
| compute-eval | CUDA code evaluation | NGC | 25.10 | cccl_problems, combined_problems, cuda_problems |
| garak | Security and robustness testing | NGC | 25.10 | garak |
| genai-perf | GenAI performance benchmarking | NGC | 25.10 | genai_perf_generation, genai_perf_summarization |
| helm | Holistic evaluation framework | NGC | 25.10 | aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med |
| hle | Academic knowledge and problem solving | NGC | 25.10 | hle, hle_aa_v2 |
| ifbench | Instruction following evaluation | NGC | 25.10 | ifbench, ifbench_aa_v2 |
| livecodebench | Live coding evaluation | NGC | 25.10 | AA_codegeneration, codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225, livecodebench_aa_v2, testoutputprediction |
| lm-evaluation-harness | Language model benchmarks | NGC | 25.10 | adlr_arc_challenge_llama, adlr_gsm8k_fewshot_cot, adlr_humaneval_greedy, adlr_humanevalplus_greedy, adlr_mbpp_sanitized_3shot_greedy, adlr_mbppplus_greedy_sanitized, adlr_minerva_math_nemo, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq, commonsense_qa, frames_naive, frames_naive_with_links, frames_oracle, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gpqa_diamond_cot_5_shot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str, mbpp_plus, mgsm, mgsm_cot, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox, mmlu_prox_de, mmlu_prox_es, mmlu_prox_fr, mmlu_prox_it, mmlu_prox_ja, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, winogrande |
| mmath | Multilingual math reasoning | NGC | 25.10 | mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh |
| mtbench | Multi-turn conversation evaluation | NGC | 25.10 | mtbench, mtbench-cor1 |
| nemo-skills | Language model benchmarks (science, math, agentic) | NGC | 25.10 | ns_aime2024, ns_aime2025, ns_bfcl_v3, ns_gpqa, ns_hle, ns_livecodebench, ns_mmlu, ns_mmlu_pro |
| profbench | Professional domains in Business and Scientific Research | NGC | 25.10 | llm_judge, report_generation |
| safety-harness | Safety and bias evaluation | NGC | 25.10 | aegis_v2, aegis_v2_reasoning, wildguard |
| scicode | Coding for scientific research | NGC | 25.10 | aa_scicode, scicode, scicode_aa_v2, scicode_background |
| simple-evals | Basic evaluation tasks | NGC | 25.10 | AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, AIME_2025_aa_v2, aime_2024_nemo, aime_2025_nemo, browsecomp, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_nemo, gpqa_extended, gpqa_main, healthbench, healthbench_consensus, healthbench_hard, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mgsm_aa_v2, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_aa_v2, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa |
| tooltalk | Tool usage evaluation | NGC | 25.10 | tooltalk |
| vlmevalkit | Vision-language model evaluation | NGC | 25.10 | ai2d_judge, chartqa, mathvista-mini, mmmu_judge, ocrbench, slidevqa |

Next Steps#