Benchmark Catalog#
Comprehensive catalog of hundreds of benchmarks across popular evaluation harnesses, all available through NGC containers and the NeMo Evaluator platform.
Available via Launcher#
# List all available benchmarks
nemo-evaluator-launcher ls
# Output as JSON for programmatic filtering
nemo-evaluator-launcher ls --json
# Filter for specific task types (example: mmlu, gsm8k, and arc_challenge)
nemo-evaluator-launcher ls | grep -E "(mmlu|gsm8k|arc)"
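If you need the JSON output for scripting, you can filter it with jq. The snippet below is a minimal sketch that assumes each task entry exposes a name field; check the actual schema of ls --json before relying on it.
# Pull task names out of the JSON listing and filter them (schema assumed; verify first)
nemo-evaluator-launcher ls --json | jq -r '.. | .name? // empty' | grep -i mmlu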
Available via Direct Container Access#
# List benchmarks available in the container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls
Choosing Benchmarks for Academic Research#
Benchmark Selection Guide
For General Knowledge:
- mmlu_pro - Expert-level knowledge across 14 domains
- gpqa_diamond - Graduate-level science questions
For Mathematical & Quantitative Reasoning:
- AIME_2025 - American Invitational Mathematics Examination (AIME) 2025 questions
- mgsm - Multilingual math reasoning
For Instruction Following & Alignment:
- ifbench - Precise instruction following
- mtbench - Multi-turn conversation quality
See benchmark categories below and Full Benchmarks List for more details.
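As a concrete starting point, the picks above can be combined into a single launcher config. This is a sketch only: the task names (mmlu_pro, gpqa_diamond, AIME_2025, mgsm, ifbench, mtbench) are taken from the Full Benchmarks List below, so trim the list to the subset and container versions you actually need.
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: mmlu_pro
    - name: gpqa_diamond
    - name: AIME_2025
    - name: mgsm
    - name: ifbench
    - name: mtbench
Run it with the same nemo-evaluator-launcher run command shown in the category examples below.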
Benchmark Categories#
Academic and Reasoning#
| Container | Description | Benchmarks |
|---|---|---|
| simple-evals | Common evaluation tasks | GPQA-D, MATH-500, AIME 24 & 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench |
| lm-evaluation-harness | Language model benchmarks | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |
| hle | Academic knowledge and problem solving | HLE |
| ifbench | Instruction following | IFBench |
| mtbench | Multi-turn conversation evaluation | MT-Bench |
| nemo-skills | Language model benchmarks (science, math, agentic) | AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro |
| profbench | Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA | Report Generation, LLM Judge |
Note
BFCL tasks from the nemo-skills container require function calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.
Example Usage:
Create config.yml:
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ifeval
    - name: gsm8k_cot_instruct
    - name: gpqa_diamond
Run evaluation:
export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
Code Generation#
| Container | Description | Benchmarks |
|---|---|---|
| bigcode-evaluation-harness | Code generation evaluation | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |
| livecodebench | Coding | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |
| scicode | Coding for scientific research | SciCode |
Example Usage:
Create config.yml:
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: humaneval_instruct
    - name: mbpp
Run evaluation:
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
Safety and Security#
| Container | Description | Benchmarks |
|---|---|---|
| garak | Safety and vulnerability testing | Garak |
| safety-harness | Safety and bias evaluation | Aegis v2, WildGuard |
Example Usage:
Create config.yml:
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: aegis_v2
    - name: garak
Run evaluation:
export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
Function Calling#
| Container | Description | Benchmarks |
|---|---|---|
| bfcl | Function calling | BFCL v2 and v3 |
| tooltalk | Tool usage evaluation | ToolTalk |
Note
Some tasks in this category require function calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.
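As a quick pre-check, you can send a single tool-calling request to the endpoint before launching these tasks. The sketch below assumes an OpenAI-compatible chat endpoint (the same one used in the examples on this page) and a hypothetical get_weather tool; a response containing a tool_calls entry is a good sign that function calling is supported.
# One-off tool-calling probe (hypothetical get_weather tool; adjust model and URL as needed)
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'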
Example Usage:
Create config.yml:
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: bfclv2_ast_prompting
    - name: tooltalk
Run evaluation:
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
Vision-Language Models#
| Container | Description | Benchmarks |
|---|---|---|
| vlmevalkit | Vision-language model evaluation | AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA |
Note
The tasks in this category require a VLM chat endpoint. See Testing Endpoint Compatibility to check whether your endpoint is compatible.
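As a quick pre-check, you can send a single image request to the endpoint before launching these tasks. The sketch below assumes an OpenAI-compatible VLM chat endpoint; the model id and image URL are placeholders, and the text model shown in the run command below would need to be swapped for a vision-capable model when running these benchmarks.
# One-off image-understanding probe (placeholder model id and image URL)
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<your-vlm-model-id>",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}}
          ]
        }]
      }'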
Example Usage:
Create config.yml:
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ocrbench
    - name: chartqa
Run evaluation:
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
Domain-Specific#
| Container | Description | Benchmarks |
|---|---|---|
| helm | Holistic evaluation framework | MedHelm |
Example Usage:
Create config.yml:
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: pubmed_qa
    - name: medcalc_bench
Run evaluation:
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
--config ./config.yml \
-o execution.output_dir=results \
-o +target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o +target.api_endpoint.api_key_name=NGC_API_KEY
Container Details#
For detailed specifications of each container, see NeMo Evaluator Containers.
Quick Container Access#
Pull and run any evaluation container directly:
# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:25.10
# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10
# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:25.10
Available Tasks by Container#
For a complete list of available tasks in each container:
# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls
# Or use the launcher for unified access
nemo-evaluator-launcher ls tasks
Integration Patterns#
NeMo Evaluator provides multiple integration options to fit your workflow:
# Launcher CLI (recommended for most users)
nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher run --config ./local_mmlu_evaluation.yaml
# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls
# Python API (for programmatic control)
# See the Python API documentation for details
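For programmatic control, the nemo-evaluator Python API can be driven directly. The sketch below is an assumption-heavy illustration: the import path and the exact names of ApiEndpoint, EvaluationTarget, EvaluationConfig, EndpointType, and evaluate may differ between releases, so treat it as the shape of the workflow and confirm every name against the Python API documentation.
# Hypothetical sketch of programmatic evaluation; verify class/function names
# and import paths against the Python API documentation for your installed version.
from nemo_evaluator import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
    evaluate,
)

target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        model_id="meta/llama-3.1-8b-instruct",
        type=EndpointType.CHAT,
        api_key="NGC_API_KEY",  # name of the env var holding the key (field name assumed)
    )
)
config = EvaluationConfig(
    type="mmlu_pro",                        # task name from the catalog
    output_dir="results",
    params=ConfigParams(limit_samples=10),  # small subset for a smoke test
)
evaluate(eval_cfg=config, target_cfg=target)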
Benchmark Selection Best Practices#
For Model Development#
Iterative Testing:
- Start with limit_samples=100 for quick feedback during development
- Run full evaluations before major releases
- Track metrics over time to measure improvement
Configuration:
# Development testing
# (ConfigParams is part of the nemo-evaluator Python API; see the Python API
# documentation for the exact import path.)
params = ConfigParams(
    limit_samples=100,  # Quick iteration on a small subset
    temperature=0.01,   # Near-deterministic outputs
    parallelism=4,      # Modest request concurrency
)

# Production evaluation
params = ConfigParams(
    limit_samples=None,  # Full dataset
    temperature=0.01,    # Near-deterministic outputs
    parallelism=8,       # Higher throughput
)
For Specialized Domains#
- Code Models: Focus on humaneval, mbpp, and livecodebench (see the config sketch after this list)
- Instruction Models: Emphasize ifbench and mtbench
- Multilingual Models: Include arc_multilingual, hellaswag_multilingual, and mgsm
- Safety-Critical: Prioritize safety-harness and garak evaluations
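For example, a code-model run could start from the sketch below; humaneval and mbpp are tasks from the bigcode-evaluation-harness container and codegeneration_release_latest comes from livecodebench (see the Full Benchmarks List), so adjust the names to the containers and versions you use.
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: humaneval
    - name: mbpp
    - name: codegeneration_release_latest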
Full Benchmarks List#
| Container | Description | Latest Tag | Key Benchmarks |
|---|---|---|---|
| bfcl | Function calling evaluation | 25.10 | bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting |
| bigcode-evaluation-harness | Code generation evaluation | 25.10 | humaneval, humaneval_instruct, humanevalplus, mbpp, mbppplus, mbppplus_nemo, multiple-cpp, multiple-cs, multiple-d, multiple-go, multiple-java, multiple-jl, multiple-js, multiple-lua, multiple-php, multiple-pl, multiple-py, multiple-r, multiple-rb, multiple-rkt, multiple-rs, multiple-scala, multiple-sh, multiple-swift, multiple-ts |
| compute-eval | CUDA code evaluation | 25.10 | cccl_problems, combined_problems, cuda_problems |
| garak | Security and robustness testing | 25.10 | garak |
| genai-perf | GenAI performance benchmarking | 25.10 | genai_perf_generation, genai_perf_summarization |
| helm | Holistic evaluation framework | 25.10 | aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic, med_dialog_icliniq, medbullets, medcalc_bench, medec, medhallu, medi_qa, medication_qa, mtsamples_procedures, mtsamples_replicate, pubmed_qa, race_based_med |
| hle | Academic knowledge and problem solving | 25.10 | hle, hle_aa_v2 |
| ifbench | Instruction following evaluation | 25.10 | ifbench, ifbench_aa_v2 |
| livecodebench | Live coding evaluation | 25.10 | AA_codegeneration, codeexecution_v2, codeexecution_v2_cot, codegeneration_notfast, codegeneration_release_latest, codegeneration_release_v1, codegeneration_release_v2, codegeneration_release_v3, codegeneration_release_v4, codegeneration_release_v5, codegeneration_release_v6, livecodebench_0724_0125, livecodebench_0824_0225, livecodebench_aa_v2, testoutputprediction |
| lm-evaluation-harness | Language model benchmarks | 25.10 | adlr_arc_challenge_llama, adlr_gsm8k_fewshot_cot, adlr_humaneval_greedy, adlr_humanevalplus_greedy, adlr_mbpp_sanitized_3shot_greedy, adlr_mbppplus_greedy_sanitized, adlr_minerva_math_nemo, adlr_mmlu_pro_5_shot_base, adlr_race, adlr_truthfulqa_mc2, agieval, arc_challenge, arc_challenge_chat, arc_multilingual, bbh, bbh_instruct, bbq, commonsense_qa, frames_naive, frames_naive_with_links, frames_oracle, global_mmlu, global_mmlu_ar, global_mmlu_bn, global_mmlu_de, global_mmlu_en, global_mmlu_es, global_mmlu_fr, global_mmlu_full, global_mmlu_full_am, global_mmlu_full_ar, global_mmlu_full_bn, global_mmlu_full_cs, global_mmlu_full_de, global_mmlu_full_el, global_mmlu_full_en, global_mmlu_full_es, global_mmlu_full_fa, global_mmlu_full_fil, global_mmlu_full_fr, global_mmlu_full_ha, global_mmlu_full_he, global_mmlu_full_hi, global_mmlu_full_id, global_mmlu_full_ig, global_mmlu_full_it, global_mmlu_full_ja, global_mmlu_full_ko, global_mmlu_full_ky, global_mmlu_full_lt, global_mmlu_full_mg, global_mmlu_full_ms, global_mmlu_full_ne, global_mmlu_full_nl, global_mmlu_full_ny, global_mmlu_full_pl, global_mmlu_full_pt, global_mmlu_full_ro, global_mmlu_full_ru, global_mmlu_full_si, global_mmlu_full_sn, global_mmlu_full_so, global_mmlu_full_sr, global_mmlu_full_sv, global_mmlu_full_sw, global_mmlu_full_te, global_mmlu_full_tr, global_mmlu_full_uk, global_mmlu_full_vi, global_mmlu_full_yo, global_mmlu_full_zh, global_mmlu_hi, global_mmlu_id, global_mmlu_it, global_mmlu_ja, global_mmlu_ko, global_mmlu_pt, global_mmlu_sw, global_mmlu_yo, global_mmlu_zh, gpqa, gpqa_diamond_cot, gpqa_diamond_cot_5_shot, gsm8k, gsm8k_cot_instruct, gsm8k_cot_llama, gsm8k_cot_zeroshot, gsm8k_cot_zeroshot_llama, hellaswag, hellaswag_multilingual, humaneval_instruct, ifeval, m_mmlu_id_str, mbpp_plus, mgsm, mgsm_cot, mmlu, mmlu_cot_0_shot_chat, mmlu_instruct, mmlu_logits, mmlu_pro, mmlu_pro_instruct, mmlu_prox, mmlu_prox_de, mmlu_prox_es, mmlu_prox_fr, mmlu_prox_it, mmlu_prox_ja, mmlu_redux, mmlu_redux_instruct, musr, openbookqa, piqa, social_iqa, truthfulqa, wikilingua, winogrande |
| mmath | Multilingual math reasoning | 25.10 | mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_ja, mmath_ko, mmath_pt, mmath_th, mmath_vi, mmath_zh |
| mtbench | Multi-turn conversation evaluation | 25.10 | mtbench, mtbench-cor1 |
| nemo-skills | Language model benchmarks (science, math, agentic) | 25.10 | ns_aime2024, ns_aime2025, ns_bfcl_v3, ns_gpqa, ns_hle, ns_livecodebench, ns_mmlu, ns_mmlu_pro |
| profbench | Professional domains in Business and Scientific Research | 25.10 | llm_judge, report_generation |
| safety-harness | Safety and bias evaluation | 25.10 | aegis_v2, aegis_v2_reasoning, wildguard |
| scicode | Coding for scientific research | 25.10 | aa_scicode, scicode, scicode_aa_v2, scicode_background |
| simple-evals | Basic evaluation tasks | 25.10 | AA_AIME_2024, AA_math_test_500, AIME_2024, AIME_2025, AIME_2025_aa_v2, aime_2024_nemo, aime_2025_nemo, browsecomp, gpqa_diamond, gpqa_diamond_aa_v2, gpqa_diamond_aa_v2_llama_4, gpqa_diamond_nemo, gpqa_extended, gpqa_main, healthbench, healthbench_consensus, healthbench_hard, humaneval, humanevalplus, math_test_500, math_test_500_nemo, mgsm, mgsm_aa_v2, mmlu, mmlu_am, mmlu_ar, mmlu_ar-lite, mmlu_bn, mmlu_bn-lite, mmlu_cs, mmlu_de, mmlu_de-lite, mmlu_el, mmlu_en, mmlu_en-lite, mmlu_es, mmlu_es-lite, mmlu_fa, mmlu_fil, mmlu_fr, mmlu_fr-lite, mmlu_ha, mmlu_he, mmlu_hi, mmlu_hi-lite, mmlu_id, mmlu_id-lite, mmlu_ig, mmlu_it, mmlu_it-lite, mmlu_ja, mmlu_ja-lite, mmlu_ko, mmlu_ko-lite, mmlu_ky, mmlu_llama_4, mmlu_lt, mmlu_mg, mmlu_ms, mmlu_my-lite, mmlu_ne, mmlu_nl, mmlu_ny, mmlu_pl, mmlu_pro, mmlu_pro_aa_v2, mmlu_pro_llama_4, mmlu_pt, mmlu_pt-lite, mmlu_ro, mmlu_ru, mmlu_si, mmlu_sn, mmlu_so, mmlu_sr, mmlu_sv, mmlu_sw, mmlu_sw-lite, mmlu_te, mmlu_tr, mmlu_uk, mmlu_vi, mmlu_yo, mmlu_yo-lite, mmlu_zh-lite, simpleqa |
| tooltalk | Tool usage evaluation | 25.10 | tooltalk |
| vlmevalkit | Vision-language model evaluation | 25.10 | ai2d_judge, chartqa, mathvista-mini, mmmu_judge, ocrbench, slidevqa |
Next Steps#
- Container Details: Browse NeMo Evaluator Containers for complete specifications
- Custom Benchmarks: Learn about the Framework Definition File (FDF) for custom evaluations