# About Selecting Benchmarks

NeMo Evaluator provides a comprehensive suite of benchmarks spanning academic reasoning, code generation, safety testing, and domain-specific evaluations. Whether you're validating a new model's capabilities or conducting rigorous academic research, you'll find benchmarks suited to assessing your AI system's performance. See Available Benchmarks for the complete catalog.
## Available via Launcher

```shell
# List all available benchmarks
nemo-evaluator-launcher ls

# Output as JSON for programmatic filtering
nemo-evaluator-launcher ls --json

# Filter for specific task types (example: mmlu, gsm8k, and arc_challenge)
nemo-evaluator-launcher ls | grep -E "(mmlu|gsm8k|arc)"
```
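The `--json` flag makes the same filtering easy to do programmatically. A minimal sketch: the JSON shape below (a top-level `tasks` list with `name` and `container` keys) is an assumption for illustration, not the documented schema — check the actual output of your launcher version and adapt the key names.

```python
import json
import re

# Hypothetical JSON shape for `nemo-evaluator-launcher ls --json` output;
# inspect the real output from your launcher version and adjust the keys.
catalog_json = """
{"tasks": [
  {"name": "mmlu_pro", "container": "simple-evals"},
  {"name": "gsm8k_cot_instruct", "container": "lm-evaluation-harness"},
  {"name": "arc_challenge", "container": "lm-evaluation-harness"},
  {"name": "ocrbench", "container": "vlmevalkit"}
]}
"""

def filter_tasks(catalog: dict, pattern: str) -> list:
    """Return task names matching a regex, mirroring the grep filter above."""
    rx = re.compile(pattern)
    return [task["name"] for task in catalog["tasks"] if rx.search(task["name"])]

catalog = json.loads(catalog_json)
print(filter_tasks(catalog, r"mmlu|gsm8k|arc"))
# ['mmlu_pro', 'gsm8k_cot_instruct', 'arc_challenge']
```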
## Available via Direct Container Access

```shell
# List benchmarks available in the container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls
```
## Choosing Benchmarks for Academic Research

**Benchmark Selection Guide**

For General Knowledge:

- `mmlu_pro` - Expert-level knowledge across 14 domains
- `gpqa_diamond` - Graduate-level science questions

For Mathematical & Quantitative Reasoning:

- `AIME_2025` - American Invitational Mathematics Examination (AIME) 2025 questions
- `mgsm` - Multilingual math reasoning

For Instruction Following & Alignment:

- `ifbench` - Precise instruction following
- `mtbench` - Multi-turn conversation quality

See the benchmark categories below and Available Benchmarks for more details.
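The selection guide above can be captured as a small lookup table so that scripted runs pick benchmarks by goal. The helper below is ours, not a launcher API — only the benchmark names come from the guide.

```python
# Illustrative helper (ours, not a launcher API) that turns the selection
# guide above into a lookup table keyed by evaluation goal.
SELECTION_GUIDE = {
    "general_knowledge": ["mmlu_pro", "gpqa_diamond"],
    "math_reasoning": ["AIME_2025", "mgsm"],
    "instruction_following": ["ifbench", "mtbench"],
}

def recommend(goals):
    """Collect benchmarks for the requested goals, order-preserving, no dupes."""
    picked = []
    for goal in goals:
        for task in SELECTION_GUIDE.get(goal, []):
            if task not in picked:
                picked.append(task)
    return picked

print(recommend(["general_knowledge", "instruction_following"]))
# ['mmlu_pro', 'gpqa_diamond', 'ifbench', 'mtbench']
```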
## Benchmark Categories

### Academic and Reasoning

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| simple-evals | Common evaluation tasks | | GPQA-D, MATH-500, AIME 24 & 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench |
| lm-evaluation-harness | Language model benchmarks | | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |
| hle | Academic knowledge and problem solving | | HLE |
| ifbench | Instruction following | | IFBench |
| mtbench | Multi-turn conversation evaluation | | MT-Bench |
| nemo-skills | Language model benchmarks (science, math, agentic) | | AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro |
| profbench | Evaluation of professional knowledge across Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA | | Report Generation, LLM Judge |
> **Note:** BFCL tasks from the nemo-skills container require function calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.
Example Usage:

Create `config.yml`:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ifeval
    - name: gsm8k_cot_instruct
    - name: gpqa_diamond
```

Run evaluation:

```shell
export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY
```
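The `-o key=value` flags above are Hydra-style dotted overrides: each dotted path addresses a node in the nested config, and a leading `+` adds a key that is absent from the base config. A simplified sketch of that mechanism (the real launcher uses Hydra, which additionally handles value typing, interpolation, and stricter `+` semantics):

```python
# Simplified model (ours) of how Hydra-style `-o key=value` flags land in the
# nested config; real Hydra also handles types, interpolation, and stricter
# rules for the `+` (add-new-key) prefix.
def apply_override(config, override):
    key, value = override.lstrip("+").split("=", 1)
    node = config
    *parents, leaf = key.split(".")
    for part in parents:
        node = node.setdefault(part, {})  # walk/create intermediate dicts
    node[leaf] = value
    return config

config = {"execution": {"output_dir": "default"}}
apply_override(config, "execution.output_dir=results")
apply_override(config, "+target.api_endpoint.model_id=meta/llama-3.2-3b-instruct")
print(config["execution"]["output_dir"])             # results
print(config["target"]["api_endpoint"]["model_id"])  # meta/llama-3.2-3b-instruct
```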
### Code Generation

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| bigcode-evaluation-harness | Code generation evaluation | | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |
| livecodebench | Coding | | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |
| scicode | Coding for scientific research | | SciCode |
Example Usage:

Create `config.yml`:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: humaneval_instruct
    - name: mbpp
```

Run evaluation:

```shell
export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY
```
### Safety and Security

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| garak | Safety and vulnerability testing | | Garak |
| safety-harness | Safety and bias evaluation | | Aegis v2, WildGuard |
Example Usage:

Create `config.yml`:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: aegis_v2
    - name: garak
```

Run evaluation:

```shell
export NGC_API_KEY=nvapi-...
export HF_TOKEN=hf_...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY
```
### Function Calling

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| bfcl | Function calling | | BFCL v2 and v3 |
| tooltalk | Tool usage evaluation | | ToolTalk |
> **Note:** Some of the tasks in this category require function calling capabilities. See Testing Endpoint Compatibility to check whether your endpoint is compatible.
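One quick manual probe for function-calling support is to send a chat-completions request that includes a `tools` entry and see whether the endpoint accepts it. The sketch below only builds the request body using the OpenAI-compatible schema; the `get_weather` tool is hypothetical, and actually POSTing the body to your endpoint is left out.

```python
import json

# Minimal OpenAI-compatible chat-completions body with a `tools` entry; the
# get_weather tool is hypothetical. POSTing this body to your endpoint (not
# shown) is a quick probe for function-calling support before running BFCL
# or ToolTalk.
payload = {
    "model": "meta/llama-3.2-3b-instruct",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

body = json.dumps(payload)  # ready to send as the HTTP request body
print(json.loads(body)["tools"][0]["function"]["name"])  # get_weather
```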
Example Usage:

Create `config.yml`:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: bfclv2_ast_prompting
    - name: tooltalk
```

Run evaluation:

```shell
export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY
```
### Vision-Language Models

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| vlmevalkit | Vision-language model evaluation | | AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA |
> **Note:** The tasks in this category require a VLM chat endpoint. See Testing Endpoint Compatibility to check whether your endpoint is compatible.
Example Usage:

Create `config.yml`:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: ocrbench
    - name: chartqa
```

Run evaluation:

```shell
export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY
```
### Domain-Specific

| Container | Description | NGC Catalog | Benchmarks |
|---|---|---|---|
| helm | Holistic evaluation framework | | MedHelm |
Example Usage:

Create `config.yml`:

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

evaluation:
  tasks:
    - name: pubmed_qa
    - name: medcalc_bench
```

Run evaluation:

```shell
export NGC_API_KEY=nvapi-...

nemo-evaluator-launcher run \
    --config ./config.yml \
    -o execution.output_dir=results \
    -o +target.api_endpoint.model_id=meta/llama-3.2-3b-instruct \
    -o +target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o +target.api_endpoint.api_key_name=NGC_API_KEY
```
## Container Details

For detailed specifications of each container, see NeMo Evaluator Containers.
### Quick Container Access

Pull and run any evaluation container directly:

```shell
# Academic benchmarks
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:25.10

# Code generation
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.10

# Safety evaluation
docker pull nvcr.io/nvidia/eval-factory/safety-harness:25.10
docker run --rm -it nvcr.io/nvidia/eval-factory/safety-harness:25.10
```
### Available Tasks by Container

For a complete list of available tasks in each container:

```shell
# List tasks in any container
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

# Or use the launcher for unified access
nemo-evaluator-launcher ls tasks
```
## Integration Patterns

NeMo Evaluator provides multiple integration options to fit your workflow:

```shell
# Launcher CLI (recommended for most users)
nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher run --config ./local_mmlu_evaluation.yaml

# Container direct execution
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.10 nemo-evaluator ls

# Python API (for programmatic control)
# See the Python API documentation for details
```
## Benchmark Selection Best Practices

### For Model Development

Iterative Testing:

- Start with `limit_samples=100` for quick feedback during development
- Run full evaluations before major releases
- Track metrics over time to measure improvement
Configuration:

```python
# Development testing
params = ConfigParams(
    limit_samples=100,   # Quick iteration
    temperature=0.01,    # Deterministic
    parallelism=4,
)

# Production evaluation
params = ConfigParams(
    limit_samples=None,  # Full dataset
    temperature=0.01,    # Deterministic
    parallelism=8,       # Higher throughput
)
```
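The two configurations differ only in sample limit and parallelism, so a single flag can switch between them. A minimal sketch using a plain dict as a stand-in for the `ConfigParams` arguments, so it stays self-contained:

```python
# Plain-dict stand-in (ours) for the ConfigParams arguments above, so one
# flag flips between quick development runs and full production evaluations.
def make_params(production):
    return {
        "limit_samples": None if production else 100,
        "temperature": 0.01,                   # deterministic in both modes
        "parallelism": 8 if production else 4,
    }

print(make_params(production=False)["limit_samples"])  # 100
print(make_params(production=True)["parallelism"])     # 8
```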
### For Specialized Domains

- Code Models: Focus on `humaneval`, `mbpp`, `livecodebench`
- Instruction Models: Emphasize `ifbench`, `mtbench`
- Multilingual Models: Include `arc_multilingual`, `hellaswag_multilingual`, `mgsm`
- Safety-Critical: Prioritize `safety-harness` and `garak` evaluations
## Next Steps

- Container Details: Browse NeMo Evaluator Containers for complete specifications
- Custom Benchmarks: Learn the Framework Definition File (FDF) format for custom evaluations