NeMo Evaluator Containers#

NeMo Evaluator provides a collection of specialized containers for different evaluation frameworks and tasks. Each container is optimized and tested to work seamlessly with the NVIDIA hardware and software stack, providing consistent, reproducible environments for AI model evaluation.

NGC Container Catalog#

| Container | Description | NGC Catalog | Latest Tag | Key Benchmarks |
|---|---|---|---|---|
| agentic_eval | Agentic AI evaluation framework | Link | {{ docker_compose_latest }} | agentic_eval_answer_accuracy, agentic_eval_goal_accuracy_with_reference, agentic_eval_goal_accuracy_without_reference, agentic_eval_topic_adherence, agentic_eval_tool_call_accuracy |
| bfcl | Function calling evaluation | Link | {{ docker_compose_latest }} | bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting |
| bigcode-evaluation-harness | Code generation evaluation | Link | {{ docker_compose_latest }} | humaneval, humanevalplus, mbpp, mbppplus |
| garak | Security and robustness testing | Link | {{ docker_compose_latest }} | garak |
| helm | Holistic evaluation framework | Link | {{ docker_compose_latest }} | aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic |
| hle | Academic knowledge and problem solving | Link | {{ docker_compose_latest }} | hle |
| ifbench | Instruction following evaluation | Link | {{ docker_compose_latest }} | ifbench |
| livecodebench | Live coding evaluation | Link | {{ docker_compose_latest }} | livecodebench_0724_0125, livecodebench_0824_0225 |
| lm-evaluation-harness | Language model benchmarks | Link | {{ docker_compose_latest }} | mmlu, gsm8k, hellaswag, arc_challenge, truthfulqa |
| mmath | Multilingual math reasoning | Link | {{ docker_compose_latest }} | mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_zh |
| mtbench | Multi-turn conversation evaluation | Link | {{ docker_compose_latest }} | mtbench, mtbench-cor1 |
| rag_retriever_eval | RAG system evaluation | Link | {{ docker_compose_latest }} | RAG, Retriever |
| safety-harness | Safety and bias evaluation | Link | {{ docker_compose_latest }} | aegis_v2 |
| scicode | Coding for scientific research | Link | {{ docker_compose_latest }} | scicode, scicode_background |
| simple-evals | Basic evaluation tasks | Link | {{ docker_compose_latest }} | mmlu, mmlu_pro, gpqa_diamond, humaneval, math_test_500 |
| tooltalk | Tool usage evaluation | Link | {{ docker_compose_latest }} | tooltalk |
| vlmevalkit | Vision-language model evaluation | Link | {{ docker_compose_latest }} | ai2d_judge, chartqa, ocrbench, slidevqa |

Container Categories#

  • Language Models: Containers for evaluating large language models across academic benchmarks and custom tasks. See Language Model Containers.

  • Code Generation: Specialized containers for evaluating code generation and programming capabilities. See Code Generation Containers.

  • Vision-Language: Multimodal evaluation containers for vision-language understanding and reasoning. See Vision-Language Containers.

  • Safety & Security: Containers focused on safety evaluation, bias detection, and security testing. See Safety and Security Containers.

Quick Start#

Basic Container Usage#

# Pull a container
docker pull nvcr.io/nvidia/eval-factory/<container-name>:<tag>

# Example: Pull simple-evals container
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1

# Run with GPU support
docker run --gpus all -it nvcr.io/nvidia/eval-factory/<container-name>:<tag>
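
In practice, an evaluation run usually needs a writable directory for results and details about the model endpoint under test. The sketch below illustrates that pattern with standard Docker options only; the mount path, the environment variable names (MY_API_KEY, MODEL_URL), and the endpoint URL are placeholders for illustration, not settings defined by the containers themselves.

# Illustrative invocation: mount a local results directory and pass
# endpoint details into the container. Variable names and paths are
# placeholders; consult the Container Workflows guide for the exact
# configuration each container expects.
docker run --gpus all -it --rm \
  -v "$(pwd)/results:/workspace/results" \
  -e MY_API_KEY="your-api-key" \
  -e MODEL_URL="https://your-model-endpoint/v1/chat/completions" \
  nvcr.io/nvidia/eval-factory/simple-evals:25.08.1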

Prerequisites#

  • Docker and the NVIDIA Container Toolkit for GPU support (a quick verification check follows this list)

  • NVIDIA GPU (for GPU-accelerated evaluation)

  • Sufficient disk space for models and datasets
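
Before pulling evaluation containers, you can confirm that Docker and the NVIDIA Container Toolkit can access your GPUs by running nvidia-smi inside a CUDA base image. The image tag below is only an example; pick one that matches a CUDA version available for your system.

# Sanity check: the container should print the same GPU table as nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi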

For detailed usage instructions, see the Container Workflows guide.