NeMo Evaluator Containers#

NeMo Evaluator provides a collection of specialized containers for different evaluation frameworks and tasks. Each container is optimized and tested to work seamlessly with the NVIDIA hardware and software stack, providing consistent, reproducible environments for AI model evaluation.

NGC Container Catalog#

| Container | Description | NGC Catalog | Latest Tag | Key Benchmarks |
|---|---|---|---|---|
| agentic_eval | Agentic AI evaluation framework | Link | {{ docker_compose_latest }} | agentic_eval_answer_accuracy, agentic_eval_goal_accuracy_with_reference, agentic_eval_goal_accuracy_without_reference, agentic_eval_topic_adherence, agentic_eval_tool_call_accuracy |
| bfcl | Function calling evaluation | Link | {{ docker_compose_latest }} | bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting |
| bigcode-evaluation-harness | Code generation evaluation | Link | {{ docker_compose_latest }} | humaneval, humanevalplus, mbpp, mbppplus |
| garak | Security and robustness testing | Link | {{ docker_compose_latest }} | garak |
| helm | Holistic evaluation framework | Link | {{ docker_compose_latest }} | aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic |
| hle | Academic knowledge and problem solving | Link | {{ docker_compose_latest }} | hle |
| ifbench | Instruction following evaluation | Link | {{ docker_compose_latest }} | ifbench |
| livecodebench | Live coding evaluation | Link | {{ docker_compose_latest }} | livecodebench_0724_0125, livecodebench_0824_0225 |
| lm-evaluation-harness | Language model benchmarks | Link | {{ docker_compose_latest }} | mmlu, gsm8k, hellaswag, arc_challenge, truthfulqa |
| mmath | Multilingual math reasoning | Link | {{ docker_compose_latest }} | mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_zh |
| mtbench | Multi-turn conversation evaluation | Link | {{ docker_compose_latest }} | mtbench, mtbench-cor1 |
| rag_retriever_eval | RAG system evaluation | Link | {{ docker_compose_latest }} | RAG, Retriever |
| safety-harness | Safety and bias evaluation | Link | {{ docker_compose_latest }} | aegis_v2 |
| scicode | Coding for scientific research | Link | {{ docker_compose_latest }} | scicode, scicode_background |
| simple-evals | Basic evaluation tasks | Link | {{ docker_compose_latest }} | mmlu, mmlu_pro, gpqa_diamond, humaneval, math_test_500 |
| tooltalk | Tool usage evaluation | Link | {{ docker_compose_latest }} | tooltalk |
| vlmevalkit | Vision-language model evaluation | Link | {{ docker_compose_latest }} | ai2d_judge, chartqa, ocrbench, slidevqa |

Container Categories#

  • Language Models: Containers for evaluating large language models across academic benchmarks and custom tasks. See Language Model Containers.

  • Code Generation: Specialized containers for evaluating code generation and programming capabilities. See Code Generation Containers.

  • Vision-Language: Multimodal evaluation containers for vision-language understanding and reasoning. See Vision-Language Containers.

  • Safety & Security: Containers focused on safety evaluation, bias detection, and security testing. See Safety and Security Containers.

Quick Start#

Basic Container Usage#

# Pull a container
docker pull nvcr.io/nvidia/eval-factory/<container-name>:<tag>

# Example: Pull simple-evals container
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1

# Run with GPU support
docker run --gpus all -it nvcr.io/nvidia/eval-factory/<container-name>:<tag>
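
In practice, an evaluation run usually needs a writable directory for results and details about the model endpoint under test. The sketch below illustrates that pattern with standard Docker options only; the mount path, the environment variable names (MY_API_KEY, MODEL_URL), and the endpoint URL are placeholders for illustration, not settings defined by the containers themselves.

# Illustrative invocation: mount a local results directory and pass
# endpoint details into the container. Variable names and paths are
# placeholders; consult the Container Workflows guide for the exact
# configuration each container expects.
docker run --gpus all -it --rm \
  -v "$(pwd)/results:/workspace/results" \
  -e MY_API_KEY="your-api-key" \
  -e MODEL_URL="https://your-model-endpoint/v1/chat/completions" \
  nvcr.io/nvidia/eval-factory/simple-evals:25.08.1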

Prerequisites#

  • Docker and the NVIDIA Container Toolkit for GPU support (a quick verification check follows this list)

  • NVIDIA GPU (for GPU-accelerated evaluation)

  • Sufficient disk space for models and datasets
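
Before pulling evaluation containers, you can confirm that Docker and the NVIDIA Container Toolkit can access your GPUs by running nvidia-smi inside a CUDA base image. The image tag below is only an example; pick one that matches a CUDA version available for your system.

# Sanity check: the container should print the same GPU table as nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi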

For detailed usage instructions, see the Container Workflows guide.