# NeMo Evaluator Containers
NeMo Evaluator provides a collection of specialized containers for different evaluation frameworks and tasks. Each container is optimized and tested to work seamlessly with the NVIDIA hardware and software stack, providing consistent, reproducible environments for AI model evaluation.
## NGC Container Catalog
| Container | Description | Key Benchmarks |
|---|---|---|
| `agentic_eval` | Agentic AI evaluation framework | agentic_eval_answer_accuracy, agentic_eval_goal_accuracy_with_reference, agentic_eval_goal_accuracy_without_reference, agentic_eval_topic_adherence, agentic_eval_tool_call_accuracy |
| `bfcl` | Function calling evaluation | bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting |
| `bigcode-evaluation-harness` | Code generation evaluation | humaneval, humanevalplus, mbpp, mbppplus |
| `garak` | Security and robustness testing | garak |
| `helm` | Holistic evaluation framework | aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic |
| `hle` | Academic knowledge and problem solving | hle |
| `ifbench` | Instruction following evaluation | ifbench |
| `livecodebench` | Live coding evaluation | livecodebench_0724_0125, livecodebench_0824_0225 |
| `lm-evaluation-harness` | Language model benchmarks | mmlu, gsm8k, hellaswag, arc_challenge, truthfulqa |
| `mmath` | Multilingual math reasoning | mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_zh |
| `mtbench` | Multi-turn conversation evaluation | mtbench, mtbench-cor1 |
| `rag_retriever_eval` | RAG system evaluation | RAG, Retriever |
| `safety-harness` | Safety and bias evaluation | aegis_v2 |
| `scicode` | Coding for scientific research | scicode, scicode_background |
| `simple-evals` | Basic evaluation tasks | mmlu, mmlu_pro, gpqa_diamond, humaneval, math_test_500 |
| `tooltalk` | Tool usage evaluation | tooltalk |
| `vlmevalkit` | Vision-language model evaluation | ai2d_judge, chartqa, ocrbench, slidevqa |
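All of these images are published to the NGC catalog under the `nvcr.io/nvidia/eval-factory` namespace, so pulling several of them can be scripted. The loop below is a minimal sketch that assumes each container publishes the `25.08.1` tag shown in the Quick Start; that may not hold for every image, so check each container's NGC page for its latest tag.

```bash
# Pull a set of evaluation containers in one pass.
# NOTE: 25.08.1 is the tag shown in the Quick Start for simple-evals;
# other containers may publish different tags. Verify on NGC first.
TAG="25.08.1"
for name in simple-evals lm-evaluation-harness bigcode-evaluation-harness; do
  docker pull "nvcr.io/nvidia/eval-factory/${name}:${TAG}"
done
```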
## Container Categories
- **Language model evaluation:** Containers for evaluating large language models across academic benchmarks and custom tasks.
- **Code generation:** Specialized containers for evaluating code generation and programming capabilities.
- **Vision-language:** Multimodal evaluation containers for vision-language understanding and reasoning.
- **Safety and security:** Containers focused on safety evaluation, bias detection, and security testing.
## Quick Start
### Basic Container Usage
```bash
# Pull a container
docker pull nvcr.io/nvidia/eval-factory/<container-name>:<tag>

# Example: pull the simple-evals container
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1

# Run with GPU support
docker run --gpus all -it nvcr.io/nvidia/eval-factory/<container-name>:<tag>
```
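In practice, you will usually want to mount a host directory for evaluation outputs and pass credentials for the model endpoint being scored. The snippet below is a minimal sketch rather than a documented container contract: the `/results` mount point and the `MODEL_API_KEY` variable are illustrative placeholders, and each container's actual entrypoint and configuration are described in the Container Workflows guide.

```bash
# Run the simple-evals container with GPU access, an API key passed
# through from the host environment, and a host directory mounted for
# outputs. NOTE: /results and MODEL_API_KEY are illustrative placeholders,
# not a documented contract of the image.
docker run --rm -it --gpus all \
  -e MODEL_API_KEY \
  -v "$PWD/results:/results" \
  nvcr.io/nvidia/eval-factory/simple-evals:25.08.1
```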
### Prerequisites
- Docker and the NVIDIA Container Toolkit (for GPU support)
- An NVIDIA GPU (for GPU-accelerated evaluation)
- Sufficient disk space for models and datasets
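Before pulling any images, you can confirm that the GPU prerequisites are in place with a few standard checks; the CUDA base image tag used below is only an example.

```bash
# Check that Docker is installed
docker --version

# Check that the NVIDIA driver can see the GPU
nvidia-smi

# Check that the NVIDIA Container Toolkit exposes GPUs inside containers
# (any CUDA base image works; this tag is just an example)
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```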
For detailed usage instructions, see the Container Workflows guide.