# Language Model Containers
Containers specialized for evaluating large language models across academic benchmarks, custom tasks, and conversation scenarios.
## Simple-Evals Container

NGC Catalog: simple-evals

Container for lightweight evaluation tasks and simple model assessments.

Use Cases:

- Simple question-answering evaluation
- Math and reasoning capabilities
- Basic Python coding

Pull Command:

```shell
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.09
```
Default Parameters:
| Parameter | Value |
|---|---|
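Pulled images are started with standard Docker commands. The sketch below is a minimal, hypothetical invocation; GPU access, the mounted results path, and the environment variable shown are assumptions for illustration, not documented container defaults:

```shell
# Hypothetical launch sketch: expose all GPUs, mount a host directory for
# results, and forward an API key for the model endpoint. The mount path
# and variable name are illustrative, not documented container defaults.
docker run --rm -it --gpus all \
  -v "$(pwd)/results:/workspace/results" \
  -e NGC_API_KEY="$NGC_API_KEY" \
  nvcr.io/nvidia/eval-factory/simple-evals:25.09
```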
## LM-Evaluation-Harness Container

NGC Catalog: lm-evaluation-harness

Container based on the Language Model Evaluation Harness framework for comprehensive language model evaluation.

Use Cases:

- Standard NLP benchmarks
- Language model performance evaluation
- Multi-task assessment
- Academic benchmark evaluation

Pull Command:

```shell
docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.09
```
Default Parameters:
| Parameter | Value |
|---|---|
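The upstream EleutherAI harness ships an `lm_eval` command-line interface. Assuming the container exposes that CLI directly (an assumption, not confirmed by this page), a run against a small Hugging Face model might look like:

```shell
# Sketch of invoking the upstream lm_eval CLI inside the container.
# Whether the container entrypoint exposes lm_eval directly is an
# assumption; the model and task choices are illustrative.
docker run --rm --gpus all \
  nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.09 \
  lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai \
    --batch_size 8
```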
## MT-Bench Container

NGC Catalog: mtbench

Container for the MT-Bench evaluation framework, designed for multi-turn conversation evaluation.

Use Cases:

- Multi-turn dialogue evaluation
- Conversation quality assessment
- Context maintenance evaluation
- Interactive AI system testing

Pull Command:

```shell
docker pull nvcr.io/nvidia/eval-factory/mtbench:25.09
```
Default Parameters:
| Parameter | Value |
|---|---|
## HELM Container

NGC Catalog: helm

Container for the Holistic Evaluation of Language Models (HELM) framework, with a focus on MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks.

Use Cases:

- Medical AI model evaluation
- Clinical task assessment
- Healthcare-specific benchmarking
- Diagnostic decision-making evaluation
- Patient communication assessment
- Medical knowledge evaluation

Pull Command:

```shell
docker pull nvcr.io/nvidia/eval-factory/helm:25.09
```
Default Parameters:
| Parameter | Value |
|---|---|
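The upstream HELM project provides a `helm-run` command-line interface. Assuming the container exposes it (an assumption, not confirmed by this page), a small evaluation might be sketched as:

```shell
# Sketch using the upstream helm-run CLI; that the container exposes it
# is an assumption. The run entry, suite name, and instance cap are
# illustrative values, not documented defaults.
docker run --rm --gpus all \
  nvcr.io/nvidia/eval-factory/helm:25.09 \
  helm-run \
    --run-entries "mmlu:subject=anatomy,model=openai/gpt2" \
    --suite my-eval-suite \
    --max-eval-instances 10
```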
## RAG Retriever Evaluation Container

NGC Catalog: rag_retriever_eval

Container for evaluating Retrieval-Augmented Generation (RAG) systems and their retrieval capabilities.

Use Cases:

- Document retrieval accuracy
- Context relevance assessment
- RAG pipeline evaluation
- Information retrieval performance

Pull Command:

```shell
docker pull nvcr.io/nvidia/eval-factory/rag_retriever_eval:25.09
```
## HLE Container

NGC Catalog: hle

Container for Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark with broad subject coverage.

Use Cases:

- Academic knowledge and problem-solving evaluation
- Multi-modal benchmark testing
- Frontier knowledge assessment
- Subject-matter expertise evaluation

Pull Command:

```shell
docker pull nvcr.io/nvidia/eval-factory/hle:25.09
```
Default Parameters:
| Parameter | Value |
|---|---|
## IFBench Container

NGC Catalog: ifbench

Container for IFBench, a challenging benchmark for precise instruction-following evaluation.

Use Cases:

- Precise instruction-following evaluation
- Out-of-distribution constraint verification
- Multi-turn constraint isolation testing
- Instruction-following robustness assessment
- Verifiable instruction compliance testing

Pull Command:

```shell
docker pull nvcr.io/nvidia/eval-factory/ifbench:25.09
```
Default Parameters:
| Parameter | Value |
|---|---|
## MMATH Container

NGC Catalog: mmath

Container for multilingual mathematical reasoning evaluation.

Use Cases:

- Multilingual mathematical reasoning evaluation
- Cross-lingual mathematical problem-solving assessment
- Mathematical reasoning robustness across languages
- Complex mathematical reasoning capability testing
- Translation quality validation for mathematical content

Pull Command:

```shell
docker pull nvcr.io/nvidia/eval-factory/mmath:25.09
```
Default Parameters:
| Parameter | Value |
|---|---|
Supported Languages: EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI