Installation Guide#
NeMo Evaluator provides multiple installation paths depending on your needs. Choose the approach that best fits your use case.
Choose Your Installation Path#
| Installation Path | Best For | Key Features |
|---|---|---|
| NeMo Evaluator Launcher (Recommended) | Most users who want unified CLI and orchestration across backends | Unified CLI for 100+ benchmarks |
| NeMo Evaluator Core | Developers building custom evaluation pipelines | Programmatic Python API |
| Container Direct | Users who prefer container-based workflows | Pre-built NGC evaluation containers |
Prerequisites#
System Requirements#
- Python 3.10 or higher (supports 3.10, 3.11, 3.12, and 3.13)
- CUDA-compatible GPU(s) (tested on RTX A6000, A100, H100)
- Docker (for container-based workflows)

Recommended Environment#

- Python 3.12
- PyTorch 2.7
- CUDA 12.9
- Ubuntu 24.04
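If you want to confirm your environment before installing, a quick check along these lines can help (it only assumes the tools listed above are on your PATH):

```bash
# Verify the prerequisites listed above
python3 --version   # expect Python 3.10 or newer
nvidia-smi          # confirms the GPU and CUDA driver are visible
docker --version    # only needed for container-based workflows
```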
Installation Methods#
Use pip#
Install NeMo Evaluator Launcher for unified CLI and orchestration:
```bash
# Create and activate virtual environment
python3 -m venv nemo-eval-env
source nemo-eval-env/bin/activate

# Install launcher with all exporters (recommended)
pip install nemo-evaluator-launcher[all]
```
Quick verification:
```bash
# Verify installation
nv-eval --version

# Test basic functionality - list available tasks
nv-eval ls tasks | head -10
```
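Because the task list is long, you can filter it with standard shell tools; for example (the `mmlu` filter string is just an illustration):

```bash
# Narrow the task list to a benchmark family of interest
nv-eval ls tasks | grep -i mmlu
```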
Install NeMo Evaluator Core for programmatic access:
```bash
# Create and activate virtual environment
python3 -m venv nemo-eval-env
source nemo-eval-env/bin/activate

# Install core library with dependencies
pip install torch==2.7.0 setuptools pybind11 wheel_stub  # Required for TE
pip install --no-build-isolation nemo-evaluator

# Install evaluation frameworks
pip install nvidia-simple-evals nvidia-lm-eval
```
Quick verification (the `nemo_evaluator` module name is assumed from the pip package name):

```python
# Quick import check of the core library
import nemo_evaluator

print("✓ Core library installed successfully")
print("✓ Adapter system ready")
```
Use containers#
Use pre-built evaluation containers from NVIDIA NGC for guaranteed reproducibility:
```bash
# Pull evaluation containers (no local installation needed)
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1
docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.08.1
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1

# Run container interactively
docker run --rm -it --gpus all \
  -v $(pwd)/results:/workspace/results \
  nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 bash

# Or run evaluation directly
docker run --rm --gpus all \
  -v $(pwd)/results:/workspace/results \
  -e MY_API_KEY=your-api-key \
  nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 \
  eval-factory run_eval \
    --eval_type mmlu_pro \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_id meta/llama-3.1-8b-instruct \
    --api_key_name MY_API_KEY \
    --output_dir /workspace/results
```
Quick verification:
```bash
# Test container access
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 \
  eval-factory ls | head -5
echo "✓ Container access verified"
```
Add New Evaluation Frameworks#
You can add more evaluation methods by installing additional NVIDIA Eval Factory packages.
Prerequisites: An OpenAI-compatible model endpoint must be running and accessible.
For each package:

1. Install the required package.
2. Export any required environment variables (if specified).
3. Run the evaluation of your choice.
Below you can find examples of enabling and launching evaluations for different packages. These examples demonstrate functionality using a subset of samples; to run the evaluation on the entire dataset, remove the `limit_samples` parameter.
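Each package follows the same command shape as the container example earlier in this guide. As a rough sketch, with package- and task-specific placeholders in angle brackets (the `--overrides` flag used for `limit_samples` is an assumption to confirm against your installed CLI's `--help` output):

```bash
# Generic pattern: install the framework, export any required keys, then run the evaluation.
# Placeholders in angle brackets and the --overrides flag are assumptions.
pip install <nvidia-eval-factory-package>
export MY_API_KEY=your-api-key

eval-factory run_eval \
  --eval_type <task_name> \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --model_id meta/llama-3.1-8b-instruct \
  --api_key_name MY_API_KEY \
  --output_dir ./results \
  --overrides 'config.params.limit_samples=10'  # remove to evaluate the full dataset
```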
Install the nvidia-bfcl package:
```bash
pip install nvidia-bfcl==25.7.1
```
Run the evaluation:
Install the nvidia-eval-factory-garak package:
```bash
pip install nvidia-eval-factory-garak==25.6
```
Run the evaluation:
Install the nvidia-bigcode-eval package:
```bash
pip install nvidia-bigcode-eval==25.6
```
Run the evaluation:
Install the nvidia-simple-evals package:
```bash
pip install nvidia-simple-evals==25.7.1
```
In the example below, we use the AIME_2025 task, which follows the LLM-as-a-judge approach for checking output correctness. By default, Llama 3.3 70B NVIDIA NIM is used for judging.
To run the evaluation, set your build.nvidia.com API key as the JUDGE_API_KEY variable:

```bash
export JUDGE_API_KEY=your-api-key-here
```
To customize the judge settings, see the instructions for the NVIDIA Eval Factory package.
Run the evaluation:
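As a hedged sketch (the model endpoint and model ID below are illustrative, the flags mirror the container example earlier in this guide, and `--overrides` should be confirmed against your installed CLI):

```bash
# AIME_2025 with an LLM judge; JUDGE_API_KEY must already be exported (see above)
export MY_API_KEY=your-api-key

eval-factory run_eval \
  --eval_type AIME_2025 \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --model_id meta/llama-3.1-8b-instruct \
  --api_key_name MY_API_KEY \
  --output_dir ./results \
  --overrides 'config.params.limit_samples=10'  # remove to evaluate the full dataset
```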
Install the nvidia-safety-harness package:
```bash
pip install nvidia-safety-harness==25.6
```
Deploy the judge model
In the example below, we use the aegis_v2 task, which requires the Llama 3.1 NemoGuard 8B ContentSafety model to assess your model's responses. The model is available through NVIDIA NIM. See the instructions on deploying the judge model.
If you set up a gated judge endpoint, you must export your API key as the JUDGE_API_KEY variable:

```bash
export JUDGE_API_KEY=...
```
To access the evaluation dataset, you must authenticate with the Hugging Face Hub.
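A common way to authenticate is through the Hugging Face CLI or an access token in the environment (substitute your own token):

```bash
# Authenticate with the Hugging Face Hub so the gated evaluation dataset can be downloaded
huggingface-cli login
# or, non-interactively:
export HF_TOKEN=your-hf-access-token
```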
Run the evaluation:
Make sure to modify the judge configuration in the provided snippet to match your Llama 3.1 NemoGuard 8B ContentSafety endpoint:
```python
params={
    "extra": {
        "judge": {
            "model_id": "my-llama-3.1-nemoguard-8b-content-safety-endpoint",
            "url": "http://my-hostname:1234/v1/completions",
        }
    }
}
```
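Before launching, it can help to confirm that the judge endpoint from the snippet responds; a minimal probe of the (placeholder) endpoint might look like:

```bash
# Probe the NemoGuard judge endpoint (hostname, port, and model_id are the placeholder values above)
curl -s http://my-hostname:1234/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-llama-3.1-nemoguard-8b-content-safety-endpoint", "prompt": "Hello", "max_tokens": 1}'
```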