Installation Guide#
NeMo Evaluator provides multiple installation paths depending on your needs. Choose the approach that best fits your use case.
Choose Your Installation Path#
| Installation Path | Best For | Key Features |
|---|---|---|
| NeMo Evaluator Launcher (Recommended) | Most users who want unified CLI and orchestration across backends | Unified CLI for 100+ benchmarks |
| NeMo Evaluator Core | Developers building custom evaluation pipelines | Programmatic Python API |
| Container Direct | Users who prefer container-based workflows | Pre-built NGC evaluation containers |
Prerequisites#
System Requirements#
- Python 3.10 or higher (supports 3.10, 3.11, 3.12, and 3.13)
- CUDA-compatible GPU(s) (tested on RTX A6000, A100, H100)
- Docker (for container-based workflows)

Recommended Environment#

- Python 3.12
- PyTorch 2.7
- CUDA 12.9
- Ubuntu 24.04
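If you want to confirm your environment before installing, a quick check along these lines can help (it only assumes the tools listed above are on your PATH):

```bash
# Verify the prerequisites listed above
python3 --version   # expect Python 3.10 or newer
nvidia-smi          # confirms the GPU and CUDA driver are visible
docker --version    # only needed for container-based workflows
```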
Installation Methods#
Use pip#
Install NeMo Evaluator Launcher for unified CLI and orchestration:
```bash
# Create and activate virtual environment
python3 -m venv nemo-eval-env
source nemo-eval-env/bin/activate

# Install launcher with all exporters (recommended)
pip install nemo-evaluator-launcher[all]
```
Quick verification:
```bash
# Verify installation
nv-eval --version

# Test basic functionality - list available tasks
nv-eval ls tasks | head -10
```
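Because the task list is long, you can filter it with standard shell tools; for example (the `mmlu` filter string is just an illustration):

```bash
# Narrow the task list to a benchmark family of interest
nv-eval ls tasks | grep -i mmlu
```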
Install NeMo Evaluator Core for programmatic access:
```bash
# Create and activate virtual environment
python3 -m venv nemo-eval-env
source nemo-eval-env/bin/activate

# Install core library with dependencies
pip install torch==2.7.0 setuptools pybind11 wheel_stub  # Required for TE
pip install --no-build-isolation nemo-evaluator

# Install evaluation frameworks
pip install nvidia-simple-evals nvidia-lm-eval
```
Quick verification (the `nemo_evaluator` module name is assumed from the pip package name):

```python
# Quick import check of the core library
import nemo_evaluator

print("✓ Core library installed successfully")
print("✓ Adapter system ready")
```
Use containers#
Use pre-built evaluation containers from NVIDIA NGC for guaranteed reproducibility:
```bash
# Pull evaluation containers (no local installation needed)
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.08.1
docker pull nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.08.1
docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:25.08.1

# Run container interactively
docker run --rm -it --gpus all \
  -v $(pwd)/results:/workspace/results \
  nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 bash

# Or run evaluation directly
docker run --rm --gpus all \
  -v $(pwd)/results:/workspace/results \
  -e MY_API_KEY=your-api-key \
  nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 \
  eval-factory run_eval \
    --eval_type mmlu_pro \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_id meta/llama-3.1-8b-instruct \
    --api_key_name MY_API_KEY \
    --output_dir /workspace/results
```
Quick verification:
```bash
# Test container access
docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:25.08.1 \
  eval-factory ls | head -5
echo "✓ Container access verified"
```
Add New Evaluation Frameworks#
You can add more evaluation methods by installing additional NVIDIA Eval Factory packages.
Prerequisites: An OpenAI-compatible model endpoint must be running and accessible.
For each package:

1. Install the required package.
2. Export any required environment variables (if specified).
3. Run the evaluation of your choice.
Below you can find examples of enabling and launching evaluations for different packages. These examples demonstrate functionality using a subset of samples; to run the evaluation on the entire dataset, remove the `limit_samples` parameter.
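Each package follows the same command shape as the container example earlier in this guide. As a rough sketch, with package- and task-specific placeholders in angle brackets (the `--overrides` flag used for `limit_samples` is an assumption to confirm against your installed CLI's `--help` output):

```bash
# Generic pattern: install the framework, export any required keys, then run the evaluation.
# Placeholders in angle brackets and the --overrides flag are assumptions.
pip install <nvidia-eval-factory-package>
export MY_API_KEY=your-api-key

eval-factory run_eval \
  --eval_type <task_name> \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --model_id meta/llama-3.1-8b-instruct \
  --api_key_name MY_API_KEY \
  --output_dir ./results \
  --overrides 'config.params.limit_samples=10'  # remove to evaluate the full dataset
```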
Install the nvidia-bfcl package:
```bash
pip install nvidia-bfcl==25.7.1
```
Run the evaluation:
Install the nvidia-eval-factory-garak package:
```bash
pip install nvidia-eval-factory-garak==25.6
```
Run the evaluation:
Install the nvidia-bigcode-eval package:
```bash
pip install nvidia-bigcode-eval==25.6
```
Run the evaluation:
Install the nvidia-simple-evals package:
```bash
pip install nvidia-simple-evals==25.7.1
```
In the example below, we use the AIME_2025 task, which follows the LLM-as-a-judge approach for checking output correctness. By default, Llama 3.3 70B NVIDIA NIM is used for judging.
To run the evaluation, set your build.nvidia.com API key as the JUDGE_API_KEY variable:

```bash
export JUDGE_API_KEY=your-api-key-here
```
To customize the judge settings, see the instructions for the NVIDIA Eval Factory package.
Run the evaluation:
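As a hedged sketch (the model endpoint and model ID below are illustrative, the flags mirror the container example earlier in this guide, and `--overrides` should be confirmed against your installed CLI):

```bash
# AIME_2025 with an LLM judge; JUDGE_API_KEY must already be exported (see above)
export MY_API_KEY=your-api-key

eval-factory run_eval \
  --eval_type AIME_2025 \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --model_id meta/llama-3.1-8b-instruct \
  --api_key_name MY_API_KEY \
  --output_dir ./results \
  --overrides 'config.params.limit_samples=10'  # remove to evaluate the full dataset
```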
Install the nvidia-safety-harness package:
```bash
pip install nvidia-safety-harness==25.6
```
Deploy the judge model
In the example below, we use the aegis_v2 task, which requires the Llama 3.1 NemoGuard 8B ContentSafety model to assess your model's responses. The model is available through NVIDIA NIM. See the instructions on deploying the judge model.
If you set up a gated judge endpoint, you must export your API key as the JUDGE_API_KEY variable:

```bash
export JUDGE_API_KEY=...
```
To access the evaluation dataset, you must authenticate with the Hugging Face Hub.
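A common way to authenticate is through the Hugging Face CLI or an access token in the environment (substitute your own token):

```bash
# Authenticate with the Hugging Face Hub so the gated evaluation dataset can be downloaded
huggingface-cli login
# or, non-interactively:
export HF_TOKEN=your-hf-access-token
```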
Run the evaluation:
Make sure to modify the judge configuration in the provided snippet to match your Llama 3.1 NemoGuard 8B ContentSafety endpoint:
```python
params={
    "extra": {
        "judge": {
            "model_id": "my-llama-3.1-nemoguard-8b-content-safety-endpoint",
            "url": "http://my-hostname:1234/v1/completions",
        }
    }
}
```
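Before launching, it can help to confirm that the judge endpoint from the snippet responds; a minimal probe of the (placeholder) endpoint might look like:

```bash
# Probe the NemoGuard judge endpoint (hostname, port, and model_id are the placeholder values above)
curl -s http://my-hostname:1234/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-llama-3.1-nemoguard-8b-content-safety-endpoint", "prompt": "Hello", "max_tokens": 1}'
```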