Add On-Demand Evaluation Packages#

This guide explains how to extend the NeMo evaluation environment by adding optional NVIDIA Eval Factory packages. It walks through installation, setup, and execution steps for various packages such as BFCL, garak, BigCode, simple-evals, and safety-harness, each enabling specialized model assessments.

The following diagram illustrates the architecture of the Eval ecosystem, showing how different evaluation harnesses integrate with NeMo Eval:

┌──────────────────────┐
│                      │
│lm-evaluation-harness ◄─────────┐
│                      │         │      ┌─────────────────────┐
└──────────────────────┘         │      │                     │
                                 │      │ NVIDIA Eval Factory │
┌──────────────────────┐         │      │ =================== │
│                      ◄─────────┼──────┼                     ◄───┐      ┌────────────────────┐
│     simple-evals     │    packaging   │ unified interface   │ model    │                    │
│                      │    of eval     │ for LLM evaluation  │ querying │     NeMo Eval      │
└──────────────────────┘    harnesses   │                     │   └──────┼ ================== │
                                 │      └─────────────────────┘          │                    │
  .                              │                                       │   LLM evaluation   │
  .                   ◄──────────┼                                ┌──────┼ with server-client │
  .                              │      ┌────────────────────┐  model    │      approach      │
                                 │      │                    │  serving  │                    │
                                 │      │ NeMo Export-Deploy │    │      └────────────────────┘
┌──────────────────────┐         │      │ ================== │    │
│                      │         │      │                    │    │
│        garak         ◄─────────┘      │  model deployment  ◄────┘
│                      │                │  for NeMo and HF   │
└──────────────────────┘                │                    │
                                        └────────────────────┘

Add New Evaluation Frameworks#

The NeMo Framework Docker image comes with nvidia-lm-eval pre-installed. However, you can add more evaluation methods by installing additional NVIDIA Eval Factory packages.

For each package, follow these steps:

  1. Install the required package.

  2. Deploy your model:

# File deploy.py

from nemo_eval.api import deploy

# Path to the NeMo checkpoint to serve.
CHECKPOINT_PATH = "/checkpoints/llama-3_2-1b-instruct_v2.0"

if __name__ == "__main__":
    # Deploy the checkpoint; the resulting server exposes the endpoints
    # used by the evaluation examples below.
    deploy(
        nemo_checkpoint=CHECKPOINT_PATH,
        max_input_len=8192,
    )

Run the deployment in the background:

python deploy.py

Make sure to open two separate terminals within the same container: one to run the deployment and one to run the evaluation.
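
Before launching an evaluation from the second terminal, you can confirm that the endpoint is up. The following is a minimal sketch, assuming the deployment exposes the OpenAI-compatible chat endpoint at http://0.0.0.0:8080/v1/chat/completions/ used in the examples below and serves the model under the name megatron_model; the requests package is used here only for illustration.

import time

import requests

CHAT_URL = "http://0.0.0.0:8080/v1/chat/completions/"

# Minimal chat request used only to probe readiness (assumed model name).
payload = {
    "model": "megatron_model",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 1,
}

for _ in range(60):
    try:
        if requests.post(CHAT_URL, json=payload, timeout=10).status_code == 200:
            print("Endpoint is ready.")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(10)
else:
    print("Endpoint did not become ready in time.")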

  3. (Optional) Export the required environment variables.

  4. Run the evaluation of your choice.

Below you can find examples for enabling and launching evaluations with the different packages. Note that all examples use only a subset of samples. To run an evaluation on the whole dataset, remove the "limit_samples" parameter.
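
For example, the same evaluation configuration with and without the sample limit might look as follows (a sketch based on the EvaluationConfig usage shown in the sections below; it assumes params can simply be omitted for a full run):

from nemo_eval.utils.api import EvaluationConfig

# Quick smoke test on a small subset of the dataset.
smoke_test_config = EvaluationConfig(type="mbpp", output_dir="/results/", params={"limit_samples": 10})

# Full run: leave out "limit_samples" to evaluate on the whole dataset.
full_run_config = EvaluationConfig(type="mbpp", output_dir="/results/")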

Enable BFCL#

  1. Install the nvidia-bfcl package:

pip install nvidia-bfcl==25.6

  2. Run the evaluation:

from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

model_name = "megatron_model"
# Chat endpoint exposed by the deployment started with deploy.py.
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"

target_config = EvaluationTarget(
    api_endpoint={
        "url": chat_url,
        "type": "chat",
    }
)
eval_config = EvaluationConfig(type="bfclv3_ast", output_dir="/results/", params={"limit_samples": 10})

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

print(results)

Enable garak#

  1. Install the nvidia-eval-factory-garak package:

pip install nvidia-eval-factory-garak==25.6

  2. Run the evaluation:

from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"

target_config = EvaluationTarget(
    api_endpoint={
        "url": chat_url,
        "type": "chat",
    }
)
eval_config = EvaluationConfig(
    type="garak",
    output_dir="/results/",
    params={"extra": {"probes": "ansiescape.AnsiEscaped"}},
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

print(results)

Enable BigCode#

  1. Install the nvidia-bigcode-eval package:

pip install nvidia-bigcode-eval==25.6

  2. Run the evaluation:

from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"

target_config = EvaluationTarget(
    api_endpoint={
        "url": chat_url,
        "type": "chat",
    }
)
eval_config = EvaluationConfig(type="mbpp", output_dir="/results/", params={"limit_samples": 10})

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

print(results)

Enable simple-evals#

  1. Install the nvidia-simple-evals package:

pip install nvidia-simple-evals==25.6

In the example below, we use the AIME_2025 task, which follows the LLM-as-a-judge approach to check output correctness. By default, Llama 3.3 70B NVIDIA NIM is used as the judge.

  2. To run the evaluation, set your build.nvidia.com API key as the JUDGE_API_KEY variable:

export JUDGE_API_KEY=...

To customize the judge settings, see the instructions for the NVIDIA Eval Factory package.
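
If you prefer to configure the key from Python instead of the shell, you can set it in the environment of the process that runs the evaluation (a minimal sketch; the placeholder value is yours to substitute):

import os

# Placeholder: substitute your build.nvidia.com API key.
os.environ["JUDGE_API_KEY"] = "<your-api-key>"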

  3. Run the evaluation:

from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"

target_config = EvaluationTarget(
    api_endpoint={
        "url": chat_url,
        "type": "chat",
    }
)
eval_config = EvaluationConfig(
    type="AIME_2025",
    output_dir="/results/",
    params={"limit_samples": 10},
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

print(results)

Enable safety-harness#

  1. Install the nvidia-safety-harness package:

pip install nvidia-safety-harness==25.6

  2. Deploy the judge model.

In the example below, we use the aegis_v2 task, which requires the Llama 3.1 NemoGuard 8B ContentSafety model to assess your model’s responses.

The model is available through NVIDIA NIM. See the instructions on deploying the judge model.

If you set up a gated judge endpoint, you must export your API key as the JUDGE_API_KEY variable:

export JUDGE_API_KEY=...

  3. To access the evaluation dataset, you must authenticate with the Hugging Face Hub, for example as shown in the sketch below.
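
You can log in with the huggingface_hub client before starting the run (a sketch; running huggingface-cli login in the terminal works as well, and the token below is a placeholder):

from huggingface_hub import login

# Placeholder: substitute your Hugging Face access token.
login(token="<your-hf-token>")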

  4. Run the evaluation:

from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationConfig, EvaluationTarget

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"

target_config = EvaluationTarget(
    api_endpoint={
        "url": chat_url,
        "type": "chat",
    }
)
eval_config = EvaluationConfig(
    type="aegis_v2",
    output_dir="/results/",
    params={
        "limit_samples": 10,
        "extra": {
            "judge": {
                "model_id": "llama-nemotron-safety-guard-v2",
                "url": "http://0.0.0.0:9000/v1/completions",
            }
        },
    },
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

print(results)

Make sure to modify the judge configuration in the provided snippet to match your Llama 3.1 NemoGuard 8B ContentSafety endpoint:

    params={
        "extra": {
            "judge": {
                "model_id": "my-llama-3.1-nemoguard-8b-content-safety-endpoint",
                "url": "http://my-hostname:1234/v1/completions",
            }
        }
    }
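
To verify that the judge endpoint is reachable before starting a long run, you can send a minimal completion request (a sketch, assuming an OpenAI-compatible /v1/completions endpoint and reusing the hypothetical host and model name from the snippet above):

import requests

JUDGE_URL = "http://my-hostname:1234/v1/completions"

# Minimal completion request used only to check connectivity.
payload = {
    "model": "my-llama-3.1-nemoguard-8b-content-safety-endpoint",
    "prompt": "Hello",
    "max_tokens": 1,
}

response = requests.post(JUDGE_URL, json=payload, timeout=30)
print(response.status_code)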