Add Evaluation Packages to NeMo Framework#

The NeMo Framework Docker image comes with nvidia-lm-eval pre-installed. However, you can add more evaluation methods by installing additional NeMo Evaluator packages.
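
For example, you can confirm the pre-installed package and add another one from a shell inside the container (nvidia-simple-evals is used here purely as an illustration; the sections below list the exact package for each evaluation harness):

# Confirm that the pre-installed evaluation package is available
pip show nvidia-lm-eval

# Install an additional NeMo Evaluator package, for example nvidia-simple-evals
pip install nvidia-simple-evals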

For each package, follow these steps:

  1. Install the required package.

  2. Deploy your model:

CHECKPOINT_PATH="/checkpoints/llama-3_2-1b-instruct_v2.0"

python \
  /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
  --nemo_checkpoint ${CHECKPOINT_PATH} \
  --model_id megatron_model \
  --port 8080 \
  --host 0.0.0.0

Wait for the server to start and become ready to accept requests:

from nemo_evaluator.api import check_endpoint
check_endpoint(
    endpoint_url="http://0.0.0.0:8080/v1/completions/",
    endpoint_type="completions",
    model_name="megatron_model",
)

Make sure to open two separate terminals within the same container: one for the deployment and one for the evaluation.
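
One way to open the second terminal is to attach another shell to the running container (a minimal sketch, assuming you launched the container with Docker and gave it a name; "nemo-fw" below is a placeholder):

# Attach a second interactive shell to the already running container.
# Replace "nemo-fw" with your container's actual name or ID (check "docker ps").
docker exec -it nemo-fw bash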

  3. (Optional) Export the required environment variables (see the example after these steps).

  4. Run the evaluation of your choice.
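
For step 3, the variables you need depend on the chosen evaluation. As a hedged sketch based on the examples below: evaluations that use a judge behind a gated endpoint expect JUDGE_API_KEY, and gated Hugging Face datasets can be accessed through a standard Hugging Face token (the HF_TOKEN name is the usual Hugging Face convention, not specific to NeMo Evaluator):

# Only needed for evaluations that use a judge behind a gated endpoint
export JUDGE_API_KEY=...

# Only needed for evaluations whose datasets are gated on the Hugging Face Hub
export HF_TOKEN=...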

The examples below show how to enable and launch evaluations for each package.

Tip

All examples below use only a subset of samples. To run the evaluation on the whole dataset, remove the limit_samples parameter.

Enable On-Demand Evaluation Packages#

Note

If multiple harnesses are installed in your environment and they define a task with the same name, you must use the <harness>.<task> format to avoid ambiguity. For example:

eval_config = EvaluationConfig(type="lm-evaluation-harness.mmlu")
eval_config = EvaluationConfig(type="simple-evals.mmlu")

BFCL#

  1. Install the nvidia-bfcl package:

pip install nvidia-bfcl
  2. Run the evaluation:

from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"


target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(url=chat_url, type=EndpointType.CHAT, model_id=model_name)
)
eval_config = EvaluationConfig(
    type="bfclv3_ast_prompting",
    output_dir="/results/",
    params=ConfigParams(limit_samples=10, temperature=0, top_p=0, parallelism=1),
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)


print(results)
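
In addition to the returned results object, the evaluation writes its artifacts under the configured output_dir. A quick way to inspect them after the run (using the /results/ path from the config above):

# List the files the evaluation wrote under the configured output_dir
ls -R /results/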

Garak#

  1. Install the nvidia-eval-factory-garak package:

pip install nvidia-eval-factory-garak
  2. Run the evaluation:

from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(url=chat_url, type=EndpointType.CHAT, model_id=model_name)
)
eval_config = EvaluationConfig(
    type="garak",
    output_dir="/results/",
    params=ConfigParams(
        limit_samples=10,
        temperature=0,
        top_p=0,
        parallelism=1,
        extra={"probes": "ansiescape.AnsiEscaped"},  # remove to run with all probes
    ),
)


results = evaluate(target_cfg=target_config, eval_cfg=eval_config)


print(results)

BigCode Evaluation Harness#

  1. Install the nvidia-bigcode-eval package:

pip install nvidia-bigcode-eval
  2. Run the evaluation:

from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"


target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(url=chat_url, type=EndpointType.CHAT, model_id=model_name)
)
eval_config = EvaluationConfig(
    type="mbpp",
    output_dir="/results/",
    params=ConfigParams(limit_samples=10, temperature=0, top_p=0, parallelism=1),
)


results = evaluate(target_cfg=target_config, eval_cfg=eval_config)


print(results)

Simple Evals#

  1. Install the nvidia-simple-evals package:

pip install nvidia-simple-evals

In the example below, we use the AIME_2025 task, which follows the LLM-as-a-judge approach to check output correctness. By default, the Llama 3.3 70B NVIDIA NIM is used as the judge.

  2. To run the evaluation, set your build.nvidia.com API key as the JUDGE_API_KEY variable:

export JUDGE_API_KEY=...

To customize the judge settings, see the instructions for the NVIDIA Eval Factory package.

  3. Run the evaluation:

from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"


target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(url=chat_url, type=EndpointType.CHAT, model_id=model_name)
)

eval_config = EvaluationConfig(
    type="AIME_2025",
    output_dir="/results/",
    params=ConfigParams(limit_samples=10, temperature=0, top_p=0, parallelism=1),
)
results = evaluate(target_cfg=target_config, eval_cfg=eval_config)


print(results)

Safety Harness#

  1. Install the nvidia-safety-harness package:

pip install nvidia-safety-harness
  2. Deploy the judge model.

In the example below, we use the aegis_v2 task, which requires the Llama 3.1 NemoGuard 8B ContentSafety model to assess your model’s responses.

The model is available through NVIDIA NIM. See the instructions on deploying the judge model.
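
As a rough, hedged sketch of the usual NIM launch pattern (the exact container image, hardware requirements, and flags come from the linked instructions; the image path below is a placeholder, and host port 9000 is chosen only to match the judge URL used in the evaluation snippet further down):

# Placeholder image path: take the real one from the NIM instructions linked above.
JUDGE_NIM_IMAGE=<judge-nim-image>

# Typical NIM launch: GPU access, NGC credentials, and a published port.
# Host port 9000 matches the judge URL in the evaluation config below.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -p 9000:8000 \
  ${JUDGE_NIM_IMAGE}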

If you set up a gated judge endpoint, you must export your API key as the JUDGE_API_KEY variable:

export JUDGE_API_KEY=...
  3. To access the evaluation dataset, you must authenticate with the Hugging Face Hub.
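
One standard way to authenticate (an illustrative sketch; any Hugging Face Hub authentication method works, as long as the token has access to the gated dataset):

# Interactive login that stores your Hugging Face token locally
huggingface-cli login

# Or export a token for non-interactive use
export HF_TOKEN=...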

  4. Run the evaluation:

from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

model_name = "megatron_model"
chat_url = "http://0.0.0.0:8080/v1/chat/completions/"


target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(url=chat_url, type=EndpointType.CHAT, model_id=model_name)
)
eval_config = EvaluationConfig(
    type="aegis_v2",
    output_dir="/results/",
    params=ConfigParams(
        limit_samples=10,
        temperature=0,
        top_p=0,
        parallelism=1,
        extra={
            "judge": {  # adjust to your judge endpoint
                "model_id": "llama-nemotron-safety-guard-v2",
                "url": "http://0.0.0.0:9000/v1/completions",
            }
        },
    ),
)


results = evaluate(target_cfg=target_config, eval_cfg=eval_config)


print(results)

Make sure to modify the judge configuration in the provided snippet to match your Llama 3.1 NemoGuard 8B ContentSafety endpoint.