Safety Harness Evaluations#

The safety harness supports two academic benchmarks for Language Models (LMs). Use this evaluation type to benchmark a model’s propensity to generate harmful, biased, or misleading content, and its susceptibility to malicious attacks.

Prerequisites#

  • A model target. Refer to LLM Model Endpoint for more information.

  • A Hugging Face account token. A valid token is required to access the benchmark dataset and the base model tokenizer.

  • A content safety model as a judge. The safety harness supports Llama Nemotron Safety Guard V2 and WildGuard.


Evaluation Types#

Example Safety Harness types#

| Type | Safety Judge Model | Description |
|------|--------------------|-------------|
| Nemotron Content Safety V2 | meta/llama-3.1-nemoguard-8b-content-safety | Evaluates model safety risks based on 12 top-level hazard categories. |
| WildGuard | allenai/wildguard | Evaluates model safety risks based on the following top-level categories: privacy, misinformation, harmful language, and malicious uses. |


Nemotron Content Safety V2#

Evaluate the model using the Nemotron Content Safety V2 dataset with Llama Nemotron Safety Guard V2 as the safety judge. Related paper: AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails.

A Hugging Face token is required to access the dataset and meta-llama/Llama-3.1-8B-Instruct, which provides the base model tokenizer for Nemotron Safety Guard. An example evaluation configuration:

{
  "type": "aegis_v2",
  "params": {
    // Target model parameters
    "parallelism": 1,
    "request_timeout": 120,
    "max_retries": 2,
    "extra": {
      "hf_token": "<hf-token>",
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "<nim-base-url>/v1/completions",
            "model_id": "meta/llama-3.1-nemoguard-8b-content-safety",
            "api_key": "<OPTIONAL>"
          }
        },
        // Judge model parameters
        "params": {
          "parallelism": 10,
          "request_timeout": 60,
          "max_retries": 2
        }
      }
    }
  }
}

Example results:

{
  "tasks": {
    "aegis_v2": {
      "metrics": {
        "safety_rate": {
          "scores": {
            "safety_rate": {
              "value": 0.85,
              "stats": {}
            }
          }
        }
      }
    }
  }
}
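
To launch the evaluation, submit the target and the configuration above to the Evaluator API. The following is a minimal sketch, assuming the NeMo Evaluator jobs endpoint accepts an inline target and config; the endpoint path, payload shape, and target model details are assumptions to adapt to your deployment:

# A hedged sketch of job submission; verify the endpoint path and payload
# shape against your NeMo Evaluator API reference.
curl "<evaluator-base-url>/v1/evaluation/jobs" \
  -H 'Content-Type: application/json' \
  -d '{
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "<target-base-url>/v1/completions",
          "model_id": "<target-model-id>"
        }
      }
    },
    "config": {
      "type": "aegis_v2",
      "params": {
        "extra": {
          "hf_token": "<hf-token>",
          "judge": {
            "model": {
              "api_endpoint": {
                "url": "<nim-base-url>/v1/completions",
                "model_id": "meta/llama-3.1-nemoguard-8b-content-safety"
              }
            }
          }
        }
      }
    }
  }'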

Evaluating with the content safety dataset requires Llama Nemotron Safety Guard V2 as the judge model. Run the following command to deploy it as an NVIDIA Inference Microservice (NIM) using the NeMo Deployment Management Service:

curl http://nemo.test/v1/deployment/model-deployments \
  -H 'Content-Type: application/json' \
  -d @nemotron-safety-guard.json

where nemotron-safety-guard.json contains:

{
  "name": "llama-3.1-nemoguard-8b-content-safety",
  "namespace": "meta",
  "config": {
    "model": "meta/llama-3.1-nemoguard-8b-content-safety",
    "nim_deployment": {
      "disable_lora_support": true,
      "additional_envs": {
        "NIM_GUIDED_DECODING_BACKEND": "outlines"
      },
      "gpu": 1,
      "image_name": "nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-content-safety",
      "image_tag": "1.10.1"
    }
  }
}
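
Deployment can take several minutes while the NIM image is pulled and the model loads. As a sketch, assuming the Deployment Management Service supports GET on the same path, you can poll the deployment status until it reports ready:

# Assumed status check; verify the exact path against your
# NeMo Deployment Management Service API reference.
curl http://nemo.test/v1/deployment/model-deployments/meta/llama-3.1-nemoguard-8b-content-safety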

WildGuard#

Evaluate the model using the WildGuardMix dataset and the WildGuard model as a safety judge. Related paper: WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

A Hugging Face token is required to access the dataset and mistralai/Mistral-7B-v0.3, which provides the base model tokenizer for WildGuard. An example evaluation configuration:

{
  "type": "wildguard",
  "params": {
    // Target model parameters
    "parallelism": 1,
    "request_timeout": 120,
    "max_retries": 2,
    "extra": {
      "hf_token": "<hf-token>",
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "<deployed-wildguard-url>/v1/completions",
            "model_id": "allenai/wildguard",
            "api_key": "<OPTIONAL>"
          }
        },
        // Judge model parameters
        "params": {
          "parallelism": 10,
          "request_timeout": 60,
          "max_retries": 2
        }
      }
    }
  }
}

Example results:

{
  "tasks": {
    "wildguard": {
      "metrics": {
        "safety_rate": {
          "scores": {
            "safety_rate": {
              "value": 0.85,
              "stats": {}
            }
          }
        }
      }
    }
  }
}

Evaluating with WildGuard requires the WildGuard judge model. Below are examples of deploying WildGuard using Docker and Kubernetes.

Docker

Run the WildGuard safety judge model with the vllm/vllm-openai Docker container. Visit vLLM Using Docker for more information.

export HF_TOKEN=<hf-token>

docker run -it --gpus all \
  -p 8001:8000 \
  -e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
  vllm/vllm-openai:v0.8.5 \
  --model allenai/wildguard
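
Once the server is up, you can smoke-test the endpoint from the host; port 8001 maps to the container’s port 8000 in the command above. The prompt below is a placeholder, not a real WildGuard classification prompt:

curl http://localhost:8001/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "allenai/wildguard", "prompt": "Hello", "max_tokens": 8}'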

Kubernetes

The WildGuard safety judge model can be deployed to Kubernetes with the vllm/vllm-openai Docker container. Visit vLLM Using Kubernetes for more information.

Run the commands below to create a secret for your Hugging Face token and deploy the model to your Kubernetes cluster.

export HF_TOKEN=<hf-token>

kubectl create secret generic hf-token-secret --from-literal=token=${HF_TOKEN}
kubectl apply -f model.yaml

where model.yaml contains:

apiVersion: v1
kind: Pod
metadata:
  name: allenai-wildguard
  labels:
    app: allenai-wildguard
spec:
  volumes:
  # vLLM needs to access the host's shared memory for tensor parallel inference.
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: "2Gi"
  containers:
  - name: model
    image: vllm/vllm-openai:v0.8.5
    command: ["/bin/sh", "-c"]
    args: [
      "vllm serve allenai/wildguard --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
    ]
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token
    - name: USE_FASTSAFETENSOR
      value: "true"
    ports:
    - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      privileged: true
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 180
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 180
      periodSeconds: 5
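
Before pointing the evaluation config at the pod, you can verify that it is serving. One way, assuming kubectl access to the cluster, is to port-forward the pod and query the vLLM health endpoint:

# In one terminal: forward a local port to the pod's serving port.
kubectl port-forward pod/allenai-wildguard 8001:8000

# In another terminal: the health endpoint returns HTTP 200 when ready.
curl http://localhost:8001/health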

Parameters#

Request Parameters#

These parameters control how requests are made to the target model or judge model:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| max_retries | Maximum number of retries for failed model inference requests. | Integer | target model: 5, judge model: 16 |
| parallelism | Number of parallel requests to improve throughput. | Integer | target model: 8, judge model: 32 |
| request_timeout | Timeout in seconds for each request. | Integer | target model: 30, judge model: 60 |
| limit_samples | Limit the number of samples to evaluate. Useful for testing (see the example below). Not available as a judge model parameter. | Integer | null (all samples) |
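
For example, to smoke-test a configuration on a handful of samples before a full run, set limit_samples alongside the other request parameters. This is a config fragment; merge it into the params section of the full examples above:

"params": {
  "limit_samples": 10,
  "parallelism": 1,
  "request_timeout": 120,
  "max_retries": 2
}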

Target Model Parameters#

These parameters control the target model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
|------|-------------|------|---------|--------------|
| temperature | Sampling temperature for generation. | Float | 0.6 | 0.0–2.0 |
| top_p | Nucleus sampling parameter. | Float | 0.95 | 0.01–1.0 |
| max_tokens | Maximum number of tokens to generate. | Integer | 6144 | |
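
These fields sit directly under params, next to the request parameters. A fragment using the defaults from the table:

"params": {
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 6144
}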

Extra Parameters#

Set these parameters in the params.extra section:

| Name | Description | Type | Valid Values |
|------|-------------|------|--------------|
| hf_token | Hugging Face token used to access datasets and tokenizers. Required. | String | A valid Hugging Face token |
| judge.model | The judge model that evaluates the target model’s output for safety risks. Required. | String or Dictionary | "meta/llama-3.1-nemoguard-8b-content-safety" or {"api_endpoint": {"model_id": "meta/llama-3.1-nemoguard-8b-content-safety", "url": "<nim-url>/v1/completions"}} |
| judge.params | Request parameters that control how requests are made to the judge model. | Request Parameters | {"parallelism": 2} |

Metrics#

Core Supported Metrics in the Safety Harness#

| Metric Name | Description | Value Range | Notes |
|-------------|-------------|-------------|-------|
| safety_rate | Fraction of model outputs categorized as safe. | 0.0–1.0 | Requires a safety judge |

Troubleshooting#

See Troubleshooting NeMo Evaluator for general troubleshooting steps for failed evaluation jobs. This section covers common issues specific to the safety harness.

Hugging Face Error#

Evaluations with the safety harness require Hugging Face access to the respective dataset and model tokenizer. If your job fails with errors like the following, visit https://huggingface.co/, log in, and request access to the dataset or model.

datasets.exceptions.DatasetNotFoundError: Dataset 'allenai/wildguardmix' is a gated dataset on the Hub. Visit the dataset page at https://huggingface.co/datasets/allenai/wildguardmix to ask for access.
GatedRepoError: 403 Client Error.

Cannot access gated repo for url https://huggingface.co/<model>/resolve/main/tokenizer_config.json.
Your request to access model <model> is awaiting a review from the repo authors.
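
To confirm that the token itself is valid before re-running the job, you can query Hugging Face with it. This sketch assumes the huggingface_hub Python package is installed; note that a valid token alone is not sufficient, because gated repositories also require approved access on their pages:

python -c "from huggingface_hub import whoami; print(whoami(token='<hf-token>'))"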

Incompatible Judge Model#

Using an unsupported judge model results in a job error. The aegis_v2 evaluation type requires the Llama Nemotron Safety Guard V2 judge, and the wildguard evaluation type requires the allenai/wildguard judge. A KeyError like the following is a typical symptom of a mismatched judge model:

Metrics calculated


        Evaluation Metrics         
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Safety Category ┃ Average Count ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│           ERROR │           5.0 │
└─────────────────┴───────────────┘

...

Subprocess finished with return code: 0
{'ERROR': 5.0}
Traceback (most recent call last):
...
"/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/__init__.py", line 14, in parse_output
    return parse_output(output_dir)
  File "/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/output.py", line 16, in parse_output
    safety_rate = data['safe'] / sum(data.values())
KeyError: 'safe'

Unexpected Reasoning Traces#

Safety evaluations do not support reasoning traces; target model responses that contain them cause the job error below.

ERROR    There are  at least 2 MUT (model under test) responses that start with <think>. Reasoning traces should not be evaluated. Exiting.

If the target model outputs reasoning traces such as <think>reasoning context</think>, set the target model’s prompt.reasoning_params.end_token so that only the text after the final reasoning token is evaluated. Also set config.params.max_tokens high enough for the model’s chain of thought to conclude with the expected end token; otherwise the reasoning content cannot be stripped before evaluation. For example:

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {},
      "prompt": {
        "reasoning_params": {
          "end_token": "</think>"
        }
      }
    }
  }
}
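
On the config side, a companion fragment capping generation so the trace can finish with the end token. The value 6144 is the documented default for max_tokens; tune it to your model’s typical chain-of-thought length:

{
  "type": "aegis_v2",
  "params": {
    "max_tokens": 6144
  }
}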