Safety Harness Evaluations#
The safety harness supports two academic benchmarks for language models (LMs). Use this evaluation type to benchmark a model’s propensity to generate harmful, biased, or misleading content and its susceptibility to malicious attacks.
Prerequisites#
A model target. Refer to LLM Model Endpoint for more information.
A Hugging Face account token. A valid Hugging Face token is required to access the benchmark dataset and base model tokenizer.
A content safety model as a judge. The safety harness supports Llama Nemotron Safety Guard V2 and WildGuard.
Evaluation Types#
Type | Safety Judge Model | Description
---|---|---
Nemotron Content Safety V2 | Llama Nemotron Safety Guard V2 | Evaluates model safety risks based on 12 top-level hazard categories.
WildGuard | WildGuard | Evaluates model safety risks based on the following top-level categories: privacy, misinformation, harmful language, and malicious uses.
Nemotron Content Safety V2#
Evaluate the model using the Nemotron Content Safety V2 dataset and Llama Nemotron Safety Guard V2 as the safety judge. Related paper: AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails
A Hugging Face token is required to access the dataset and meta-llama/Llama-3.1-8B-Instruct for the base model tokenizer of Nemotron Safety Guard.
{
"type": "aegis_v2",
"params": {
// Target model parameters
"parallelism": 1,
"request_timeout": 120,
"max_retries": 2,
"extra": {
"hf_token": "<hf-token>",
"judge": {
"model": {
"api_endpoint": {
"url": "<nim-base-url>/v1/completions",
"model_id": "meta/llama-3.1-nemoguard-8b-content-safety",
"api_key": "<OPTIONAL>"
}
},
// Judge model parameters
"params": {
"parallelism": 10,
"request_timeout": 60,
"max_retries": 2
}
}
}
}
}
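After saving this configuration, submit it as an evaluation job against your model target. The sketch below assumes the NeMo Evaluator jobs endpoint is reachable at http://nemo.test/v1/evaluation/jobs and that a model target named <target-namespace>/<target-name> already exists; the wrapper file name aegis-v2-job.json is illustrative, so adjust the URL, target reference, and file names for your deployment.
# aegis-v2-job.json wraps the evaluation config shown above, for example:
# {
#   "target": "<target-namespace>/<target-name>",
#   "config": { "type": "aegis_v2", "params": { ... } }
# }
curl http://nemo.test/v1/evaluation/jobs \
  -H 'Content-Type: application/json' \
  -d @aegis-v2-job.json
A completed job reports the aegis_v2 task metrics in the following form: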
{
"tasks": {
"aegis_v2": {
"metrics": {
"safety_rate": {
"scores": {
"safety_rate": {
"value": 0.85,
"stats": {}
}
}
}
}
}
}
}
Evaluating with the content safety dataset requires Llama Nemotron Safety Guard V2 as the judge model. Run the following command to deploy the model as an NVIDIA Inference Microservice (NIM) using the NeMo Deployment Management Service, with the nemotron-safety-guard.json payload shown below:
curl http://nemo.test/v1/deployment/model-deployments \
-H 'Content-Type: application/json' \
-d @nemotron-safety-guard.json
{
"name": "llama-3.1-nemoguard-8b-content-safety",
"namespace": "meta",
"config": {
"model": "meta/llama-3.1-nemoguard-8b-content-safety",
"nim_deployment": {
"disable_lora_support": true,
"additional_envs": {
"NIM_GUIDED_DECODING_BACKEND": "outlines"
},
"gpu": 1,
"image_name": "nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-content-safety",
"image_tag": "1.10.1"
}
}
}
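Before starting the evaluation, confirm that the judge NIM is deployed and reachable. The commands below are a sketch: the first assumes the Deployment Management Service exposes deployment status at /v1/deployment/model-deployments/{namespace}/{name}, and the second uses the OpenAI-compatible /v1/models route of the deployed NIM, where <nim-base-url> is the same base URL used in the evaluation config.
# Check the deployment status (assumed DMS status route)
curl http://nemo.test/v1/deployment/model-deployments/meta/llama-3.1-nemoguard-8b-content-safety

# Verify the judge endpoint responds and lists the content safety model
curl <nim-base-url>/v1/models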
WildGuard#
Evaluate the model using the WildGuardMix dataset and the WildGuard model as a safety judge. Related paper: WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
A Hugging Face token is required to access the dataset and mistralai/Mistral-7B-v0.3 for the base model tokenizer of WildGuard.
{
"type": "wildguard",
"params": {
// Target model parameters
"parallelism": 1,
"request_timeout": 120,
"max_retries": 2,
"extra": {
"hf_token": "<hf-token>",
"judge": {
"model": {
"api_endpoint": {
"url": "<deployed-wildguard-url>/v1/completions",
"model_id": "allenai/wildguard",
"api_key": "<OPTIONAL>"
}
},
// Judge model parameters
"params": {
"parallelism": 10,
"request_timeout": 60,
"max_retries": 2
}
}
}
}
}
{
"tasks": {
"wildguard": {
"metrics": {
"safety_rate": {
"scores": {
"safety_rate": {
"value": 0.85,
"stats": {}
}
}
}
}
}
}
}
Evaluating with WildGuard requires the WildGuard judge model. Below are examples of deploying WildGuard using Docker and Kubernetes.
Docker
Run the WildGuard safety judge model with the vllm/vllm-openai Docker container. Visit vLLM Using Docker for more information.
export HF_TOKEN=<hf-token>
docker run -it --gpus all \
-p 8001:8000 \
-e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
vllm/vllm-openai:v0.8.5 \
--model allenai/wildguard
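Once the container has downloaded the model and started serving, you can verify it with the OpenAI-compatible model listing route. The check below assumes the host port mapping from the command above (host port 8001 to container port 8000):
# Should list allenai/wildguard as an available model
curl http://localhost:8001/v1/models
The <deployed-wildguard-url> in the evaluation config then points to this server, for example http://<docker-host>:8001.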
Kubernetes
The WildGuard safety judge model can be deployed to Kubernetes with the vllm/vllm-openai Docker container. Visit vLLM Using Kubernetes for more information.
Run the commands below to create a secret for your Hugging Face token and deploy the model, defined in the model.yaml manifest shown below, to your Kubernetes cluster.
export HF_TOKEN=<hf-token>
kubectl create secret generic hf-token-secret --from-literal=token=${HF_TOKEN}
kubectl apply -f model.yaml
apiVersion: v1
kind: Pod
metadata:
name: allenai-wildguard
labels:
app: allenai-wildguard
spec:
volumes:
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
containers:
- name: model
image: vllm/vllm-openai:v0.8.5
command: ["/bin/sh", "-c"]
args: [
"vllm serve allenai/wildguard --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: USE_FASTSAFETENSOR
value: "true"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1
securityContext:
privileged: true
volumeMounts:
- name: shm
mountPath: /dev/shm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 180
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 180
periodSeconds: 5
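The manifest above defines only a pod, so for a quick check (or to reach the judge from outside the cluster during testing) you can port-forward directly to it; for a long-lived deployment, expose the pod with a Kubernetes Service instead. The pod name and port below come from the manifest above.
# Forward local port 8000 to the vLLM server running in the pod
kubectl port-forward pod/allenai-wildguard 8000:8000

# In another shell, confirm the server is healthy and serving allenai/wildguard
curl http://localhost:8000/health
curl http://localhost:8000/v1/models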
Parameters#
Request Parameters#
These parameters control how requests are made to the target model or judge model:
Name | Description | Type | Default
---|---|---|---
max_retries | Maximum number of retries for failed model inference requests. | Integer | target model: 5
parallelism | Number of parallel requests to improve throughput. | Integer | target model: 8
request_timeout | Timeout in seconds for each request. | Integer | target model: 30
limit_samples | Limit the number of samples to evaluate. Useful for testing. Not available as a judge model parameter. | Integer | —
Target Model Parameters#
These parameters control the target model’s generation behavior:
Name | Description | Type | Default | Valid Values
---|---|---|---|---
temperature | Sampling temperature for generation. | Float | 0.6 | —
top_p | Nucleus sampling parameter. | Float | 0.95 | —
max_tokens | Maximum number of tokens to generate. | Integer | 6144 | —
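For example, a quick smoke test of the aegis_v2 evaluation might cap the sample count and set the generation parameters explicitly, as in the sketch below. The field names follow the tables above; limit_samples in particular is inferred from the request-parameter description, so confirm the exact names against your NeMo Evaluator version.
{
  "type": "aegis_v2",
  "params": {
    "limit_samples": 10,
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 6144,
    "parallelism": 1,
    "request_timeout": 120,
    "max_retries": 2,
    "extra": {
      "hf_token": "<hf-token>",
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "<nim-base-url>/v1/completions",
            "model_id": "meta/llama-3.1-nemoguard-8b-content-safety"
          }
        }
      }
    }
  }
}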
Extra Parameters#
Set these parameters in the params.extra section:
Name | Description | Type | Valid Values
---|---|---|---
hf_token | Hugging Face token required for accessing datasets and tokenizers. | String | Valid Hugging Face token
judge | Specifies the judge model used to evaluate the target model’s output for safety risks. | String or Dictionary | —
judge.params | Request parameters that control how requests are made to the judge model. | Dictionary | —
Metrics#
Metric Name | Description | Value Range | Notes
---|---|---|---
safety_rate | Fraction of model outputs categorized as safe. | 0.0–1.0 | Requires a safety judge
Troubleshooting#
Refer to Troubleshooting NeMo Evaluator for general troubleshooting steps for failed evaluation jobs. This section covers common issues specific to the safety harness.
Hugging Face Error#
Evaluations with the safety harness require Hugging Face access to the respective dataset and model tokenizer. If your job fails with one of the following errors, visit https://huggingface.co/ and log in to request access to the dataset or model.
datasets.exceptions.DatasetNotFoundError: Dataset 'allenai/wildguardmix' is a gated dataset on the Hub. Visit the dataset page at https://huggingface.co/datasets/allenai/wildguardmix to ask for access.
GatedRepoError: 403 Client Error.
Cannot access gated repo for url https://huggingface.co/<model>/resolve/main/tokenizer_config.json.
Your request to access model <model> is awaiting a review from the repo authors.
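To confirm that your token is valid and that access has actually been granted before re-running the job, you can query the Hugging Face Hub directly. The commands below are a sketch; allenai/wildguardmix stands in for whichever gated dataset or model your error message names, and HF_TOKEN is assumed to hold your token.
# Confirm the token identifies your account
huggingface-cli whoami

# Check access to a gated dataset; expect HTTP 200 once access is granted
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  https://huggingface.co/api/datasets/allenai/wildguardmix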
Incompatible Judge Model#
Using an unsupported judge model results in a job error. The aegis_v2 evaluation type requires the Llama Nemotron Safety Guard V2 judge, and the wildguard evaluation type requires the allenai/wildguard judge. A KeyError like the following is a typical symptom of configuring the wrong judge model.
Metrics calculated
Evaluation Metrics
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Safety Category ┃ Average Count ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ ERROR │ 5.0 │
└─────────────────┴───────────────┘
...
Subprocess finished with return code: 0
{'ERROR': 5.0}
Traceback (most recent call last):
...
"/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/__init__.py", line 14, in parse_output
return parse_output(output_dir)
File "/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/output.py", line 16, in parse_output
safety_rate = data['safe'] / sum(data.values())
KeyError: 'safe'
Unexpected Reasoning Traces#
Safety evaluations do not support reasoning traces, and evaluating a model that emits them may result in the job error below.
ERROR There are at least 2 MUT (model under test) responses that start with <think>. Reasoning traces should not be evaluated. Exiting.
If the target model outputs reasoning traces such as <think>reasoning context</think>, configure the target model’s prompt.reasoning_params.end_token so that only the content after the reasoning end token is evaluated. Also consider setting config.params.max_tokens high enough for the model’s chain of thought to finish with the expected reasoning end token; otherwise the reasoning content cannot be properly stripped before evaluation. For example:
{
"target": {
"type": "model",
"model": {
"api_endpoint": {},
"prompt": {
"reasoning_params": {
"end_token": "</think>"
}
}
}
}
}