Safety Harness Evaluations#

The safety harness supports two academic benchmarks for Language Models (LMs). Use this evaluation type to benchmark a model’s propensity to generate harmful, biased, or misleading content, and its susceptibility to malicious attacks.

Prerequisites#

  • A model target. Refer to LLM Model Endpoint for more information.

  • A Hugging Face account token. A valid token is required to access the benchmark dataset and the base model tokenizer.

  • A content safety model as a judge. The safety harness supports Llama Nemotron Safety Guard V2 and WildGuard.


Evaluation Types#

Example Safety Harness types#

| Type | Safety Judge Model | Description |
|------|--------------------|-------------|
| Nemotron Content Safety V2 | meta/llama-3.1-nemoguard-8b-content-safety | Evaluates model safety risks based on 12 top-level hazard categories. |
| WildGuard | allenai/wildguard | Evaluates model safety risks based on the following top-level categories: privacy, misinformation, harmful language, and malicious uses. |


Nemotron Content Safety V2#

Evaluate the model using the Nemotron Content Safety V2 dataset with Llama Nemotron Safety Guard V2 as the safety judge. Related paper: AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails.

A Hugging Face token is required to access the dataset and meta-llama/Llama-3.1-8B-Instruct, which provides the base model tokenizer for Nemotron Safety Guard. An example evaluation configuration:

{
  "type": "aegis_v2",
  "params": {
    // Target model parameters
    "parallelism": 1,
    "request_timeout": 120,
    "max_retries": 2,
    "extra": {
      "hf_token": "<hf-token>",
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "<nim-base-url>/v1/completions",
            "model_id": "meta/llama-3.1-nemoguard-8b-content-safety",
            "api_key": "<OPTIONAL>"
          }
        },
        // Judge model parameters
        "params": {
          "parallelism": 10,
          "request_timeout": 60,
          "max_retries": 2
        }
      }
    }
  }
}

Example results:

{
  "tasks": {
    "aegis_v2": {
      "metrics": {
        "safety_rate": {
          "scores": {
            "safety_rate": {
              "value": 0.85,
              "stats": {}
            }
          }
        }
      }
    }
  }
}
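
To launch the evaluation, submit the target and the configuration above to the Evaluator API. The following is a minimal sketch, assuming the NeMo Evaluator jobs endpoint accepts an inline target and config; the endpoint path, payload shape, and target model details are assumptions to adapt to your deployment:

# A hedged sketch of job submission; verify the endpoint path and payload
# shape against your NeMo Evaluator API reference.
curl "<evaluator-base-url>/v1/evaluation/jobs" \
  -H 'Content-Type: application/json' \
  -d '{
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "<target-base-url>/v1/completions",
          "model_id": "<target-model-id>"
        }
      }
    },
    "config": {
      "type": "aegis_v2",
      "params": {
        "extra": {
          "hf_token": "<hf-token>",
          "judge": {
            "model": {
              "api_endpoint": {
                "url": "<nim-base-url>/v1/completions",
                "model_id": "meta/llama-3.1-nemoguard-8b-content-safety"
              }
            }
          }
        }
      }
    }
  }'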

Evaluating with the content safety dataset requires Llama Nemotron Safety Guard V2 as the judge model. Run the following command to deploy it as an NVIDIA Inference Microservice (NIM) using the NeMo Deployment Management Service:

curl http://nemo.test/v1/deployment/model-deployments \
  -H 'Content-Type: application/json' \
  -d @nemotron-safety-guard.json

where nemotron-safety-guard.json contains:

{
  "name": "llama-3.1-nemoguard-8b-content-safety",
  "namespace": "meta",
  "config": {
    "model": "meta/llama-3.1-nemoguard-8b-content-safety",
    "nim_deployment": {
      "disable_lora_support": true,
      "additional_envs": {
        "NIM_GUIDED_DECODING_BACKEND": "outlines"
      },
      "gpu": 1,
      "image_name": "nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-content-safety",
      "image_tag": "1.10.1"
    }
  }
}
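
Deployment can take several minutes while the NIM image is pulled and the model loads. As a sketch, assuming the Deployment Management Service supports GET on the same path, you can poll the deployment status until it reports ready:

# Assumed status check; verify the exact path against your
# NeMo Deployment Management Service API reference.
curl http://nemo.test/v1/deployment/model-deployments/meta/llama-3.1-nemoguard-8b-content-safety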

WildGuard#

Evaluate the model using the WildGuardMix dataset and the WildGuard model as a safety judge. Related paper: WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

A Hugging Face token is required to access the dataset and mistralai/Mistral-7B-v0.3, which provides the base model tokenizer for WildGuard. An example evaluation configuration:

{
  "type": "wildguard",
  "params": {
    // Target model parameters
    "parallelism": 1,
    "request_timeout": 120,
    "max_retries": 2,
    "extra": {
      "hf_token": "<hf-token>",
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "<deployed-wildguard-url>/v1/completions",
            "model_id": "allenai/wildguard",
            "api_key": "<OPTIONAL>"
          }
        },
        // Judge model parameters
        "params": {
          "parallelism": 10,
          "request_timeout": 60,
          "max_retries": 2
        }
      }
    }
  }
}

Example results:

{
  "tasks": {
    "wildguard": {
      "metrics": {
        "safety_rate": {
          "scores": {
            "safety_rate": {
              "value": 0.85,
              "stats": {}
            }
          }
        }
      }
    }
  }
}

Evaluating with WildGuard requires the WildGuard judge model. Below are examples of deploying WildGuard using Docker and Kubernetes.

Docker

Run the WildGuard safety judge model with the vllm/vllm-openai Docker container. Visit vLLM Using Docker for more information.

export HF_TOKEN=<hf-token>

docker run -it --gpus all \
  -p 8001:8000 \
  -e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
  vllm/vllm-openai:v0.8.5 \
  --model allenai/wildguard
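
Once the server is up, you can smoke-test the endpoint from the host; port 8001 maps to the container’s port 8000 in the command above. The prompt below is a placeholder, not a real WildGuard classification prompt:

curl http://localhost:8001/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "allenai/wildguard", "prompt": "Hello", "max_tokens": 8}'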

Kubernetes

The WildGuard safety judge model can be deployed to Kubernetes with the vllm/vllm-openai Docker container. Visit vLLM Using Kubernetes for more information.

Run the commands below to create a secret for your Hugging Face token and deploy the model to your Kubernetes cluster.

export HF_TOKEN=<hf-token>

kubectl create secret generic hf-token-secret --from-literal=token=${HF_TOKEN}
kubectl apply -f model.yaml

where model.yaml contains:

apiVersion: v1
kind: Pod
metadata:
  name: allenai-wildguard
  labels:
    app: allenai-wildguard
spec:
  volumes:
  # vLLM needs to access the host's shared memory for tensor parallel inference.
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: "2Gi"
  containers:
  - name: model
    image: vllm/vllm-openai:v0.8.5
    command: ["/bin/sh", "-c"]
    args: [
      "vllm serve allenai/wildguard --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
    ]
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token
    - name: USE_FASTSAFETENSOR
      value: "true"
    ports:
    - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      privileged: true
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 180
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 180
      periodSeconds: 5
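
Before pointing the evaluation config at the pod, you can verify that it is serving. One way, assuming kubectl access to the cluster, is to port-forward the pod and query the vLLM health endpoint:

# In one terminal: forward a local port to the pod's serving port.
kubectl port-forward pod/allenai-wildguard 8001:8000

# In another terminal: the health endpoint returns HTTP 200 when ready.
curl http://localhost:8001/health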

Parameters#

Request Parameters#

These parameters control how requests are made to the target model or judge model:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| max_retries | Maximum number of retries for failed model inference requests. | Integer | target model: 5, judge model: 16 |
| parallelism | Number of parallel requests to improve throughput. | Integer | target model: 8, judge model: 32 |
| request_timeout | Timeout in seconds for each request. | Integer | target model: 30, judge model: 60 |
| limit_samples | Limit the number of samples to evaluate. Useful for testing (see the example below). Not available as a judge model parameter. | Integer | null (all samples) |
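
For example, to smoke-test a configuration on a handful of samples before a full run, set limit_samples alongside the other request parameters. This is a config fragment; merge it into the params section of the full examples above:

"params": {
  "limit_samples": 10,
  "parallelism": 1,
  "request_timeout": 120,
  "max_retries": 2
}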

Target Model Parameters#

These parameters control the target model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
|------|-------------|------|---------|--------------|
| temperature | Sampling temperature for generation. | Float | 0.6 | 0.0–2.0 |
| top_p | Nucleus sampling parameter. | Float | 0.95 | 0.01–1.0 |
| max_tokens | Maximum number of tokens to generate. | Integer | 6144 | |
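
These fields sit directly under params, next to the request parameters. A fragment using the defaults from the table:

"params": {
  "temperature": 0.6,
  "top_p": 0.95,
  "max_tokens": 6144
}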

Extra Parameters#

Set these parameters in the params.extra section:

| Name | Description | Type | Valid Values |
|------|-------------|------|--------------|
| hf_token | Hugging Face token used to access datasets and tokenizers. Required. | String | A valid Hugging Face token |
| judge.model | The judge model that evaluates the target model’s output for safety risks. Required. | String or Dictionary | "meta/llama-3.1-nemoguard-8b-content-safety" or {"api_endpoint": {"model_id": "meta/llama-3.1-nemoguard-8b-content-safety", "url": "<nim-url>/v1/completions"}} |
| judge.params | Request parameters that control how requests are made to the judge model. | Request Parameters | {"parallelism": 2} |

Metrics#

Core Supported Metrics in the Safety Harness#

| Metric Name | Description | Value Range | Notes |
|-------------|-------------|-------------|-------|
| safety_rate | Fraction of model outputs categorized as safe. | 0.0–1.0 | Requires a safety judge |

Troubleshooting#

See Troubleshooting NeMo Evaluator for general troubleshooting steps for failed evaluation jobs. This section covers common issues specific to the safety harness.

Hugging Face Error#

Evaluations with the safety harness require Hugging Face access to the respective dataset and model tokenizer. If your job fails with errors like the following, visit https://huggingface.co/, log in, and request access to the dataset or model.

datasets.exceptions.DatasetNotFoundError: Dataset 'allenai/wildguardmix' is a gated dataset on the Hub. Visit the dataset page at https://huggingface.co/datasets/allenai/wildguardmix to ask for access.
GatedRepoError: 403 Client Error.

Cannot access gated repo for url https://huggingface.co/<model>/resolve/main/tokenizer_config.json.
Your request to access model <model> is awaiting a review from the repo authors.
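
To confirm that the token itself is valid before re-running the job, you can query Hugging Face with it. This sketch assumes the huggingface_hub Python package is installed; note that a valid token alone is not sufficient, because gated repositories also require approved access on their pages:

python -c "from huggingface_hub import whoami; print(whoami(token='<hf-token>'))"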

Incompatible Judge Model#

Using an unsupported judge model results in a job error. The aegis_v2 evaluation type requires the Llama Nemotron Safety Guard V2 judge, and the wildguard evaluation type requires the allenai/wildguard judge. A KeyError like the following is a typical symptom of a mismatched judge model:

Metrics calculated


        Evaluation Metrics         
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Safety Category ┃ Average Count ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│           ERROR │           5.0 │
└─────────────────┴───────────────┘

...

Subprocess finished with return code: 0
{'ERROR': 5.0}
Traceback (most recent call last):
...
"/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/__init__.py", line 14, in parse_output
    return parse_output(output_dir)
  File "/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/output.py", line 16, in parse_output
    safety_rate = data['safe'] / sum(data.values())
KeyError: 'safe'

Unexpected Reasoning Traces#

Safety evaluations do not support reasoning traces; target model responses that contain them cause the job error below.

ERROR    There are  at least 2 MUT (model under test) responses that start with <think>. Reasoning traces should not be evaluated. Exiting.

If the target model outputs reasoning traces such as <think>reasoning context</think>, set the target model’s prompt.reasoning_params.end_token so that only the text after the final reasoning token is evaluated. Also set config.params.max_tokens high enough for the model’s chain of thought to conclude with the expected end token; otherwise the reasoning content cannot be stripped before evaluation. For example:

{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {},
      "prompt": {
        "reasoning_params": {
          "end_token": "</think>"
        }
      }
    }
  }
}
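
On the config side, a companion fragment capping generation so the trace can finish with the end token. The value 6144 is the documented default for max_tokens; tune it to your model’s typical chain-of-thought length:

{
  "type": "aegis_v2",
  "params": {
    "max_tokens": 6144
  }
}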