safety_eval#

This page contains all evaluation tasks for the safety_eval harness.

| Task | Description |
|---|---|
| aegis_v2 | Aegis V2 without evaluating reasoning traces. This version is used by the NeMo Safety Toolkit. |
| aegis_v2_completions | Aegis V2 without evaluating reasoning traces. This variant uses the completions endpoint. |
| aegis_v2_reasoning | Aegis V2 with reasoning-trace evaluation enabled. |
| compliance | Compliance integrity benchmark: evaluates model responses against a policy YAML using an LLM judge (chat). |
| wildguard | WildGuard safety benchmark (chat endpoint). |
| wildguard_completions | WildGuard safety benchmark. This variant uses the completions endpoint. |

aegis_v2#

Aegis V2 without evaluating reasoning traces. This version is used by the NeMo Safety Toolkit.

Harness: safety_eval

Container: nvcr.io/nvidia/eval-factory/safety-harness:26.01

Container Digest: sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76

Container Arch: multiarch

Task Type: aegis_v2

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}}  && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval  --model-name  {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}}  --judge-url  {{config.params.extra.judge.url}}   --results-dir {{config.output_dir}}   --eval {{config.params.task}}  --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}}  {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces  %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
Default configuration:

```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
  params:
    max_new_tokens: 6144
    max_retries: 5
    parallelism: 8
    task: aegis_v2
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      judge:
        url: null
        model_id: null
        api_key: null
        parallelism: 32
        request_timeout: 60
        max_retries: 16
      evaluate_reasoning_traces: false
  supported_endpoint_types:
  - chat
  type: aegis_v2
target:
  api_endpoint:
    stream: false
```
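The judge defaults above are all `null`, so a runnable setup has to supply them. As a sketch of how the judge block might be filled in (every value below is a hypothetical placeholder, not a default shipped with the harness):

```yaml
# Hypothetical override: point the judge at a locally hosted endpoint.
# The URL, model name, and env-var name are illustrative only.
config:
  params:
    extra:
      judge:
        url: http://localhost:8000/v1/chat/completions   # hypothetical
        model_id: my-judge-model                          # hypothetical
        api_key: MY_JUDGE_KEY_VAR   # per the template, this names the env var exported as JUDGE_API_KEY
        parallelism: 32
```

Per the command template, `judge.api_key` holds the *name* of an environment variable (it is rendered as `export JUDGE_API_KEY=${{...}}`), not the key value itself.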

aegis_v2_completions#

Aegis V2 without evaluating reasoning traces. This variant uses the completions endpoint.

Harness: safety_eval

Container: nvcr.io/nvidia/eval-factory/safety-harness:26.01

Container Digest: sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76

Container Arch: multiarch

Task Type: aegis_v2_completions

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}}  && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval  --model-name  {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}}  --judge-url  {{config.params.extra.judge.url}}   --results-dir {{config.output_dir}}   --eval {{config.params.task}}  --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}}  {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces  %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
Default configuration:

```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
  params:
    max_new_tokens: 6144
    max_retries: 5
    parallelism: 8
    task: aegis_v2
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      judge:
        url: null
        model_id: null
        api_key: null
        parallelism: 32
        request_timeout: 60
        max_retries: 16
      evaluate_reasoning_traces: false
  supported_endpoint_types:
  - completions
  type: aegis_v2_completions
target:
  api_endpoint:
    stream: false
```

aegis_v2_reasoning#

Aegis V2 with reasoning-trace evaluation enabled.

Harness: safety_eval

Container: nvcr.io/nvidia/eval-factory/safety-harness:26.01

Container Digest: sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76

Container Arch: multiarch

Task Type: aegis_v2_reasoning

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}}  && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval  --model-name  {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}}  --judge-url  {{config.params.extra.judge.url}}   --results-dir {{config.output_dir}}   --eval {{config.params.task}}  --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}}  {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces  %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
Default configuration:

```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
  params:
    max_new_tokens: 6144
    max_retries: 5
    parallelism: 8
    task: aegis_v2
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      judge:
        url: null
        model_id: null
        api_key: null
        parallelism: 32
        request_timeout: 60
        max_retries: 16
      evaluate_reasoning_traces: true
  supported_endpoint_types:
  - chat
  type: aegis_v2_reasoning
target:
  api_endpoint:
    stream: false
```
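Note that the command template only emits `--evaluate-reasoning-traces` when `config.type` is `aegis_v2_reasoning` *and* `evaluate_reasoning_traces` is true; setting the flag on the other task types has no effect. A minimal sketch of turning trace evaluation off for this task (override value shown for illustration):

```yaml
# Hypothetical override: run the aegis_v2_reasoning task but skip trace
# evaluation. Per the template, --evaluate-reasoning-traces is dropped
# whenever this flag is false.
config:
  params:
    extra:
      evaluate_reasoning_traces: false
```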

compliance#

Compliance integrity benchmark: evaluates model responses against a policy YAML using an LLM judge (chat).

Harness: safety_eval

Container: nvcr.io/nvidia/eval-factory/safety-harness:26.01

Container Digest: sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76

Container Arch: multiarch

Task Type: compliance

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}}  && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval  --model-name  {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}}  --judge-url  {{config.params.extra.judge.url}}   --results-dir {{config.output_dir}}   --eval {{config.params.task}}  --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}}  {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces  %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
Default configuration:

```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
  params:
    max_new_tokens: 6144
    max_retries: 5
    parallelism: 8
    task: compliance
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      judge:
        url: null
        model_id: null
        api_key: null
        parallelism: 32
        request_timeout: 60
        max_retries: 16
  supported_endpoint_types:
  - chat
  type: compliance
target:
  api_endpoint:
    stream: false
```
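Since this benchmark judges responses against a policy YAML, a run will typically set `extra.policy`, which the command template forwards as `--policy`. A sketch with hypothetical values (the path and sample limit below are illustrative, not harness defaults):

```yaml
# Hypothetical override: supply a custom policy file and cap the sample count.
# extra.policy feeds --policy and limit_samples feeds --limit in the template.
config:
  params:
    limit_samples: 100                                  # hypothetical
    extra:
      policy: /workspace/policies/my_policy.yaml        # hypothetical path
```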

wildguard#

WildGuard safety benchmark (chat endpoint).

Harness: safety_eval

Container: nvcr.io/nvidia/eval-factory/safety-harness:26.01

Container Digest: sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76

Container Arch: multiarch

Task Type: wildguard

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}}  && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval  --model-name  {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}}  --judge-url  {{config.params.extra.judge.url}}   --results-dir {{config.output_dir}}   --eval {{config.params.task}}  --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}}  {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces  %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
Default configuration:

```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
  params:
    max_new_tokens: 6144
    max_retries: 5
    parallelism: 8
    task: wildguard
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      judge:
        url: null
        model_id: null
        api_key: null
        parallelism: 32
        request_timeout: 60
        max_retries: 16
  supported_endpoint_types:
  - chat
  type: wildguard
target:
  api_endpoint:
    stream: false
```

wildguard_completions#

WildGuard safety benchmark. This variant uses the completions endpoint.

Harness: safety_eval

Container: nvcr.io/nvidia/eval-factory/safety-harness:26.01

Container Digest: sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76

Container Arch: multiarch

Task Type: wildguard_completions

Command template:

```jinja
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}}  && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval  --model-name  {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}}  --judge-url  {{config.params.extra.judge.url}}   --results-dir {{config.output_dir}}   --eval {{config.params.task}}  --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}}  {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces  %} --evaluate-reasoning-traces {% endif %} {% endif %}
```
Default configuration:

```yaml
framework_name: safety_eval
pkg_name: safety_eval
config:
  params:
    max_new_tokens: 6144
    max_retries: 5
    parallelism: 8
    task: wildguard
    temperature: 0.6
    request_timeout: 30
    top_p: 0.95
    extra:
      judge:
        url: null
        model_id: null
        api_key: null
        parallelism: 32
        request_timeout: 60
        max_retries: 16
  supported_endpoint_types:
  - completions
  type: wildguard_completions
target:
  api_endpoint:
    stream: false
```
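The command template also accepts an optional `extra.dataset`, forwarded as `--dataset` only when it is defined and non-empty. A sketch of pointing a run at a custom data file (the path below is a hypothetical placeholder, and the expected file format is not specified on this page):

```yaml
# Hypothetical override: run against a custom dataset file.
# extra.dataset feeds the --dataset flag in the command template.
config:
  params:
    extra:
      dataset: /workspace/data/wildguard_subset.jsonl   # hypothetical path
```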