safety_eval
This page lists all evaluation tasks available in the safety_eval harness.
| Task | Description |
|---|---|
| aegis_v2 | Aegis V2 without evaluating reasoning traces. This version is used by the NeMo Safety Toolkit. |
| aegis_v2_completions | Aegis V2 without evaluating reasoning traces. This variant uses the completions endpoint. |
| aegis_v2_reasoning | Aegis V2 with evaluation of reasoning traces. |
| compliance | Compliance integrity benchmark that evaluates model responses against a policy YAML using an LLM judge (chat). |
| wildguard | WildGuard safety benchmark. |
| wildguard_completions | WildGuard safety benchmark. This variant uses the completions endpoint. |
aegis_v2
Aegis V2 without evaluating reasoning traces. This version is used by the NeMo Safety Toolkit.
Harness: safety_eval
Container:
nvcr.io/nvidia/eval-factory/safety-harness:26.01
Container Digest:
sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76
Container Arch: multiarch
Task Type: aegis_v2
Command:
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
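With the default parameters below and placeholder endpoint values (the model names, URLs, environment-variable names, and results directory in this sketch are illustrative, not real), the template renders to a command along these lines:

```shell
export API_KEY=$MY_API_KEY_VAR && export JUDGE_API_KEY=$MY_JUDGE_KEY_VAR && \
safety-eval \
  --model-name my-model \
  --model-url http://localhost:8000/v1/chat/completions \
  --model-type chat \
  --judge-url http://localhost:8001/v1/chat/completions \
  --results-dir /results \
  --eval aegis_v2 \
  --mut-inference-params max_tokens=6144,temperature=0.6,top_p=0.95,timeout=30,concurrency=8,retries=5 \
  --judge-inference-params concurrency=32,retries=5 \
  --judge-model-name my-judge-model
```

Note that, per the template, the judge's retry count comes from `config.params.max_retries` (5), not from `extra.judge.max_retries` (16).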
Defaults:
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: aegis_v2
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
evaluate_reasoning_traces: false
supported_endpoint_types:
- chat
type: aegis_v2
target:
api_endpoint:
stream: false
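The judge `url`, `model_id`, and `api_key` default to null and must be supplied before running. A minimal override mirroring the config structure above (the values here are placeholders) might look like:

```yaml
config:
  params:
    extra:
      judge:
        url: http://localhost:8001/v1/chat/completions
        model_id: my-judge-model
        api_key: MY_JUDGE_KEY_VAR
```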
aegis_v2_completions
Aegis V2 without evaluating reasoning traces. This variant uses the completions endpoint.
Harness: safety_eval
Container:
nvcr.io/nvidia/eval-factory/safety-harness:26.01
Container Digest:
sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76
Container Arch: multiarch
Task Type: aegis_v2_completions
Command:
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
Defaults:
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: aegis_v2
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
evaluate_reasoning_traces: false
supported_endpoint_types:
- completions
type: aegis_v2_completions
target:
api_endpoint:
stream: false
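In every variant, the `--mut-inference-params` value is a comma-separated `key=value` string mapped from the config fields. A minimal Python sketch of that mapping (the helper name is illustrative, not part of the harness):

```python
def mut_inference_params(params: dict) -> str:
    """Join model-under-test inference settings into the CLI argument value."""
    mapping = {  # CLI key -> config key, as used in the command template
        "max_tokens": "max_new_tokens",
        "temperature": "temperature",
        "top_p": "top_p",
        "timeout": "request_timeout",
        "concurrency": "parallelism",
        "retries": "max_retries",
    }
    return ",".join(f"{cli}={params[cfg]}" for cli, cfg in mapping.items())

# Defaults from the config above.
defaults = {
    "max_new_tokens": 6144,
    "temperature": 0.6,
    "top_p": 0.95,
    "request_timeout": 30,
    "parallelism": 8,
    "max_retries": 5,
}
print(mut_inference_params(defaults))
# max_tokens=6144,temperature=0.6,top_p=0.95,timeout=30,concurrency=8,retries=5
```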
aegis_v2_reasoning
Aegis V2 with evaluation of reasoning traces.
Harness: safety_eval
Container:
nvcr.io/nvidia/eval-factory/safety-harness:26.01
Container Digest:
sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76
Container Arch: multiarch
Task Type: aegis_v2_reasoning
Command:
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
Defaults:
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: aegis_v2
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
evaluate_reasoning_traces: true
supported_endpoint_types:
- chat
type: aegis_v2_reasoning
target:
api_endpoint:
stream: false
compliance
Compliance integrity benchmark that evaluates model responses against a policy YAML using an LLM judge (chat).
Harness: safety_eval
Container:
nvcr.io/nvidia/eval-factory/safety-harness:26.01
Container Digest:
sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76
Container Arch: multiarch
Task Type: compliance
Command:
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
Defaults:
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: compliance
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
supported_endpoint_types:
- chat
type: compliance
target:
api_endpoint:
stream: false
wildguard
WildGuard safety benchmark.
Harness: safety_eval
Container:
nvcr.io/nvidia/eval-factory/safety-harness:26.01
Container Digest:
sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76
Container Arch: multiarch
Task Type: wildguard
Command:
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
Defaults:
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: wildguard
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
supported_endpoint_types:
- chat
type: wildguard
target:
api_endpoint:
stream: false
wildguard_completions
WildGuard safety benchmark. This variant uses the completions endpoint.
Harness: safety_eval
Container:
nvcr.io/nvidia/eval-factory/safety-harness:26.01
Container Digest:
sha256:86df570bd581059d1a5133dcc055ea1adb3e2308ac36414e1377331f8eabba76
Container Arch: multiarch
Task Type: wildguard_completions
Command:
{% if target.api_endpoint.api_key_name is not none %}export API_KEY=${{target.api_endpoint.api_key_name}} && {% endif %} {% if config.params.extra.judge.api_key is not none %}export JUDGE_API_KEY=${{config.params.extra.judge.api_key}} && {% endif %} safety-eval --model-name {{target.api_endpoint.model_id}} --model-url {{target.api_endpoint.url}} --model-type {{target.api_endpoint.type}} --judge-url {{config.params.extra.judge.url}} --results-dir {{config.output_dir}} --eval {{config.params.task}} --mut-inference-params max_tokens={{config.params.max_new_tokens}},temperature={{config.params.temperature}},top_p={{config.params.top_p}},timeout={{config.params.request_timeout}},concurrency={{config.params.parallelism}},retries={{config.params.max_retries}} --judge-inference-params concurrency={{config.params.extra.judge.parallelism}},retries={{config.params.max_retries}} {% if config.params.extra.dataset is defined and config.params.extra.dataset %} --dataset '{{config.params.extra.dataset}}'{% endif %} {% if config.params.extra.policy is defined and config.params.extra.policy %} --policy '{{config.params.extra.policy}}'{% endif %} {% if config.params.limit_samples is not none %} --limit {{config.params.limit_samples}} {% endif %} {% if config.params.extra.judge.model_id is not none %} --judge-model-name {{config.params.extra.judge.model_id}} {% endif %} {% if config.type == "aegis_v2_reasoning" %} {% if config.params.extra.evaluate_reasoning_traces %} --evaluate-reasoning-traces {% endif %} {% endif %}
Defaults:
framework_name: safety_eval
pkg_name: safety_eval
config:
params:
max_new_tokens: 6144
max_retries: 5
parallelism: 8
task: wildguard
temperature: 0.6
request_timeout: 30
top_p: 0.95
extra:
judge:
url: null
model_id: null
api_key: null
parallelism: 32
request_timeout: 60
max_retries: 16
supported_endpoint_types:
- completions
type: wildguard_completions
target:
api_endpoint:
stream: false