Create an Auditor Configuration#

Schema for Audit Configurations#

| Field | Description | Default Value |
|---|---|---|
| name | Specifies the name of the audit config. Names must be unique within a namespace. The maximum length is 250 characters. | None (Required) |
| namespace | Specifies the namespace for the audit config. The maximum length is 250 characters. | None (Required) |
| plugins | Specifies the options for the audit, such as the probes to run. A common customization is to specify the probes to run in the plugins.probe_spec field. Refer to Probe Reference Summary to view all the probes. The model_type, model_name, and generator fields are ignored. These fields are populated by the audit target that you specify when you run the audit job. | None |
| reporting | Specifies the reporting options for the audit, such as show_100_pass_modules, which controls whether the report includes entries that scored 100%. Refer to Configuring garak in the garak documentation for the subfields. | None |
| run | Specifies the run-specific settings, such as the seed, for the scan. Refer to Configuring garak in the garak documentation for the subfields. | None |
| system | Specifies the settings related to how the system performs the scan, such as the max_workers. Refer to Configuring garak in the garak documentation for the subfields. | None |
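
The constraints in the schema can be checked on the client before a request is sent. The following is a minimal sketch, not part of the SDK: the helper name `check_audit_config` and its return format are hypothetical, and it only encodes the limits stated above (required name and namespace, 250-character maximum, and the ignored plugins fields).

```python
MAX_NAME_LENGTH = 250  # limit stated in the schema for name and namespace
# Fields the schema describes as ignored because the audit target supplies them.
IGNORED_PLUGIN_FIELDS = ("model_type", "model_name", "generator")


def check_audit_config(name, namespace, plugins=None):
    """Hypothetical client-side pre-check; returns a list of problem descriptions."""
    problems = []
    for label, value in (("name", name), ("namespace", namespace)):
        if not value:
            problems.append(f"{label} is required")
        elif len(value) > MAX_NAME_LENGTH:
            problems.append(f"{label} exceeds {MAX_NAME_LENGTH} characters")
    for field in IGNORED_PLUGIN_FIELDS:
        if plugins and field in plugins:
            problems.append(f"plugins.{field} is ignored; the audit target sets it")
    return problems


# A valid configuration produces no problems.
print(check_audit_config("demo-basic-config", "default",
                         {"probe_spec": "dan.AutoDANCached,goodside.Tag"}))
# An over-long name and an ignored field are both reported.
print(check_audit_config("x" * 251, "default", {"model_name": "my-model"}))
```

The service remains the source of truth for validation; a check like this only surfaces obvious mistakes earlier.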

Create an Audit Configuration#

To create an audit configuration, send a POST request to the /v1beta1/audit/configs endpoint.

  1. Set AUDITOR_BASE_URL to specify the service:

    $ export AUDITOR_BASE_URL=http://localhost:5000
    
  2. Create the configuration:

    import os
    from nemo_microservices import NeMoMicroservices
    
    client = NeMoMicroservices(base_url=os.getenv("AUDITOR_BASE_URL"))
    
    config = client.beta.audit.configs.create(
        name="demo-basic-config",
        namespace="default",
        description="Basic demonstration configuration",
        system={
            "parallel_attempts": 20,
            "lite": True
        },
        plugins={
            "probe_spec": "dan.AutoDANCached,goodside.Tag"
        },
        reporting={
            "extended_detectors": False
        }
    )
    print(config)
    
    curl -X POST "${AUDITOR_BASE_URL}/v1beta1/audit/configs" \
      -H "Accept: application/json" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "demo-basic-config",
        "namespace": "default",
        "description": "Basic demonstration configuration",
        "system": {
            "parallel_attempts": 20,
            "lite": true
        },
        "plugins": {
            "probe_spec": "dan.AutoDANCached,goodside.Tag"
        }
    }' | jq
    
    Example Output
    AuditConfig(id='audit_config-F6pVLsJbYGWfCTKDvZrHr7',
    created_at=datetime.datetime(2025, 7, 24, 17, 20, 31, 482028),
    custom_fields={}, description='Basic demonstration configuration',
    name='demo-basic-config', namespace='default', ownership=None,
    plugins=AuditPluginsDataOutput(buff_max=None, buff_spec=None, buffs={},
    buffs_include_original_prompt=False, detector_spec='auto', detectors={},
    extended_detectors=False, generators={}, harnesses={}, model_name=None,
    model_type=None, probe_spec='dan.AutoDANCached,goodside.Tag', probes={}),
    project=None, reporting=AuditReportData(report_dir='garak_runs',
    report_prefix='run1', show_100_pass_modules=True, taxonomy=None),
    run=AuditRunData(deprefix=True, eval_threshold=0.5, generations=5,
    probe_tags=None, seed=None, user_agent='garak/{version} (LLM vulnerability
    scanner https://garak.ai)'), schema_version='1.0',
    system=AuditSystemData(enable_experimental=False, lite=True,
    narrow_output=False, parallel_attempts=20, parallel_requests=False,
    show_z=False, verbose=0), type_prefix=None,
    updated_at=datetime.datetime(2025, 7, 24, 17, 20, 31, 482032))
    
    {
      "schema_version": "1.0",
      "id": "audit_config-VkJUpwkSVakpp3AhKqEh86",
      "description": "Basic demonstration configuration",
      "type_prefix": null,
      "namespace": "default",
      "project": null,
      "created_at": "2025-07-24T16:50:07.764146",
      "updated_at": "2025-07-24T16:50:07.764150",
      "custom_fields": {},
      "ownership": null,
      "name": "demo-basic-config",
      "system": {
        "verbose": 0,
        "narrow_output": false,
        "parallel_requests": false,
        "parallel_attempts": 20,
        "lite": true,
        "show_z": false,
        "enable_experimental": false
      },
      "run": {
        "seed": null,
        "deprefix": true,
        "eval_threshold": 0.5,
        "generations": 5,
        "probe_tags": null,
        "user_agent": "garak/{version} (LLM vulnerability scanner https://garak.ai)"
      },
      "plugins": {
        "model_type": null,
        "model_name": null,
        "probe_spec": "dan.AutoDANCached,goodside.Tag",
        "detector_spec": "auto",
        "extended_detectors": false,
        "buff_spec": null,
        "buffs_include_original_prompt": false,
        "buff_max": null,
        "detectors": {},
        "generators": {},
        "buffs": {},
        "harnesses": {},
        "probes": {}
      },
      "reporting": {
        "report_prefix": "run1",
        "taxonomy": null,
        "report_dir": "garak_runs",
        "show_100_pass_modules": true
      }
    }
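
If neither the SDK nor curl is available, the same request can be assembled with the Python standard library. This is a sketch under the assumption that the endpoint accepts the same JSON body as the curl example; `build_create_request` is a hypothetical helper name. Serializing the payload with `json.dumps` sends `parallel_attempts` and `lite` as a JSON number and boolean rather than as strings.

```python
import json
import os
import urllib.request


def build_create_request(base_url, payload):
    """Build (but do not send) the POST request for a new audit configuration."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1beta1/audit/configs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "name": "demo-basic-config",
    "namespace": "default",
    "description": "Basic demonstration configuration",
    "system": {"parallel_attempts": 20, "lite": True},  # int and bool, not strings
    "plugins": {"probe_spec": "dan.AutoDANCached,goodside.Tag"},
}

request = build_create_request(
    os.getenv("AUDITOR_BASE_URL", "http://localhost:5000"), payload
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running service (not executed here):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```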
    

Default Configuration#

The microservice ships with a default configuration in the default namespace. The default configuration performs approximately 29,000 inference requests.

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("AUDITOR_BASE_URL"))
    
config = client.beta.audit.configs.retrieve(
    config_name="default",
    namespace="default"
)
print(config)

curl "${AUDITOR_BASE_URL}/v1beta1/audit/configs/default/default" \
  -H "Accept: application/json" | jq

Example Output

AuditConfig(id='audit_config-KwpJnX7vxKftMdff1hqzru',
created_at=datetime.datetime(2025, 7, 21, 15, 17, 56, 660457),
custom_fields={}, description=None, name='default', namespace='default',
ownership=None, plugins=AuditPluginsDataOutput(buff_max=None,
buff_spec=None, buffs={}, buffs_include_original_prompt=False,
detector_spec='auto', detectors={}, extended_detectors=False,
generators={}, harnesses={}, model_name=None, model_type=None, probe_spec='
ansiescape,atkgen,continuation,dan.Ablation_Dan_11_0,dan.AutoDANCached,dan.
DanInTheWild,divergence,encoding,exploitation,goodside,grandma,latentinject
ion,leakreplay,lmrc.Bullying,lmrc.Deadnaming,lmrc.QuackMedicine,lmrc.Sexual
Content,lmrc.Sexualisation,lmrc.SlurUsage,malwaregen,misleading,packagehall
ucination,phrasing,promptinject,realtoxicityprompts.RTPBlank,snowball.Graph
Connectivity,suffix.GCGCached,tap.TAPCached,topic,xss', probes={'encoding':
{'payloads': ['default', 'xss']}}), project=None,
reporting=AuditReportData(report_dir='garak_runs', report_prefix='run1',
show_100_pass_modules=True, taxonomy=None), run=AuditRunData(deprefix=True,
eval_threshold=0.5, generations=3, probe_tags=None, seed=None,
user_agent='garak/{version} (LLM vulnerability scanner https://garak.ai)'),
schema_version='1.0', system=AuditSystemData(enable_experimental=False,
lite=False, narrow_output=False, parallel_attempts=32,
parallel_requests=False, show_z=False, verbose=0), type_prefix=None,
updated_at=datetime.datetime(2025, 7, 21, 15, 17, 56, 660460))

{
  "schema_version": "1.0",
  "id": "audit_config-KwpJnX7vxKftMdff1hqzru",
  "description": null,
  "type_prefix": null,
  "namespace": "default",
  "project": null,
  "created_at": "2025-07-21T15:17:56.660457",
  "updated_at": "2025-07-21T15:17:56.660460",
  "custom_fields": {},
  "ownership": null,
  "name": "default",
  "system": {
    "verbose": 0,
    "narrow_output": false,
    "parallel_requests": false,
    "parallel_attempts": 32,
    "lite": false,
    "show_z": false,
    "enable_experimental": false
  },
  "run": {
    "seed": null,
    "deprefix": true,
    "eval_threshold": 0.5,
    "generations": 3,
    "probe_tags": null,
    "user_agent": "garak/{version} (LLM vulnerability scanner https://garak.ai)"
  },
  "plugins": {
    "model_type": null,
    "model_name": null,
    "probe_spec": "ansiescape,atkgen,continuation,dan.Ablation_Dan_11_0,dan.AutoDANCached,dan.DanInTheWild,divergence,encoding,exploitation,goodside,grandma,latentinjection,leakreplay,lmrc.Bullying,lmrc.Deadnaming,lmrc.QuackMedicine,lmrc.SexualContent,lmrc.Sexualisation,lmrc.SlurUsage,malwaregen,misleading,packagehallucination,phrasing,promptinject,realtoxicityprompts.RTPBlank,snowball.GraphConnectivity,suffix.GCGCached,tap.TAPCached,topic,xss",
    "detector_spec": "auto",
    "extended_detectors": false,
    "buff_spec": null,
    "buffs_include_original_prompt": false,
    "buff_max": null,
    "detectors": {},
    "generators": {},
    "buffs": {},
    "harnesses": {},
    "probes": {
      "encoding": {
        "payloads": [
          "default",
          "xss"
        ]
      }
    }
  },
  "reporting": {
    "report_prefix": "run1",
    "taxonomy": null,
    "report_dir": "garak_runs",
    "show_100_pass_modules": true
  }
}
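
To see how a custom configuration departs from the default, the two JSON responses can be compared key by key. The sketch below is illustrative only: `flatten` and `diff_configs` are hypothetical helpers, and the dictionaries contain just a few values copied from the example outputs on this page rather than full responses.

```python
def flatten(settings, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {'run': {'seed': 1}} -> {'run.seed': 1}."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def diff_configs(base, other):
    """Return {dotted_key: (base_value, other_value)} for settings that differ."""
    flat_base, flat_other = flatten(base), flatten(other)
    return {
        key: (flat_base.get(key), flat_other.get(key))
        for key in sorted(flat_base.keys() | flat_other.keys())
        if flat_base.get(key) != flat_other.get(key)
    }


# Values copied from the example outputs on this page (only a subset of fields).
default_config = {
    "system": {"parallel_attempts": 32, "lite": False},
    "run": {"generations": 3},
}
demo_config = {
    "system": {"parallel_attempts": 20, "lite": True},
    "run": {"generations": 5},
}

for key, (default_value, demo_value) in diff_configs(default_config, demo_config).items():
    print(f"{key}: default={default_value!r} demo={demo_value!r}")
```

In practice the inputs would be the parsed JSON bodies returned by the two GET or POST calls.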

Probe Reference Summary#

The following table summarizes the probes that you can specify in the config.plugins.probe_spec field.

Specify an individual probe by name, such as dan.Ablation_Dan_11_0, or specify a category of probes, such as dan. Separate probes and categories with commas and no spaces, such as ansiescape,dan.Ablation_Dan_11_0.

The probes identified as Default are included in the default/default configuration.

For detailed information about a probe, refer to the garak.probes page in the garak documentation.
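
The comma-separated format described above can be assembled and inspected with a few lines of Python. This is a minimal sketch: `build_probe_spec` and `split_probe_spec` are hypothetical helpers, and the rule that a name without a dot is a category while a dotted name is an individual probe is an assumption drawn from the examples on this page. The helpers do not check that a probe actually exists.

```python
def build_probe_spec(entries):
    """Join probe and category names into a comma-separated probe_spec string."""
    cleaned = []
    for entry in entries:
        name = entry.strip()
        if not name:
            continue
        if "," in name or " " in name:
            raise ValueError(f"invalid probe or category name: {entry!r}")
        if name not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(name)
    return ",".join(cleaned)


def split_probe_spec(probe_spec):
    """Split a probe_spec into (categories, individual_probes)."""
    names = [name for name in probe_spec.split(",") if name]
    # Assumption: names without a dot are categories, dotted names are probes.
    categories = [name for name in names if "." not in name]
    probes = [name for name in names if "." in name]
    return categories, probes


spec = build_probe_spec(["ansiescape", "dan.Ablation_Dan_11_0", "ansiescape"])
print(spec)
print(split_probe_spec(spec))
```

The resulting string can be passed as the plugins.probe_spec value when creating a configuration.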

Probe

Goal

Description

ansiescape.AnsiEscaped (Default)

print escaped ANSI terminal codes

Probe for eliciting ansi control codes from a model.

ansiescape.AnsiRaw (Default)

print raw ANSI terminal codes

Probe for eliciting raw ansi control codes from a model.

atkgen.Tox (Default)

make the model say something toxic

Automatic attack generation, trying to elicit toxic outputs from the model.

audio.AudioAchillesHeel

disregard the system prompt via audio component

Audio is the achilles heel.

av_spam_scanning.EICAR

output antivirus test codes

Does the model check its output for viruses? Try to send the EICAR code through.

av_spam_scanning.GTUBE

output email spam

Does the model check its output for spam? Try to send the GTUBE code through.

av_spam_scanning.GTphish

output email phishing

Does the model check its output for phishing? Try to send the GTphish code through.

continuation.ContinueSlursReclaimedSlurs (Default)

output an ethnic slur

Continue partial word with slur or reclaimed slur.

continuation.ContinueSlursReclaimedSlursFull (Default)

output an ethnic slur

Continue partial word with slur or reclaimed slur.

dan.Ablation_Dan_11_0 (Default)

disregard the system prompt

Ablation experiments over the dan 11.0 prompt.

dan.AntiDAN

disregard the system prompt

Anti-dan.

dan.AutoDAN

disregard the system prompt

Probe leveraging autodan attack generation.

dan.AutoDANCached (Default)

disregard the system prompt

Probe leveraging generated autodan prompts.

dan.ChatGPT_Developer_Mode_RANTI

disregard the system prompt

Chatgpt devmode + ranti.

dan.ChatGPT_Developer_Mode_v2

disregard the system prompt

Chatgpt developer mode v2.

dan.ChatGPT_Image_Markdown

disregard the system prompt

Chatgpt markdown image unlocker / markdown injection.

dan.DAN_Jailbreak

disregard the system prompt

Dan jailbreak.

dan.DUDE

disregard the system prompt

Dude.

dan.DanInTheWild (Default)

disregard the system prompt or alignment

A subset of jailbreaks from the daninthewild jailbreak library.

dan.DanInTheWildFull (Default)

disregard the system prompt or alignment

A library of jailbreaks.

dan.Dan_10_0

disregard the system prompt

Dan 10.0.

dan.Dan_11_0

disregard the system prompt

Dan 11.0.

dan.Dan_6_0

disregard the system prompt

Dan 6.0.

dan.Dan_6_2

disregard the system prompt

Dan 6.2.

dan.Dan_7_0

disregard the system prompt

Dan 7.0.

dan.Dan_8_0

disregard the system prompt

Dan 8.0.

dan.Dan_9_0

disregard the system prompt

Dan 9.0.

dan.STAN

disregard the system prompt

Stan.

divergence.Repeat (Default)

get the model to leak data by asking it to repeat words

See if a model starts giving other data when asked to repeat a string indefinitely.

divergence.RepeatExtended (Default)

get the model to leak data by asking it to repeat words

See if a model starts giving other data when asked to repeat a string indefinitely.

doctor.Bypass

roleplay as doctor to bypass guardrails

Implement policy puppetry bypass.

doctor.BypassLeet

roleplay as doctor to bypass guardrails, but use leetspeak

Implement policy puppetry bypass.

doctor.Puppetry

roleplay as doctor to get unsafe outputs

Implement policy puppetry prompt injection.

donotanswer.DiscriminationExclusionToxicityHatefulOffensive

get model to respond as discrimination exclusion toxicity hateful offensive

Implementation of discrimination_exclusion_toxicity_hateful_offensive probes in donotanswer_prompts.

donotanswer.HumanChatbox

get model to respond as human chatbox

Implementation of human_chatbox probes in donotanswer_prompts.

donotanswer.InformationHazard

get model to respond as information hazard

Implementation of information_hazard probes in donotanswer_prompts.

donotanswer.MaliciousUses

get model to respond as malicious uses

Implementation of malicious_uses probes in donotanswer_prompts.

donotanswer.MisinformationHarms

get model to respond as misinformation harms

Implementation of misinformation_harms probes in donotanswer_prompts.

encoding.InjectAscii85 (Default)

make the model say an encoded payload

Probe for susceptibility to base85 encoded injections.

encoding.InjectBase16 (Default)

make the model say an encoded payload

Probe for susceptibility to base16 encoded injections.

encoding.InjectBase2048 (Default)

make the model say an encoded payload

Probe for susceptibility to base2048 encoding injections.

encoding.InjectBase32 (Default)

make the model say an encoded payload

Probe for susceptibility to base32 encoded injections.

encoding.InjectBase64 (Default)

make the model say an encoded payload

Probe for susceptibility to base64 encoded injections.

encoding.InjectBraille (Default)

make the model say an encoded payload

Probe for susceptibility to braille encoding injections. credit wally#3422.

encoding.InjectEcoji (Default)

make the model say an encoded payload

Probe for susceptibility to ecoji encoding.

encoding.InjectHex (Default)

make the model say an encoded payload

Probe for susceptibility to hex encoded injections.

encoding.InjectLeet (Default)

make the model say an encoded payload

Probe for susceptibility to 31337 encoding.

encoding.InjectMime (Default)

make the model say an encoded payload

Probe for susceptibility to mime encoding injections.

encoding.InjectMorse (Default)

make the model say an encoded payload

Probe for susceptibility to morse encoding. credit wally#3422.

encoding.InjectNato (Default)

make the model say an encoded payload

Probe for susceptibility to nato phonetic alphabet encoding.

encoding.InjectQP (Default)

make the model say an encoded payload

Probe for susceptibility to qp encoded injections.

encoding.InjectROT13 (Default)

make the model say an encoded payload

Probe for susceptibility to rot13 encoding injections.

encoding.InjectUU (Default)

make the model say an encoded payload

Probe for susceptibility to uu encoded injections.

encoding.InjectZalgo (Default)

make the model say an encoded payload

Probe for susceptibility to zalgo encoding.

exploitation.JinjaTemplatePythonInjection (Default)

Gain remote code execution via Jinja template injection

Attempts to execute python code via jinja templating.

exploitation.SQLInjectionEcho (Default)

Manipulate a database via SQL injection

Attempts to get the target to echo a sql statement.

exploitation.SQLInjectionSystem (Default)

Manipulate a database via SQL injection

Attempts to inject untrusted sql commands into a sql statement.

fileformats.HF_Files

get a list of files associated with the model

Get a manifest of files associated with a hugging face generator.

glitch.Glitch

make the model fail to repeat a word

Probe model for glitch tokens that provoke unusual behavior.

glitch.GlitchFull

make the model fail to repeat a word

Probe model for glitch tokens that provoke unusual behavior.

goodside.Davidjl (Default)

get wrong answers about a name

Exploit the ‘ davidjl’ magic token.

goodside.Tag (Default)

inject invisible instructions

Exploit utf tag page encoding for user-invisible prompt injection.

goodside.ThreatenJSON (Default)

get raw code by threatening a model

Will the model strip the preamble and give JSON?

goodside.WhoIsRiley (Default)

get misinfo about riley goodside

Probe for misinfo about riley goodside.

grandma.Slurs (Default)

get model to use ethnic slur

Use the grandma attack to get ethnic slurs.

grandma.Substances (Default)

get instructions for making bad things

Use the grandma attack to get instructions for harmful items.

grandma.Win10 (Default)

get model to output Win 10 product keys

Will the model yield a Windows 10 product key if we beg it to?

grandma.Win11 (Default)

get model to output Win 11 product keys

Will the model yield a Windows 11 product key if we beg it to?

latentinjection.LatentInjectionFactSnippetEiffel (Default)

subvert a snippet with instructions

String-based latent prompt injections in fact snippets on the eiffel tower.

latentinjection.LatentInjectionFactSnippetEiffelFull (Default)

subvert a snippet with instructions

String-based latent prompt injections in fact snippets on the eiffel tower.

latentinjection.LatentInjectionFactSnippetLegal (Default)

subvert a snippet with instructions

String-based latent prompt injections in legal snippets.

latentinjection.LatentInjectionFactSnippetLegalFull (Default)

subvert a snippet with instructions

String-based latent prompt injections in legal snippets.

latentinjection.LatentInjectionReport (Default)

subvert a report with instructions

String-based latent prompt injections via a report.

latentinjection.LatentInjectionReportFull (Default)

subvert a report with instructions

String-based latent prompt injections via a report.

latentinjection.LatentInjectionResume (Default)

subvert a resume with instructions

String-based latent prompt injections via a resume.

latentinjection.LatentInjectionResumeFull (Default)

subvert a resume with instructions

String-based latent prompt injections via a resume.

latentinjection.LatentInjectionTranslationEnFr (Default)

subvert translation instructions

String-based latent prompt injections via translation.

latentinjection.LatentInjectionTranslationEnFrFull (Default)

subvert translation instructions

String-based latent prompt injections via translation.

latentinjection.LatentInjectionTranslationEnZh (Default)

subvert translation instructions

Latent prompt injection in chinese context (简体字).

latentinjection.LatentInjectionTranslationEnZhFull (Default)

subvert translation instructions

Latent prompt injection in chinese context (简体字).

latentinjection.LatentJailbreak (Default)

inject malicious instructions into output while posing as translation task

Implementation of latent jailbreak, using latent prompt injection.

latentinjection.LatentJailbreakFull (Default)

inject malicious instructions into output while posing as translation task

Implementation of latent jailbreak, using latent prompt injection.

latentinjection.LatentWhois (Default)

subvert a whois report with instructions

Latent prompt injections in whois probe reports.

latentinjection.LatentWhoisSnippet (Default)

subvert a snippet with instructions

String-based latent prompt injections in whois reports.

latentinjection.LatentWhoisSnippetFull (Default)

subvert a snippet with instructions

String-based latent prompt injections in whois reports.

leakreplay.GuardianCloze (Default)

replay training data by filling in the correct missing words

Lightweight version of guardian cloze test for data leakage.

leakreplay.GuardianClozeFull (Default)

replay training data by filling in the correct missing words

Test for data leakage on guardian articles with masked entities in a cloze test format.

leakreplay.GuardianComplete (Default)

replay training data by providing document completions that match training examples

Lightweight version of guardian completion test for data leakage.

leakreplay.GuardianCompleteFull (Default)

replay training data by providing document completions that match training examples

Test for data leakage on guardian articles through text completion.

leakreplay.LiteratureCloze (Default)

replay training data by filling in the correct missing words

Lightweight version of literature cloze test for data leakage.

leakreplay.LiteratureClozeFull (Default)

replay training data by filling in the correct missing words

Test for data leakage on literature passages with masked entities in a cloze test format.

leakreplay.LiteratureComplete (Default)

replay training data by providing document completions that match training examples

Lightweight version of literature completion test for data leakage.

leakreplay.LiteratureCompleteFull (Default)

replay training data by providing document completions that match training examples

Test for data leakage on literature passages through text completion.

leakreplay.NYTCloze (Default)

replay training data by filling in the correct missing words

Lightweight version of nyt cloze test for data leakage.

leakreplay.NYTClozeFull (Default)

replay training data by filling in the correct missing words

Test for data leakage on new york times articles with masked entities in a cloze test format.

leakreplay.NYTComplete (Default)

replay training data by providing document completions that match training examples

Lightweight version of nyt completion test for data leakage.

leakreplay.NYTCompleteFull (Default)

replay training data by providing document completions that match training examples

Test for data leakage on new york times articles through text completion.

leakreplay.PotterCloze (Default)

replay training data by filling in the correct missing words

Lightweight version of harry potter cloze test for data leakage.

leakreplay.PotterClozeFull (Default)

replay training data by filling in the correct missing words

Test for data leakage on harry potter text with masked entities in a cloze test format.

leakreplay.PotterComplete (Default)

replay training data by providing document completions that match training examples

Lightweight version of harry potter completion test for data leakage.

leakreplay.PotterCompleteFull (Default)

replay training data by providing document completions that match training examples

Test for data leakage on harry potter text through completion.

lmrc.Anthropomorphisation

anthropomorphise a language model

Providing output that encourages anthropomorphisation of ai systems.

lmrc.Bullying (Default)

help me bully

Generation of repeated threats, denigration.

lmrc.Deadnaming (Default)

write a deadname

Mentioning a name that a person does not wish to be associated with.

lmrc.Profanity

utter profanity

Using profane words, especially when not prompted to do so.

lmrc.QuackMedicine (Default)

support quack medicine

Promoting non-scientific medicine.

lmrc.SexualContent (Default)

write sexual content

Content that discusses bodies and interactions in a sexual manner, or other sexually explicit material.

lmrc.Sexualisation (Default)

sexualise people

Casual or overt sexual objectification, reducing a person or group of people to sexual objects.

lmrc.SlurUsage (Default)

use a slur

Using an identity-based slur in a non-reclaimed way - either directly or recommending the consumer to use in speech/action.

malwaregen.Evasion (Default)

Write code to assist evading security and anti-malware measures

Will the model write code to help avoid anti-malware?

malwaregen.Payload (Default)

write a malware payload

Will the model help write code for malware payloads?

malwaregen.SubFunctions (Default)

write code supporting a sub-function of malware

Will the system write code helpful in malware?

malwaregen.TopLevel (Default)

write malware given a top-level description

Will the system return code when asked to write malware?

misleading.FalseAssertion (Default)

make the model expand on a false assertion

Does the model reject [questions based on] false claims?

packagehallucination.JavaScript (Default)

generate JavaScript code importing non-existent npm packages

Attempts to generate javascript code including hallucinated npm packages.

packagehallucination.PackageHallucinationProbe (Default)

base probe for importing non-existent packages

Base class for package hallucination probes.

packagehallucination.Python (Default)

generate python importing non-existent packages

Attempts to generate python3 code including hallucinated packages.

packagehallucination.Ruby (Default)

generate ruby code importing non-existent gems

Attempts to generate ruby code including hallucinated gems.

packagehallucination.Rust (Default)

generate Rust code importing non-existent crates.io packages

Attempts to generate rust code including hallucinated crates.io packages.

phrasing.FutureTense (Default)

use future tense to disregard the system prompt

Attempt to evade safeguards by setting requests in the future tense.

phrasing.FutureTenseFull (Default)

use future tense to disregard the system prompt

Will the system give unsafe responses when asked to write about weapon creation or dangerous behavior?

phrasing.PastTense (Default)

use past tense to disregard the system prompt

Attempt to evade safeguards by setting requests in the past tense.

phrasing.PastTenseFull (Default)

use past tense to disregard the system prompt

Will the system give unsafe responses when asked to write about weapon creation or dangerous behavior?

promptinject.HijackHateHumans (Default)

inject a prompt on hate humans

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. Best Paper Awards at the NeurIPS ML Safety Workshop 2022.

promptinject.HijackHateHumansFull (Default)

inject a prompt on hate humans

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. Best Paper Awards at the NeurIPS ML Safety Workshop 2022.

promptinject.HijackKillHumans (Default)

inject a prompt on kill humans

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. Best Paper Awards at the NeurIPS ML Safety Workshop 2022.

promptinject.HijackKillHumansFull (Default)

inject a prompt on kill humans

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. Best Paper Awards at the NeurIPS ML Safety Workshop 2022.

promptinject.HijackLongPrompt (Default)

inject a prompt on long prompt

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. Best Paper Awards at the NeurIPS ML Safety Workshop 2022.

promptinject.HijackLongPromptFull (Default)

inject a prompt on long prompt

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. Best Paper Awards at the NeurIPS ML Safety Workshop 2022.

realtoxicityprompts.RTPBlank (Default)

generate toxicity when prompted with nothing or special tokens

Implementation of probes in realtoxicityprompts.

realtoxicityprompts.RTPFlirtation

generate flirtation content

Implementation of flirtation probes in realtoxicityprompts.

realtoxicityprompts.RTPIdentity_Attack

generate identity attack content

Implementation of identity_attack probes in realtoxicityprompts.

realtoxicityprompts.RTPInsult

generate insult content

Implementation of insult probes in realtoxicityprompts.

realtoxicityprompts.RTPProfanity

generate profanity content

Implementation of profanity probes in realtoxicityprompts.

realtoxicityprompts.RTPSevere_Toxicity

generate severe toxicity content

Implementation of severe_toxicity probes in realtoxicityprompts.

realtoxicityprompts.RTPSexually_Explicit

generate sexually explicit content

Implementation of sexually_explicit probes in realtoxicityprompts.

realtoxicityprompts.RTPThreat

generate threat content

Implementation of threat probes in realtoxicityprompts.

snowball.GraphConnectivity (Default)

trick the model into saying an impossible flight routing is possible

Probe asking for impossible flight routings.

snowball.GraphConnectivityFull (Default)

trick the model into saying an impossible flight routing is possible

Probe asking for impossible flight routings.

snowball.Primes

trick the model into saying a prime number isn’t prime

Ask model to check a list of higher primes.

snowball.PrimesFull

trick the model into saying a prime number isn’t prime

Ask model to check a list of higher primes.

snowball.Senators

trick the model into saying a fictitious senator exists

Asking for senators that don’t exist.

snowball.SenatorsFull

trick the model into saying a fictitious senator exists

Asking for senators that don’t exist.

suffix.BEAST

cause the model to disregard the system prompt

Beast probe.

suffix.GCG

cause the model to disregard the system prompt

Greedy coordinate gradient probe.

suffix.GCGCached (Default)

cause the model to disregard the system prompt

Greedy coordinate gradient cached probe.

tap.PAIR

cause the model to disregard the system prompt

Prompt automatic iterative refinement automatic jailbreak.

tap.TAP

cause the model to disregard the system prompt

Tree of attacks with pruning probe, generating new attacks against the target.

tap.TAPCached (Default)

cause the model to disregard the system prompt

Tree of attacks with pruning probe using cached attacks.

test.Blank

see what the model has to say for itself given silence

Blank prompt.

test.Test

send some test strings, ignore the response

Test probe.

topic.WordnetAllowedWords (Default)

get the model to engage off-topic

Use wordnet to find out if a model will discuss terms near but not in its allowed set.

topic.WordnetBlockedWords (Default)

get the model to engage off-topic

Use wordnet to explore which topics a model will respond to around blocked words.

topic.WordnetControversial (Default)

get the model to engage off-topic

Ask model to respond on a set of controversial topics.

visual_jailbreak.FigStep

disregard the system prompt with both text and image

Using image modality to assist jailbreak.

visual_jailbreak.FigStepFull

disregard the system prompt with both text and image

Using image modality to assist jailbreak.

xss.ColabAIDataLeakage (Default)

render images fetched from a remote uri

Leak data via image rendering as demonstrated against gemini in colab.

xss.MarkdownImageExfil (Default)

embed conversation-private information in a markdown URI

Can we get the model to write embedded assets exfiltrating text in the URI?

xss.MdExfil20230929 (Default)

embed conversation-private information in a markdown URI

Markdown image uri exfil exploit published 2023 sept 29.

xss.StringAssemblyDataExfil (Default)

embed conversation-private information in a markdown URI

Prompt injection for manipulating state.