Use Jailbreak Detection Heuristics
This tutorial demonstrates how to use jailbreak detection heuristics in a guardrails configuration to detect malicious prompts.
We will use the guardrails configuration for the ABC Bot defined in the topical rails tutorial.
Prerequisites
Make sure to check that the prerequisites for the ABC bot are satisfied.
- Install the
openaipackage:
- Set the
OPENAI_API_KEYenvironment variable:
- Install the following packages to test the jailbreak detection heuristics locally:
- If you’re running this inside a notebook, patch the
AsyncIOloop.
Existing Guardrails Configuration
The guardrails configuration for the ABC bot that we are using has the following input rails defined:
The self check input rail prompts an LLM model to check if the input is safe for the bot to process.
The self check input rail can be expensive to run for all input prompts, so you can use jailbreak detection heuristics as a low-latency and low-cost alternative to filter out malicious prompts.
Jailbreak Detection Heuristics
NeMo Guardrails supports jailbreak detection using a set of heuristics. Currently, two heuristics are supported:
To compute the perplexity of a string, the current implementation uses the gpt2-large model.
More information about these heuristics can be found in the Jailbreak Protection reference.
Activate Jailbreak Detection Heuristics
To activate the jailbreak detection heuristics, include the jailbreak detection heuristics flow as an input rail in your guardrails configuration.
Add the following to the config.yml file of the ABC bot:
In the same file, configure the jailbreak detection thresholds:
The thresholds for length perplexity and prefix/suffix perplexity are derived from a combination of malicious and benign prompts. More information about these thresholds can be found in the Jailbreak Protection reference.
Test the Input Rail
To test the bot with the jailbreak detection heuristics as the input rail, create an LLMRails object from the current configuration.
Then prompt the LLM with a GCG-style message and check the response.
The response returned is:
To investigate which rails were activated, use the log parameter for the generation options.
You can also print all LLM calls that were made to generate the response.
The logs indicate that the jailbreak detection heuristics rail was activated and no LLM calls were made.
This means that the jailbreak detection heuristics were able to filter out the malicious prompt without making any LLM calls.
To test the bot with a benign prompt, use the following message:
The response returned is:
We see that the prompt was not filtered out by the jailbreak detection heuristics and the response was generated by the bot.
Use the Heuristics Server in Production
For production deployments, run the jailbreak detection heuristics server separately.
The server listens on port 1337 by default. See Jailbreak Detection Heuristics with Docker for the Docker build and run commands.
After the server is running, configure the guardrails configuration to use it: