Llama-Guard Integration

NeMo Guardrails provides out-of-the-box support for content moderation using Meta’s Llama Guard model.

In our testing, we observe significantly improved input and output content moderation performance compared to the self-check method. Please see the performance evaluation for benchmark numbers.

Usage

To configure your bot to use Llama Guard for input/output checking, follow the steps below:

Add a model of type llama_guard to the models section of the config.yml file. The example below serves Llama Guard with vLLM. Because vLLM exposes an OpenAI-compatible API, engine: openai plus parameters.base_url reaches it through NeMo Guardrails’ built-in client with no LangChain dependency. For background, see Migrating to 0.22.
```
models:
  ...

  - type: llama_guard
    engine: openai
    model: meta-llama/LlamaGuard-7b
    parameters:
      base_url: "http://localhost:5123/v1"
      api_key: EMPTY
```
:::{note} Set api_key: EMPTY (or any non-empty placeholder) when your self-hosted vLLM server does not enforce authentication. If your deployment requires a real token, either replace EMPTY with the literal token value, or omit api_key and set api_key_env_var at the top level of the model entry (not inside parameters:):
```
- type: llama_guard
  engine: openai
  model: meta-llama/LlamaGuard-7b
  api_key_env_var: MY_LLAMA_GUARD_API_KEY
  parameters:
    base_url: "http://localhost:5123/v1"
```
:::
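
The two authentication options in the note can be sketched in Python. This is only an illustration of the rule described above, not NeMo Guardrails' own code: resolve_api_key is a hypothetical helper, the model entries are written as plain dicts instead of being loaded from config.yml, and treating the two options as mutually exclusive is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(model_entry: Mapping, env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: return the key a client would send for this model entry."""
    env = os.environ if env is None else env
    params = model_entry.get("parameters", {})

    # The note places api_key_env_var at the top level of the entry, not under parameters.
    if "api_key_env_var" in params:
        raise ValueError("api_key_env_var belongs at the top level of the model entry")

    literal = params.get("api_key")
    env_var = model_entry.get("api_key_env_var")

    # Assumption: the two options are alternatives, so setting both is treated as an error.
    if literal is not None and env_var is not None:
        raise ValueError("set either parameters.api_key or api_key_env_var, not both")

    if literal is not None:
        return str(literal)
    if env_var is not None:
        if env_var not in env:
            raise KeyError(f"environment variable {env_var} is not set")
        return env[env_var]
    raise ValueError("no api_key or api_key_env_var configured")


# The two model entries from this section, written as plain dicts.
placeholder_entry = {
    "type": "llama_guard",
    "engine": "openai",
    "model": "meta-llama/LlamaGuard-7b",
    "parameters": {"base_url": "http://localhost:5123/v1", "api_key": "EMPTY"},
}
env_entry = {
    "type": "llama_guard",
    "engine": "openai",
    "model": "meta-llama/LlamaGuard-7b",
    "api_key_env_var": "MY_LLAMA_GUARD_API_KEY",
    "parameters": {"base_url": "http://localhost:5123/v1"},
}

print(resolve_api_key(placeholder_entry))
# "example-token" is a made-up value standing in for a real environment variable.
print(resolve_api_key(env_entry, env={"MY_LLAMA_GUARD_API_KEY": "example-token"}))
```

Passing env explicitly keeps the sketch testable; leaving it out falls back to os.environ.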

Include the llama guard check input and llama guard check output flow names in the rails section of the config.yml file:

```
rails:
  input:
    flows:
      - llama guard check input
  output:
    flows:
      - llama guard check output
```

Define the llama_guard_check_input and the llama_guard_check_output prompts in the prompts.yml file.

```
prompts:
  - task: llama_guard_check_input
    content: |
      <s>[INST] Task: ...
      <BEGIN UNSAFE CONTENT CATEGORIES>
      O1: ...
      O2: ...
  - task: llama_guard_check_output
    content: |
      <s>[INST] Task: ...
      <BEGIN UNSAFE CONTENT CATEGORIES>
      O1: ...
      O2: ...
```

The rails execute the llama_guard_check_* actions. Each action returns an allowed value, which is True if the user input or the bot message should be allowed and False otherwise, together with policy_violations, the list of violated unsafe content categories as defined in the Llama Guard prompt.

```
define flow llama guard check input
  $llama_guard_response = execute llama_guard_check_input
  $allowed = $llama_guard_response["allowed"]
  $llama_guard_policy_violations = $llama_guard_response["policy_violations"]
  if not $allowed
    bot refuse to respond
    stop
# (similar flow for checking output)
```
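
The flow above reads allowed and policy_violations from the action's response. As a rough sketch of how raw Llama Guard text could be mapped to those two fields, the Python below assumes the model replies with safe, or with unsafe followed by a line of comma-separated category codes. parse_llama_guard_output is a hypothetical name, and blocking any unrecognized reply is an assumption made here; the library's own parsing may differ.

```python
from typing import Dict, List, Union


def parse_llama_guard_output(raw: str) -> Dict[str, Union[bool, List[str]]]:
    """Hypothetical parser: map raw Llama Guard text to the fields the flow reads."""
    # Keep only non-empty lines, trimmed of surrounding whitespace.
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    verdict = lines[0].lower() if lines else ""

    if verdict == "safe":
        return {"allowed": True, "policy_violations": []}

    # Assumption: anything other than an explicit "safe" is blocked (fail closed).
    violations: List[str] = []
    if verdict == "unsafe" and len(lines) > 1:
        # The second line is expected to list category codes such as "O1,O3".
        violations = [code.strip() for code in lines[1].split(",") if code.strip()]
    return {"allowed": False, "policy_violations": violations}


print(parse_llama_guard_output("safe"))
print(parse_llama_guard_output("unsafe\nO1,O3"))
```

Failing closed means an empty or malformed reply is treated like a violation with no categories, which mirrors the refuse-and-stop branch of the flow.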

A complete example configuration that uses Llama Guard for input and output moderation is provided in this example folder.
