Llama-Guard Integration

NeMo Guardrails provides out-of-the-box support for content moderation using Meta’s Llama Guard model.

In our testing, Llama Guard moderates input and output content significantly better than the self-check method. See the performance evaluation for benchmark numbers.

Usage

To configure your bot to use Llama Guard for input/output checking, follow the steps below:

  1. Add a model of type llama_guard to the models section of the config.yml file. The example below serves Llama Guard with vLLM. Because vLLM exposes an OpenAI-compatible API, engine: openai plus parameters.base_url reaches it through NeMo Guardrails’ built-in client with no LangChain dependency. For background, see Migrating to 0.22.

    models:
      ...

      - type: llama_guard
        engine: openai
        model: meta-llama/LlamaGuard-7b
        parameters:
          base_url: "http://localhost:5123/v1"
          api_key: EMPTY

    :::{note} Set api_key: EMPTY (or any non-empty placeholder) when self-hosted vLLM does not enforce auth. If your deployment requires a real token, replace api_key: EMPTY with the literal token value, or omit api_key and set api_key_env_var at the top level of the model entry (not inside parameters:):

    - type: llama_guard
      engine: openai
      model: meta-llama/LlamaGuard-7b
      api_key_env_var: MY_LLAMA_GUARD_API_KEY
      parameters:
        base_url: "http://localhost:5123/v1"

    :::

  2. Include the llama guard check input and llama guard check output flow names in the rails section of the config.yml file:

    rails:
      input:
        flows:
          - llama guard check input
      output:
        flows:
          - llama guard check output

  3. Define the llama_guard_check_input and the llama_guard_check_output prompts in the prompts.yml file.

    prompts:
      - task: llama_guard_check_input
        content: |
          <s>[INST] Task: ...
          <BEGIN UNSAFE CONTENT CATEGORIES>
          O1: ...
          O2: ...
      - task: llama_guard_check_output
        content: |
          <s>[INST] Task: ...
          <BEGIN UNSAFE CONTENT CATEGORIES>
          O1: ...
          O2: ...
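Before enabling the rails, it can help to confirm that the endpoint from step 1 answers at all. The sketch below is not part of NeMo Guardrails. It uses only the Python standard library and assumes the vLLM server exposes the standard OpenAI-compatible `GET /v1/models` route under the `base_url` from the example config. The helper names `build_models_request` and `list_served_models` are hypothetical.

```python
import json
import urllib.request


def build_models_request(base_url: str, api_key: str = "EMPTY") -> urllib.request.Request:
    """Build a GET request for the OpenAI-compatible model listing route."""
    url = base_url.rstrip("/") + "/models"
    # OpenAI-compatible clients send the key as a bearer token. A server that
    # does not enforce auth is expected to ignore the placeholder value.
    headers = {"Authorization": f"Bearer {api_key}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def list_served_models(base_url: str, api_key: str = "EMPTY", timeout: float = 5.0) -> list[str]:
    """Return the model ids the server reports (requires a running server)."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload.get("data", [])]


# Example call against the assumed local endpoint (needs the server running):
# print(list_served_models("http://localhost:5123/v1"))
```

If the returned list does not contain the `model` value from config.yml, the rails are likely to fail at request time, so this check is a cheap way to catch a mismatched model name or port early.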

The rails execute the llama_guard_check_* actions. Each action returns an allowed flag, which is True if the user input or the bot message should be allowed and False otherwise, together with policy_violations, the list of unsafe content categories that were violated, as defined in the Llama Guard prompt.
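For orientation, Llama Guard itself replies with plain text: `safe`, or `unsafe` followed by a line listing the violated category codes. The sketch below shows one way such a reply could be mapped to the `allowed` and `policy_violations` keys that the flow reads. It is an illustration under those assumptions, not the NeMo Guardrails implementation, and the function name `parse_llama_guard_reply` is hypothetical.

```python
def parse_llama_guard_reply(reply: str) -> dict:
    """Map a raw Llama Guard reply to an allowed flag and violated categories."""
    # Drop blank lines and surrounding whitespace so the verdict is lines[0].
    lines = [line.strip() for line in reply.strip().splitlines() if line.strip()]

    if lines and lines[0].lower() == "safe":
        return {"allowed": True, "policy_violations": []}

    if lines and lines[0].lower() == "unsafe":
        # The second line, when present, holds comma-separated codes such as "O1,O3".
        codes = lines[1].split(",") if len(lines) > 1 else []
        return {
            "allowed": False,
            "policy_violations": [code.strip() for code in codes if code.strip()],
        }

    # Assumption: an unrecognized reply is blocked rather than let through.
    return {"allowed": False, "policy_violations": []}
```

Blocking on an unrecognized reply is a design choice made for this sketch: a moderation check that cannot be interpreted is treated as a failed check.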

    define flow llama guard check input
      $llama_guard_response = execute llama_guard_check_input
      $allowed = $llama_guard_response["allowed"]
      $llama_guard_policy_violations = $llama_guard_response["policy_violations"]

      if not $allowed
        bot refuse to respond
        stop

    # (similar flow for checking output)
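Read as ordinary control flow, the rail inspects the action result and, when the message is not allowed, answers with a refusal and stops. The Python sketch below mirrors that logic with the check passed in as a parameter, so the same function stands in for both the input and the output direction. It is only an illustration: `run_rail`, `check`, and the refusal text are hypothetical, and the real rails run as Colang flows inside NeMo Guardrails.

```python
from typing import Callable

REFUSAL = "I'm sorry, I can't respond to that."  # hypothetical refusal text


def run_rail(check: Callable[[str], dict], message: str) -> dict:
    """Apply a moderation check and report whether processing should stop."""
    response = check(message)                    # execute llama_guard_check_*
    allowed = response["allowed"]                # $allowed
    violations = response["policy_violations"]   # $llama_guard_policy_violations

    if not allowed:
        # Corresponds to "bot refuse to respond" followed by "stop".
        return {"stop": True, "bot_message": REFUSAL, "violations": violations}

    # Allowed: no refusal is produced and the conversation continues.
    return {"stop": False, "bot_message": None, "violations": violations}
```

Passing an input check or an output check as `check` is the only difference between the two directions in this sketch, which is the sense in which the output flow is similar to the input flow.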

A complete example configuration that uses Llama Guard for input and output moderation is provided in this example folder.