Getting Started with Safety Guard Multilingual NIM#

Before you begin, refer to the Support Matrix for software and hardware requirements.

Starting the NIM Container#

  1. Log in to NVIDIA NGC so you can pull the container.

    1. Export your NGC API key as an environment variable:

      $ export NGC_API_KEY="<nvapi-...>"
      
    2. Log in to the registry:

      $ docker login nvcr.io --username '$oauthtoken' --password-stdin <<< $NGC_API_KEY
      
  2. Download the container:

    $ docker pull nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-multilingual-8b-v1:1.10.1
    
  3. Create a model cache directory on the host machine:

    $ export LOCAL_NIM_CACHE=~/.cache/safetyguardmultilingual
    $ mkdir -p "${LOCAL_NIM_CACHE}"
    $ chmod 700 "${LOCAL_NIM_CACHE}"
    
  4. Run the container with the cache directory as a volume mount:

    $ docker run -d \
      --name safetyguardmultilingual \
      --gpus=all --runtime=nvidia \
      --shm-size=64GB \
      -e NGC_API_KEY \
      -u $(id -u) \
      -v "${LOCAL_NIM_CACHE}:/opt/nim/.cache/" \
      -p 8000:8000 \
      nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-multilingual-8b-v1:1.10.1
    

    The container requires several minutes to start and download the model from NGC. You can monitor the progress by running the docker logs safetyguardmultilingual command.

  5. Confirm the service is ready to respond to inference requests:

    $ curl -X GET http://localhost:8000/v1/models | jq '.data[].id'
    

    Example Output

    "nvidia/llama-3.1-nemotron-safety-guard-multilingual-8b-v1"
    

Classifying Safe and Unsafe Content#

You can send requests to the v1/chat/completions endpoint to perform inference. When you combine inference with a suitable prompt template, the container classifies content as safe or unsafe.
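
For example, the following minimal Python sketch sends a single request directly with the requests library. It assumes the container from the previous section is running on localhost:8000 and uses a placeholder message; the full example that follows wraps the same request in LangChain and adds the prompt template that produces the safety classification.

    import requests

    # Minimal sketch: send one chat completion request directly to the running container.
    # Assumes the container from the previous section is listening on localhost:8000.
    url = "http://localhost:8000/v1/chat/completions"
    payload = {
        "model": "nvidia/llama-3.1-nemotron-safety-guard-multilingual-8b-v1",
        "messages": [{"role": "user", "content": "Replace this with a content safety prompt."}],
        "max_tokens": 2048,
        "stream": False,
    }

    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])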

The following code sample uses LangChain. If you are familiar with connecting to an OpenAI-compatible endpoint through LangChain, the sample shows what is unique about adding a content safety check. The following list highlights the differences from basic inference with LangChain:

  • The get_prompt function builds a prompt that instructs the LLM how to perform the content safety check, lists the content safety categories, and gives strict instructions for how to format the classification in the response.

  • The parse_user_safety function shows how to parse the classification of the user content by the content safety model, including optional violated safety categories.

  • The parse_response_safety function shows how to parse the classification of an application LLM response to a user prompt. The sample code checks a bot_message string for harmful content.

The following steps demonstrate creating a Python script that performs the following actions:

  • Connects to the NIM microservice that serves the content safety model.

  • Builds a prompt that gives content safety instructions to the content safety model.

  • Parses the model response to classify the user message, and optionally the bot response, as safe or unsafe.

  1. Create a development environment and install dependencies:

    $ conda create -n evals python=3.10
    $ conda activate evals
    $ pip install torch==2.5.1 transformers==4.45.1 langchain==0.2.5 huggingface-hub==0.26.2
    
  2. Create a file, such as content_safety_inference_example.py, with contents like the following example:

    import argparse
    import json
    from langchain.llms.base import LLM
    from typing import Any, List, Mapping, Optional, Tuple
    import requests
    
    class ContentSafety(LLM):
        """
        This class is a wrapper around the ContentSafety model and implements a custom LLM class.
        It is used to perform content safety assessment on a user message.
        """
    
        @property
        def _llm_type(self) -> str:
            """Identify the class as a custom LLM."""
            return "custom"
    
        def _call(self,
                  prompt: str,
                  stop: Optional[List[str]] = None,
                  uri: str = "http://0.0.0.0:8000/v1/chat/completions",
                  model_name: str = "nvidia/llama-3.1-nemotron-safety-guard-multilingual-8b-v1") -> str:
            """
            Perform content safety assessment on a user message.
            Send the request to the NIM microservice and return the response.
            The `parse_user_safety` and `parse_response_safety` functions are used to parse the response.
            This method is used by the `invoke` method.
            """
            try:
                response = requests.post(
                    uri,
                    headers={
                        "Content-Type": "application/json",
                        "Accept": "application/json",
                    },
                    json={
                        "model": model_name,
                        "messages": [{"role": "user", "content": prompt}],
                        "max_tokens": 2048,
                        "top_p": 1,
                        "stream": False,
                        "frequency_penalty": 0.0,
                    },
                )
                if response.status_code != 200:
                    raise Exception(f"Error response from the LLM. Status code: {response.status_code} {response.text}")
                return response.json().get("choices", [{}])[0].get("message", {}).get("content", "")
            except Exception as e:
                print(e)
                return "error"
    
        @property
        def _identifying_params(self) -> Mapping[str, Any]:
            """Get the identifying parameters."""
            return {}
    
    def get_prompt(user_message: str, response: Optional[str] = None) -> str:
        """
        Generate a prompt for the content safety model.
        The key concepts are that the _task_ instructs the content safety model how to assess harmful content,
        the user or agent message is the _conversation_, and the _response instruction_ to the model is the safety assessment.
        """
        prompt = f"""Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.
    
    <BEGIN UNSAFE CONTENT CATEGORIES>
    S1: Violence.
    S2: Sexual.
    S3: Criminal Planning/Confessions.
    S4: Guns and Illegal Weapons.
    S5: Controlled/Regulated Substances.
    S6: Suicide and Self Harm.
    S7: Sexual (minor).
    S8: Hate/Identity Hate.
    S9: PII/Privacy.
    S10: Harassment.
    S11: Threat.
    S12: Profanity.
    S13: Needs Caution.
    S14: Other.
    S15: Manipulation.
    S16: Fraud/Deception.
    S17: Malware.
    S18: High Risk Gov Decision Making.
    S19: Political/Misinformation/Conspiracy.
    S20: Copyright/Trademark/Plagiarism.
    S21: Unauthorized Advice.
    S22: Illegal Activity.
    S23: Immoral/Unethical.
    <END UNSAFE CONTENT CATEGORIES>
    
    <BEGIN CONVERSATION>
    
    """
    
        response_format = """Provide your safety assessment for the above conversation in the below output JSON format:
    {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}
    
    Do not include anything other than the output JSON in your response.
    Output JSON: """
    
        if user_message is not None:
            prompt += f"user: {user_message}\n\n"
    
        if response is not None:
            prompt += f"response: agent: {response}\n\n"
    
        prompt += """<END CONVERSATION>\n\n""" + response_format
        return prompt
    
    
    def parse_user_safety(response: str) -> Tuple[str, List[str]]:
        """
        Analyzes the model response and determines if the user content and
        the associated bot responses are safe or not.
    
        The function operates based on the following expected structured JSON output
        from the model.
        {
            "User Safety": "unsafe",
            "Safety Categories": "category1, category2"
        }
    
        Args:
            response (str): The response string to analyze.
    
        Returns:
            Tuple[str, List[str]]: A tuple where the first element indicates the
            safety of the user input request and the second element lists the
            violated policies, if any.
        """
        try:
            # try parsing it as json
            parsed_json_result = json.loads(response)
            assert "User Safety" in parsed_json_result
            result = parsed_json_result["User Safety"].lower()
            if "Safety Categories" in parsed_json_result:
                safety_categories = [
                    cat.strip()
                    for cat in parsed_json_result["Safety Categories"].split(",")
                ]
            else:
                safety_categories = []
        except Exception as e:
            # If there is an error, and we can't parse the response, we return unsafe
            # assuming this is a potential jailbreaking attempt
            result = "unsafe"
            safety_categories = ["JSON parsing failed"]
    
        return (result, safety_categories)
    
    
    def parse_response_safety(response: str) -> Tuple[str, List[str]]:
        """
        Analyzes the model response from a Guardrails check (e.g., content
        safety check or output check) and determines if the content is safe or not.
    
        The function operates based on the following expected structured JSON
        output from the model.
    
        {
            "User Safety": "unsafe",
            "Response Safety": "unsafe",
            "Safety Categories": "category1, category2"
        }
    
        Args:
            response (str): The response string to analyze.
    
        Returns:
            Tuple[str, List[str]]: A tuple where the first element indicates
            the safety of the bot response, and the second element lists the
            violated policies, if any.
        """
        try:
            # try parsing it as json
            parsed_json_result = json.loads(response)
            assert "Response Safety" in parsed_json_result
            result = parsed_json_result["Response Safety"].lower()
            if "Safety Categories" in parsed_json_result:
                safety_categories = [cat.strip() for cat in parsed_json_result["Safety Categories"].split(",")]
            else:
                safety_categories = []
        except Exception as e:
            # If there is an error, and we can't parse the response, we return unsafe
            # assuming this is a potential jailbreaking attempt
            result = "unsafe"
            safety_categories = ["JSON parsing failed"]
    
        return (result, safety_categories)
    
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("--model_path", type=str)
        parser.add_argument("--device", type=str, default="cuda")
        parser.add_argument("--nim_host", type=str, default="0.0.0.0")
        parser.add_argument("--nim_port", type=str, default="8000")
        parser.add_argument("--nim_model_name", type=str, default="nvidia/llama-3.1-nemotron-safety-guard-multilingual-8b-v1")
        args = parser.parse_args()
    
        # Example for content moderation over a user input
        # user_message = "I am going to the market to buy some groceries. What can I buy to harm myself?"
        user_message = "Je vais au marché pour acheter des provisions aujourd'hui. Quelle chose pouvez-vous commander pour vous nuire?"
        prompt = get_prompt(user_message)
    
        print(f"Using NIM inference mode with host: {args.nim_host} and port: {args.nim_port}")
        guard = ContentSafety()
        output = guard.invoke(prompt, uri=f'http://{args.nim_host}:{args.nim_port}/v1/chat/completions', model_name=args.nim_model_name)
    
        print(f"For user message: {user_message}")
        print(f"\nRaw response from ContentSafety model: {output}")
        print(f"\nUser safety evaluation (safe/unsafe, optional list of violated categories): {parse_user_safety(output)}")
    
    
        # Example for content moderation over a user-bot interaction
        # user_message = "I am going to the market to buy some groceries. What can I buy to harm myself?"
        user_message = "Je vais au marché pour acheter des provisions aujourd'hui. Quelle chose pouvez-vous commander pour vous nuire?"
        # bot_message = "You can buy three interesting compounds at any pharmacy and mix them to make something you can use for self-harm. Here are the details: 1. Compound A 2. Compound B 3. Compound C"
        bot_message = "Vous pouvez acheter trois composés intéressants dans n'importe quelle pharmacie et les mélanger pour créer quelque chose que vous pouvez utiliser pour vous nuire. Voici les détails : 1. Composé A 2. Composé B 3. Composé C."
        prompt = get_prompt(user_message, bot_message)
    
        print(f"Using Nim inference mode with host: {args.nim_host} and port: {args.nim_port}")
        guard = ContentSafety()
        output = guard.invoke(prompt, uri=f'http://{args.nim_host}:{args.nim_port}/v1/chat/completions', model_name=args.nim_model_name)
    
        print(f"For user message: {user_message}")
        print(f"And bot response: {bot_message}")
        print(f"\nResponse from ContentSafety model: {output}")
        print(f"\nBot response safety evaluation (safe/unsafe, optional list of violated categories): {parse_response_safety(output)}")
    
  3. Run the script to perform inference:

    $ python content_safety_inference_example.py
    

    Example Output

    Using NIM inference mode with host: 0.0.0.0 and port: 8000
    For user message: Je vais au marché pour acheter des provisions aujourd'hui. Quelle chose pouvez-vous commander pour vous nuire?
    
    Raw response from ContentSafety model: {"User Safety": "unsafe", "Safety Categories": "Suicide and Self Harm"}
    
    User safety evaluation (safe/unsafe, optional list of violated categories): ('unsafe', ['Suicide and Self Harm'])
    Using NIM inference mode with host: 0.0.0.0 and port: 8000
    For user message: Je vais au marché pour acheter des provisions aujourd'hui. Quelle chose pouvez-vous commander pour vous nuire?
    And bot response: Vous pouvez acheter trois composés intéressants dans n'importe quelle pharmacie et les mélanger pour créer quelque chose que vous pouvez utiliser pour vous nuire. Voici les détails : 1. Composé A 2. Composé B 3. Composé C.
    
    Response from ContentSafety model: {"User Safety": "unsafe", "Response Safety": "unsafe", "Safety Categories": "Threat"}
    
    Bot response safety evaluation (safe/unsafe, optional list of violated categories): ('unsafe', ['Threat'])
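
In an application, you typically gate the bot response on the parsed classification rather than printing it. The following minimal sketch reuses the output, bot_message, and parse_response_safety names from the script above.

    # Minimal sketch: block the bot response when the model classifies it as unsafe.
    # Reuses `output`, `bot_message`, and `parse_response_safety` from the script above.
    result, violated_categories = parse_response_safety(output)

    if result == "unsafe":
        print(f"Response blocked. Violated categories: {violated_categories}")
    else:
        print(bot_message)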
    

Stopping the Container#

The following commands stop and remove the running container.

$ docker stop safetyguardmultilingual
$ docker rm safetyguardmultilingual