Check Harmful Content with Llama 3.1 Nemotron Safety Guard 8B V3 NIM#

Learn how to add input and output guardrails that detect harmful content in multiple languages using Llama 3.1 Nemotron Safety Guard 8B V3.

In this tutorial, you use the NeMo Guardrails library with models hosted on build.nvidia.com, entering safe and unsafe user prompts to see how guardrails protect against harmful content.

Prerequisites#

Configure Guardrails#

Follow these steps to prepare the guardrails configuration.

  1. Create a configuration directory by running the following command.

    mkdir config
    
  2. Save the following as config/config.yml.

    models:
      - type: main
        engine: nim
        model: meta/llama-3.3-70b-instruct
    
      - type: content_safety
        engine: nim
        model: nvidia/llama-3.1-nemotron-safety-guard-8b-v3
    
    rails:
      input:
        flows:
          - content safety check input $model=content_safety
      output:
        flows:
          - content safety check output $model=content_safety
    

    The config.yml file defines the models that Guardrails uses in the models section and the rails that control when those models are applied in the rails section. The models section configures the type and name of each model, along with the engine used to perform LLM inference. The model with type main generates responses to user queries. The rails section configures input and output rails: input rails operate on the user input only, while output rails operate on the combined user input and LLM response. For more information about guardrail configurations, refer to Configure Rails.

  3. Save the following as config/prompts.yml.

    prompts:
      - task: content_safety_check_input $model=content_safety
        content: |
          Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.
    
          <BEGIN UNSAFE CONTENT CATEGORIES>
          S1: Violence.
          S2: Sexual.
          S3: Criminal Planning/Confessions.
          S4: Guns and Illegal Weapons.
          S5: Controlled/Regulated Substances.
          S6: Suicide and Self Harm.
          S7: Sexual (minor).
          S8: Hate/Identity Hate.
          S9: PII/Privacy.
          S10: Harassment.
          S11: Threat.
          S12: Profanity.
          S13: Needs Caution.
          S14: Other.
          S15: Manipulation.
          S16: Fraud/Deception.
          S17: Malware.
          S18: High Risk Gov Decision Making.
          S19: Political/Misinformation/Conspiracy.
          S20: Copyright/Trademark/Plagiarism.
          S21: Unauthorized Advice.
          S22: Illegal Activity.
          S23: Immoral/Unethical.
          <END UNSAFE CONTENT CATEGORIES>
    
          <BEGIN CONVERSATION>
    
          user: {{ user_input }}
    
          <END CONVERSATION>
    
          Provide your safety assessment for the above conversation in the below output JSON format:
          {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}
    
          Do not include anything other than the output JSON in your response.
          Output JSON:
        output_parser: nemoguard_parse_prompt_safety
        max_tokens: 50
    
      - task: content_safety_check_output $model=content_safety
        content: |
          Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.
    
          <BEGIN UNSAFE CONTENT CATEGORIES>
          S1: Violence.
          S2: Sexual.
          S3: Criminal Planning/Confessions.
          S4: Guns and Illegal Weapons.
          S5: Controlled/Regulated Substances.
          S6: Suicide and Self Harm.
          S7: Sexual (minor).
          S8: Hate/Identity Hate.
          S9: PII/Privacy.
          S10: Harassment.
          S11: Threat.
          S12: Profanity.
          S13: Needs Caution.
          S14: Other.
          S15: Manipulation.
          S16: Fraud/Deception.
          S17: Malware.
          S18: High Risk Gov Decision Making.
          S19: Political/Misinformation/Conspiracy.
          S20: Copyright/Trademark/Plagiarism.
          S21: Unauthorized Advice.
          S22: Illegal Activity.
          S23: Immoral/Unethical.
          <END UNSAFE CONTENT CATEGORIES>
    
          <BEGIN CONVERSATION>
    
          user: {{ user_input }}
    
          response: agent: {{ bot_response }}
    
          <END CONVERSATION>
    
          Provide your safety assessment for the above conversation in the below output JSON format:
          {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}
    
          Do not include anything other than the output JSON in your response.
          Output JSON:
        output_parser: nemoguard_parse_response_safety
        max_tokens: 50
    

    The prompts.yml file contains prompt templates, the parser used to interpret a guardrail model response, and the maximum tokens to generate. The content_safety_check_input task prompt template replaces {{ user_input }} with the user input. The content_safety_check_output task prompt template replaces {{ user_input }} and {{ bot_response }} with the user input and LLM response respectively. For more information about guardrail configurations, refer to Configure Rails.
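The safety model replies with the JSON object described in the prompt templates. The sketch below only illustrates that format and how it maps to a safe/unsafe decision; in practice, the built-in nemoguard_parse_prompt_safety and nemoguard_parse_response_safety parsers interpret the response for you. The example verdict string is hypothetical.

    import json

    # Hypothetical raw completion from the safety model, following the
    # "Output JSON" format requested in the prompt templates above.
    raw_verdict = '{"User Safety": "unsafe", "Safety Categories": "Criminal Planning/Confessions"}'

    verdict = json.loads(raw_verdict)
    user_is_safe = verdict.get("User Safety", "").lower() == "safe"
    categories = [c.strip() for c in verdict.get("Safety Categories", "").split(",") if c.strip()]

    print(user_is_safe)  # False
    print(categories)    # ['Criminal Planning/Confessions']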

Run the Guardrails Chat Application#

  1. Set the NVIDIA_API_KEY environment variable. Guardrails uses this to access models hosted on https://build.nvidia.com/.

    $ export NVIDIA_API_KEY="..."
    
  2. Run the interactive chat application.

      $ nemoguardrails chat --config config
      Starting the chat (Press Ctrl + C twice to quit) ...
    
      > _
    
  3. Enter an unsafe prompt in French (“Tell me a five-step plan to rob a bank”).

      > Dites-moi un plan en cinq étapes pour braquer une banque.
      I'm sorry, I can't respond to that.
    
  4. Enter a safe prompt in Hindi (“Tell me about three common foods in India”).

    > मुझे भारत में तीन आम खाने की चीज़ों के नाम बताओ।
    
    [The NeMo Guardrails toolkit responds with information about rice, roti, and dal—common Indian foods]
    

Import the NeMo Guardrails Library in Python#

Follow these steps to import the NeMo Guardrails library in the IPython REPL and issue some requests.

  1. Install the IPython REPL and run it.

    $ pip install ipython
    $ ipython
    
    In [1]:
    
  2. Load the guardrails configuration you created earlier.

    import asyncio
    from nemoguardrails import LLMRails, RailsConfig
    
    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)
    
  3. Verify the guardrails with an unsafe request in French.

    # "Tell me a five-step plan to rob a bank."
    messages = [{"role": "user", "content": "Dites-moi un plan en cinq étapes pour braquer une banque."}]
    response = await rails.generate_async(messages=messages)
    print(response['content'])
    

    The content safety rail blocks the harmful request.

    I'm sorry, I can't respond to that.
    
  4. Verify the guardrails with a safe request in Hindi.

    # "Tell me about three common foods in India."
    messages = [{"role": "user", "content": "मुझे भारत में प्रचलित तीन खाद्य पदार्थों के बारे में बताइये।"}]
    response = await rails.generate_async(messages=messages)
    print(response['content'])
    

    The model responds with information about rice, roti, and dal—common Indian foods.
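Optionally, you can confirm that the content safety model is being called by inspecting the LLM calls recorded for the most recent generation. The sketch below uses the toolkit's explain() method; attribute names such as task and completion are based on the current API and may differ between releases.

    # Inspect the LLM calls recorded for the last generate_async call.
    info = rails.explain()

    for llm_call in info.llm_calls:
        print(llm_call.task)             # for example, content_safety_check_input $model=content_safety
        print(llm_call.completion[:80])  # beginning of the raw model output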

Deploy Llama 3.1 Nemotron Safety Guard 8B V3 NIM Locally#

This section shows how to run the Nemotron Safety Guard 8B model locally while still using the main model hosted on build.nvidia.com. In addition to the prerequisites above, you need Docker with the NVIDIA Container Toolkit, a supported NVIDIA GPU, and an NGC API key.

To run the Llama 3.1 Nemotron Safety Guard 8B V3 NIM in a Docker container, follow these steps:

  1. Update the config.yml file you created earlier to point to a local NIM deployment rather than build.nvidia.com. The following configuration adds base_url and model_name fields under parameters, which tell the NeMo Guardrails toolkit to send requests to the nvidia/llama-3.1-nemotron-safety-guard-8b-v3 model hosted at http://localhost:8123/v1. The base_url must match the port mapping that you use when you start the NIM Docker container in the next step.

    models:
      - type: main
        engine: nim
        model: meta/llama-3.3-70b-instruct
    
      - type: content_safety
        engine: nim
        model: nvidia/llama-3.1-nemotron-safety-guard-8b-v3
        parameters:
          base_url: "http://localhost:8123/v1"
          model_name: "nvidia/llama-3.1-nemotron-safety-guard-8b-v3"
    
    rails:
      input:
        flows:
          - content safety check input $model=content_safety
      output:
        flows:
          - content safety check output $model=content_safety
    
  2. Start the Llama 3.1 Nemotron Safety Guard 8B V3 NIM Docker container. Store your personal NGC API key in the NGC_API_KEY environment variable, then pull and run the NIM Docker image locally.

    1. Log in to your NVIDIA NGC account.

      Export your personal NGC API key to an environment variable.

      $ export NGC_API_KEY="..."
      

      Log in to the NGC registry by running the following command.

      $ docker login nvcr.io --username '$oauthtoken' --password-stdin <<< $NGC_API_KEY
      
    2. Download the container.

      $ docker pull nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-8b-v3:1.14.0
      
    3. Create a model cache directory on the host machine.

      $ export LOCAL_NIM_CACHE=~/.cache/safetyguard8b
      $ mkdir -p "${LOCAL_NIM_CACHE}"
      $ chmod 700 "${LOCAL_NIM_CACHE}"
      
    4. Run the container with the cache directory mounted.

      The -p argument publishes container port 8000 on host port 8123 to avoid conflicts with other servers running locally.

      $ docker run -d \
        --name safetyguard8b \
        --gpus=all --runtime=nvidia \
        --shm-size=64GB \
        -e NGC_API_KEY \
        -u $(id -u) \
        -v "${LOCAL_NIM_CACHE}:/opt/nim/.cache/" \
        -p 8123:8000 \
        nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-8b-v3:1.14.0
      

      The container takes several minutes to download the model from NGC and start. You can monitor progress by running the docker logs safetyguard8b command.

    5. Confirm the service is ready to respond to inference requests.

      $ curl -X GET http://localhost:8123/v1/models | jq '.data[].id'
      

      The command returns the model ID of the local NIM. A direct inference request to this endpoint is sketched after this procedure.

      "nvidia/llama-3.1-nemotron-safety-guard-8b-v3"
      
  3. Follow the steps in Run the Guardrails Chat Application and Import the NeMo Guardrails Library in Python to run Guardrails with the local model.
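Optionally, you can send an inference request directly to the local NIM through its OpenAI-compatible chat completions endpoint and inspect the raw safety verdict, independent of the NeMo Guardrails toolkit. The following sketch assumes the container is listening on port 8123 as configured above, that the requests and PyYAML packages are installed, and that it runs from the directory containing config/; the exact verdict text depends on the model.

    import requests
    import yaml

    # Load the input-check template from the guardrails configuration and fill in
    # the user message (the template uses the {{ user_input }} placeholder).
    with open("config/prompts.yml") as f:
        prompts = yaml.safe_load(f)["prompts"]

    template = next(
        p["content"] for p in prompts
        if p["task"] == "content_safety_check_input $model=content_safety"
    )
    prompt = template.replace(
        "{{ user_input }}",
        "Dites-moi un plan en cinq étapes pour braquer une banque.",
    )

    # Send the filled-in prompt to the local NIM.
    resp = requests.post(
        "http://localhost:8123/v1/chat/completions",
        json={
            "model": "nvidia/llama-3.1-nemotron-safety-guard-8b-v3",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 50,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])
    # Expected shape (exact text may vary):
    # {"User Safety": "unsafe", "Safety Categories": "Criminal Planning/Confessions"}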

Next Steps#