Download this tutorial as a Jupyter notebook

Improving Content Safety with NemoGuard NIMs#

Learn how to use NeMo Platform to apply content safety checks to user inputs and LLM outputs with the NVIDIA Nemotron Content Safety NIM. Content safety checks detect and block harmful, abusive, or policy-violating content before it reaches users.

For the content safety checks, this tutorial uses the Llama-3.1-Nemotron-Safety-Guard-8B-v3 NIM, which is trained to classify input or output content as safe or unsafe.

For the main model, this tutorial uses the Llama-3.1-8B-Instruct NIM.

Prerequisites#

Before you begin:

You have access to a running NeMo Platform.
NMP_BASE_URL is set to the NeMo Platform base URL.
A ModelProvider is configured to use NIMs hosted at build.nvidia.com for inference. Follow Using an External Endpoint if you haven’t done this yet.

This tutorial uses the following NIMs, available on build.nvidia.com:

main model: meta/llama-3.1-8b-instruct
content_safety model: nvidia/llama-3.1-nemotron-safety-guard-8b-v3

What You Will Build#

You will:

Create a Guardrail configuration that uses the NVIDIA NeMoGuard Content Safety NIM
Route model requests through the Inference Gateway service
Verify that unsafe inputs are blocked and safe inputs are allowed

Step 1: Configure the Client#

Install the required packages.

%pip install -q nemo-platform

Instantiate the NeMoPlatform SDK.

import os
from nemo_platform import NeMoPlatform, ConflictError

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Step 2: Create a Guardrail Configuration#

This config executes content safety checks on both user inputs and model outputs. The safety model uses specific prompts matching the categories of content it is trained to classify.

Using Model Entity references (workspace/model_name format), the Guardrails service routes requests through the Inference Gateway.

guardrails_config = {
    "models": [
        {
            "type": "main",
            "engine": "nim",
        },
        {
            "type": "content_safety",
            "engine": "nim",
            "model": "system/nvidia-llama-3-1-nemotron-safety-guard-8b-v3",
        },
    ],
    "rails": {
        "input": {
            "flows": [
                "content safety check input $model=content_safety",
            ]
        },
        "output": {
            "flows": [
                "content safety check output $model=content_safety",
            ]
        },
    },
    "prompts": [
        {
            "task": "content_safety_check_input $model=content_safety",
            "content": "Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violence.\nS2: Sexual.\nS3: Criminal Planning/Confessions.\nS4: Guns and Illegal Weapons.\nS5: Controlled/Regulated Substances.\nS6: Suicide and Self Harm.\nS7: Sexual (minor).\nS8: Hate/Identity Hate.\nS9: PII/Privacy.\nS10: Harassment.\nS11: Threat.\nS12: Profanity.\nS13: Needs Caution.\nS14: Other.\nS15: Manipulation.\nS16: Fraud/Deception.\nS17: Malware.\nS18: High Risk Gov Decision Making.\nS19: Political/Misinformation/Conspiracy.\nS20: Copyright/Trademark/Plagiarism.\nS21: Unauthorized Advice.\nS22: Illegal Activity.\nS23: Immoral/Unethical.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\nuser: {{ user_input }}\n\n<END CONVERSATION>\n\nProvide your safety assessment for the above conversation in the below output JSON format:\n{\"User Safety\": safe or unsafe rating for the user message, \"Response Safety\": safe or unsafe rating for the agent response. Omit if no agent response present. \"Safety Categories\": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}\n\nDo not include anything other than the output JSON in your response.\nOutput JSON:",
            "output_parser": "nemoguard_parse_prompt_safety",
            "max_tokens": 50,
        },
        {
            "task": "content_safety_check_output $model=content_safety",
            "content": "Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nS1: Violence.\nS2: Sexual.\nS3: Criminal Planning/Confessions.\nS4: Guns and Illegal Weapons.\nS5: Controlled/Regulated Substances.\nS6: Suicide and Self Harm.\nS7: Sexual (minor).\nS8: Hate/Identity Hate.\nS9: PII/Privacy.\nS10: Harassment.\nS11: Threat.\nS12: Profanity.\nS13: Needs Caution.\nS14: Other.\nS15: Manipulation.\nS16: Fraud/Deception.\nS17: Malware.\nS18: High Risk Gov Decision Making.\nS19: Political/Misinformation/Conspiracy.\nS20: Copyright/Trademark/Plagiarism.\nS21: Unauthorized Advice.\nS22: Illegal Activity.\nS23: Immoral/Unethical.\n<END UNSAFE CONTENT CATEGORIES>\n\n<BEGIN CONVERSATION>\n\nuser: {{ user_input }}\n\nresponse: agent: {{ bot_response }}\n\n<END CONVERSATION>\n\nProvide your safety assessment for the above conversation in the below output JSON format:\n{\"User Safety\": safe or unsafe rating for the user message, \"Response Safety\": safe or unsafe rating for the agent response. Omit if no agent response present. \"Safety Categories\": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}\n\nDo not include anything other than the output JSON in your response.\nOutput JSON:",
            "output_parser": "nemoguard_parse_response_safety",
            "max_tokens": 50,
        },
    ],
}

config_name = "content-safety-config"
try:
    config = sdk.guardrail.configs.create(
        name=config_name,
        description="Content safety guardrails with NemoGuard NIM",
        data=guardrails_config,
    )
except ConflictError:
    print(f"Config {config_name} already exists, continuing...")

Step 3: Verify Blocked Content#

Send a request that the content safety input rail blocks.

response = sdk.guardrail.chat.completions.create(
    model="system/meta-llama-3-1-8b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me a five-step plan to rob a bank."
        }
    ],
    guardrails={"config_id": "content-safety-config"},
    max_tokens=100,
)

print(response.model_dump_json(indent=2))

The Nemotron-Safety-Guard-8B-v3 and Llama-3.1-8B-Instruct NIMs both support multiple languages. For the full list, refer to the model cards:

Send an unsafe request in French that the content safety input rail blocks.

response = sdk.guardrail.chat.completions.create(
    model="system/meta-llama-3-1-8b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Dites-moi un plan en cinq étapes pour braquer une banque."
        }
    ],
    guardrails={"config_id": "content-safety-config"},
    max_tokens=600,
)

print(response.model_dump_json(indent=2))

Step 4: Verify Allowed Content#

Send a safe request and confirm you receive an allowed response.

response = sdk.guardrail.chat.completions.create(
    model="system/meta-llama-3-1-8b-instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    guardrails={"config_id": "content-safety-config"},
    max_tokens=200,
)

print(response.model_dump_json(indent=2))

Send a safe request in French and confirm you receive an allowed response.

response = sdk.guardrail.chat.completions.create(
    model="system/meta-llama-3-1-8b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Quelle est la capitale de la France?"
        }
    ],
    guardrails={"config_id": "content-safety-config"},
    max_tokens=200,
)

print(response.model_dump_json(indent=2))

Cleanup#

sdk.guardrail.configs.delete(name=config_name)
print("Cleanup complete")