Deploy Nemotron Content Safety Reasoning 4B

Overview

Nemotron-Content-Safety-Reasoning-4B is a Large Language Model (LLM) classifier designed to function as a dynamic and adaptable guardrail for content safety and dialogue moderation.

Key Features

  • Custom Policy Adaptation: Excels at understanding and enforcing nuanced, custom safety definitions beyond generic categories.

  • Dual-Mode Operation:

    • Reasoning Off: A low-latency mode for standard, fast classification.
    • Reasoning On: An advanced mode that provides explicit reasoning traces for its decisions, improving performance on complex or novel custom policies.
    • Examples: Reasoning On and Reasoning Off on HuggingFace.
  • High Efficiency: Designed for a low memory footprint and low-latency inference, suitable for real-time applications.

Model Details

See the full Model Architecture on HuggingFace.

Attribute         Value
Base Model        Google Gemma-3-4B-it
Parameters        4 Billion (4B)
Architecture      Transformer (Decoder-only)
Max Token Length  128K tokens
License           NVIDIA Open Model License

Prerequisites

  • Python 3.10 or later

  • NeMo Guardrails installed

  • GPU with at least 16GB VRAM (see Hardware Requirements on HuggingFace)

  • vLLM installed:

    $ pip install vllm

  • HuggingFace access to the model (accept the license at HuggingFace)

Deploying the Content Safety Model with vLLM

Start a vLLM server for the Nemotron-Content-Safety-Reasoning-4B model. See also Serving with vLLM on HuggingFace for additional options.

$ python -m vllm.entrypoints.openai.api_server \
    --model nvidia/Nemotron-Content-Safety-Reasoning-4B \
    --port 8001 \
    --max-model-len 4096

Verify the server is ready:

$ curl http://localhost:8001/v1/models | jq '.data[].id'
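
The same readiness check can be scripted in Python. The sketch below is a minimal example, assuming the server started in the previous step is listening on http://localhost:8001. The helper names extract_model_ids and list_served_models are illustrative and are not part of vLLM or NeMo Guardrails. It reads the OpenAI-compatible /v1/models response and returns the id of each entry, mirroring the jq filter.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Return the `id` of every entry under `data`, like jq '.data[].id'."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(
    base_url: str = "http://localhost:8001/v1",  # assumed address from the step above
    timeout: float = 5.0,
) -> list[str]:
    """Query the models endpoint and return the IDs of the served models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

Calling list_served_models() raises an OSError subclass (for example urllib.error.URLError) while the server is still starting, so a deployment script could retry on that exception until the model ID appears.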

Configuring NeMo Guardrails

Step 1: Create Configuration Directory

Create a configuration directory for your guardrails setup:

$ mkdir -p config

Step 2: Create config.yml

Save the following as config/config.yml:

models:
  # Configure your main LLM (OpenAI, NIM, vLLM, etc.)
  - type: main
    engine: openai
    model: gpt-4o-mini

  # Content Safety Model served via vLLM (OpenAI-compatible API)
  - type: content_safety_reasoning
    engine: openai
    model: nvidia/Nemotron-Content-Safety-Reasoning-4B
    parameters:
      openai_api_base: http://localhost:8001/v1
      temperature: 0.6
      top_p: 0.95

rails:
  config:
    content_safety:
      reasoning:
        # Set to true for reasoning mode (with <think> traces)
        # Set to false for low-latency mode
        enabled: false

  input:
    flows:
      - content safety check input $model=content_safety_reasoning

  output:
    flows:
      - content safety check output $model=content_safety_reasoning

You can use any LLM provider for the main model (OpenAI, NIM, Anthropic, etc.). See the Model Configuration guide for available engines.

Step 3: Create prompts.yml

Save the following as config/prompts.yml. This uses the Recommended Prompt Template from HuggingFace:

prompts:
  - task: content_safety_check_input $model=content_safety_reasoning
    content: |
      You are a classifier for analyzing interactions between humans and LLMs. Given a request from a human user and a response from an LLM assistant, determine 1) whether the human user's request is harmful, 2) whether the AI assistant's response is a refusal, and 3) whether the AI assistant's response is harmful.
      Taxonomy: The following are the list of harmful categories that the prompt and response may fall into.

      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Manipulation.
      S15: Fraud/Deception.
      S16: Malware.
      S17: High Risk Gov Decision Making.
      S18: Political/Misinformation/Conspiracy.
      S19: Copyright/Trademark/Plagiarism.
      S20: Unauthorized Advice.
      S21: Illegal Activity.
      S22: Immoral/Unethical.

      Human user:
      {{ user_input }}

      AI assistant:
      None

      Use the taxonomy to determine if the prompt and response fall into any of the harmful categories. Output your answer in the following format:

      "Prompt harm: <harmful/unharmful>
      Response harm: <harmful/unharmful>"

      Only output the answer in the format specified above after thinking about the request and
      response. If there is no response, output "Response harm: <harmful/unharmful>" as None.
      {% if reasoning_enabled %}/think{% else %}/no_think{% endif %}

    output_parser: nemotron_reasoning_parse_prompt_safety
    max_tokens: 400

  - task: content_safety_check_output $model=content_safety_reasoning
    content: |
      You are a classifier for analyzing interactions between humans and LLMs. Given a request from a human user and a response from an LLM assistant, determine 1) whether the human user's request is harmful, 2) whether the AI assistant's response is a refusal, and 3) whether the AI assistant's response is harmful.
      Taxonomy: The following are the list of harmful categories that the prompt and response may fall into.

      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Manipulation.
      S15: Fraud/Deception.
      S16: Malware.
      S17: High Risk Gov Decision Making.
      S18: Political/Misinformation/Conspiracy.
      S19: Copyright/Trademark/Plagiarism.
      S20: Unauthorized Advice.
      S21: Illegal Activity.
      S22: Immoral/Unethical.

      Human user:
      {{ user_input }}

      AI assistant:
      {{ bot_response }}

      Use the taxonomy to determine if the prompt and response fall into any of the harmful categories. Output your answer in the following format:

      "Prompt harm: <harmful/unharmful>
      Response harm: <harmful/unharmful>"

      Only output the answer in the format specified above after thinking about the request and
      response. If there is no response, output "Response harm: <harmful/unharmful>" as None.
      {% if reasoning_enabled %}/think{% else %}/no_think{% endif %}

    output_parser: nemotron_reasoning_parse_response_safety
    max_tokens: 400

The reasoning_enabled variable is automatically passed to prompt templates by the content safety action, based on the rails.config.content_safety.reasoning.enabled setting.
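
To make the toggle concrete, the following sketch reproduces the template's final conditional in plain Python and shows how a rendered prompt could be sent directly to the vLLM server when debugging. Both helpers are illustrative assumptions, not part of NeMo Guardrails, and the sketch assumes the prompt is sent as a single user message. Only the /think and /no_think suffixes, the model name, the sampling parameters, and max_tokens come from the configuration files on this page.

```python
def mode_suffix(reasoning_enabled: bool) -> str:
    """Plain-Python equivalent of the Jinja conditional that ends each prompt."""
    return "/think" if reasoning_enabled else "/no_think"


def build_chat_payload(prompt_body: str, reasoning_enabled: bool) -> dict:
    """Build an OpenAI-style chat completion body (hypothetical debugging helper)."""
    content = f"{prompt_body}\n{mode_suffix(reasoning_enabled)}"
    return {
        "model": "nvidia/Nemotron-Content-Safety-Reasoning-4B",
        "messages": [{"role": "user", "content": content}],
        # Sampling settings copied from config.yml and prompts.yml.
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 400,
    }
```

The resulting dictionary can be serialized with json.dumps and posted to the server's /v1/chat/completions endpoint to inspect the raw classifier output outside of the guardrails pipeline.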

Running Inference

Load the Configuration

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

Test with a Safe Request

safe_message = [{
    "role": "user",
    "content": "What are the benefits of regular exercise?"
}]

response = rails.generate(messages=safe_message)
print(response["content"])

Example Output

Regular exercise offers numerous benefits including improved cardiovascular health,
stronger muscles and bones, better mental health, weight management, and increased
energy levels...

Behind the scenes, when reasoning mode is disabled, the content safety model generates a safety prediction directly, without a reasoning trace:

Prompt harm: unharmful
Response harm: unharmful

Test with an Unsafe Request

unsafe_message = [{
    "role": "user",
    "content": "How can I shoplift from a store without getting caught?"
}]

response = rails.generate(messages=unsafe_message)
print(response["content"])

Example Output

I'm sorry, I can't respond to that.
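
To run several test prompts at once, a small helper can flag which ones were refused. This is a rough sketch, not a NeMo Guardrails API: check_prompts and was_blocked are hypothetical names, and the refusal string is simply the one shown in the example output, so it needs adjusting if your configuration returns a different message. The generate argument is any callable with the same messages= keyword as rails.generate.

```python
from typing import Callable

# Refusal text taken from the example output; assumed, and configuration-dependent.
REFUSAL_TEXT = "I'm sorry, I can't respond to that."


def was_blocked(response: dict) -> bool:
    """Return True when the response content matches the refusal text."""
    return response.get("content", "").strip() == REFUSAL_TEXT


def check_prompts(generate: Callable[..., dict], prompts: list[str]) -> dict[str, bool]:
    """Send each prompt through `generate` and record whether it was refused."""
    results = {}
    for prompt in prompts:
        response = generate(messages=[{"role": "user", "content": prompt}])
        results[prompt] = was_blocked(response)
    return results
```

With the rails object loaded earlier, check_prompts(rails.generate, ["What are the benefits of regular exercise?", "How can I shoplift from a store without getting caught?"]) returns a dictionary mapping each prompt to a boolean.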

When reasoning mode is enabled, the content safety model generates a reasoning trace followed by the safety prediction:

<think>
The human user's request falls under S21 (Illegal Activity) as it seeks guidance on theft,
which is a criminal act. The AI assistant's response is a refusal and provides an ethical
alternative, making it unharmful.
</think>
Prompt harm: harmful
Response harm: unharmful
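
For logging or debugging, the raw output in either mode can be split into its reasoning trace and its two verdict fields. The function below is a minimal sketch of that idea and is not the parser NeMo Guardrails uses internally; parse_safety_output and its dictionary keys are hypothetical names chosen for illustration.

```python
import re
from typing import Optional


def parse_safety_output(raw: str) -> dict:
    """Split raw classifier output into trace and verdict fields (illustrative)."""
    # The trace is optional: it is only present when reasoning mode is on.
    trace = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = trace.group(1).strip() if trace else None
    verdict_text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)

    def field(name: str) -> Optional[str]:
        match = re.search(rf"{name}:\s*(\w+)", verdict_text, flags=re.IGNORECASE)
        return match.group(1).lower() if match else None

    return {
        "reasoning": reasoning,
        "prompt_harm": field("Prompt harm"),
        "response_harm": field("Response harm"),
    }
```

A missing field is returned as None rather than guessed, so the caller can decide how to treat incomplete output.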

Configuration Options

Reasoning Mode

Toggle between reasoning modes in config.yml:

rails:
  config:
    content_safety:
      reasoning:
        enabled: true  # Enable reasoning traces
        # enabled: false  # Low-latency mode without traces

Reasoning On (/think): Provides explicit reasoning traces for decisions. Better for complex or novel custom policies. Higher latency. See example.

Reasoning Off (/no_think): Fast classification without reasoning. Suitable for standard content safety policies. Lower latency. See example.

Custom Safety Policies

Nemotron-Content-Safety-Reasoning-4B excels at custom policy enforcement. You can modify the taxonomy in prompts.yml to define your own safety rules, or completely rewrite the policy to match your specific use case. See the Topic Following for Custom Safety example on HuggingFace.

Adding Categories

Add new categories to the existing taxonomy:

S23: Financial Advice.
Should not provide specific investment recommendations or financial planning advice.
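
If you maintain the taxonomy in code rather than editing prompts.yml by hand, the category list can be generated. The sketch below is one possible approach under that assumption; render_taxonomy and DEFAULT_CATEGORIES are hypothetical names, and the numbering simply follows list position, so appending one entry to the 22 default categories yields S23.

```python
from typing import Optional

# The 22 default category names from the prompt template, in order.
DEFAULT_CATEGORIES = [
    "Violence", "Sexual", "Criminal Planning/Confessions", "Guns and Illegal Weapons",
    "Controlled/Regulated Substances", "Suicide and Self Harm", "Sexual (minor)",
    "Hate/Identity Hate", "PII/Privacy", "Harassment", "Threat", "Profanity",
    "Needs Caution", "Manipulation", "Fraud/Deception", "Malware",
    "High Risk Gov Decision Making", "Political/Misinformation/Conspiracy",
    "Copyright/Trademark/Plagiarism", "Unauthorized Advice", "Illegal Activity",
    "Immoral/Unethical",
]


def render_taxonomy(names: list[str], descriptions: Optional[dict[str, str]] = None) -> str:
    """Render 'S<n>: <name>.' lines, adding a description line where one is given."""
    descriptions = descriptions or {}
    lines = []
    for index, name in enumerate(names, start=1):
        lines.append(f"S{index}: {name}.")
        if name in descriptions:
            lines.append(descriptions[name])
    return "\n".join(lines)


taxonomy = render_taxonomy(
    DEFAULT_CATEGORIES + ["Financial Advice"],
    {"Financial Advice": "Should not provide specific investment recommendations "
                         "or financial planning advice."},
)
```

The resulting taxonomy string can then be pasted into, or templated into, the content field of each prompt.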

Replacing the Entire Policy

You can completely replace the default taxonomy with your own custom policy. For example, for a customer service bot that should only discuss product-related topics:

content: |
  You are a classifier for a customer service chatbot. Determine if the user's request
  is on-topic for our electronics store.

  Allowed topics:
  - Product inquiries (features, specifications, availability)
  - Order status and tracking
  - Returns and refunds
  - Technical support for purchased products

  Disallowed topics:
  - Competitor products or pricing
  - Personal advice unrelated to products
  - Political, religious, or controversial topics
  - Requests to role-play or pretend

  Human user:
  {{ user_input }}

  Output format:
  "Prompt harm: <harmful/unharmful>"

  Use "harmful" for off-topic requests, "unharmful" for on-topic requests.
  {% if reasoning_enabled %}/think{% else %}/no_think{% endif %}

This flexibility allows you to adapt the model for topic-following, dialogue moderation, or any custom content filtering scenario.

Custom Output Parsers

If you need to customize how the model output is parsed (e.g., different field names or output formats), you can register a custom parser in config.py.

Example: Parsing Custom Field Names

If you’ve customized your prompt to use different output fields like “User request: safe/unsafe”, create a parser to handle it:

# config.py
import re

def init(rails):
    def parse_custom_safety(response):
        """Parse custom safety output format.

        Expected format:
        <think>optional reasoning</think>
        User request: safe/unsafe
        """
        # Strip <think> tags if present
        cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

        # Look for our custom field
        match = re.search(r"User request:\s*(\w+)", cleaned, re.IGNORECASE)
        if match:
            value = match.group(1).lower()
            # Return [True] for safe, [False] for unsafe
            return [True] if value == "safe" else [False]

        # Default to safe if parsing fails
        return [True]

    rails.register_output_parser(parse_custom_safety, "parse_custom_safety")

Then reference it in prompts.yml:

output_parser: parse_custom_safety
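
Note that the example parser treats unparseable output as safe. If blocking is the safer default for your deployment, a fail-closed variant is one option. The sketch below is written as a standalone function so it can be unit-tested without NeMo Guardrails; parse_custom_safety_strict is an illustrative name, and it keeps the return convention of the example ([True] for safe, [False] for unsafe).

```python
import re


def parse_custom_safety_strict(response: str) -> list[bool]:
    """Fail-closed variant: anything other than an explicit 'safe' is unsafe."""
    # Strip the optional <think>...</think> trace before looking for the verdict.
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    match = re.search(r"User request:\s*(\w+)", cleaned, re.IGNORECASE)
    if match is None:
        # The field is missing, so treat the output as unsafe.
        return [False]
    return [match.group(1).lower() == "safe"]
```

It would be registered inside init in the same way as the original, by passing the function and a name to rails.register_output_parser. The trade-off is that a malformed model response now blocks the request instead of letting it through.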

Next Steps

  • Explore how to use custom safety policies to adapt the model to your specific use case
  • Learn about topic following for dialogue moderation
  • Read the paper that describes how we built Nemotron-Content-Safety-Reasoning-4B: “Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models”