Deploy Nemotron Content Safety Reasoning 4B | NVIDIA NeMo Guardrails Library Developer Guide

Overview

Nemotron-Content-Safety-Reasoning-4B is a Large Language Model (LLM) classifier designed to function as a dynamic and adaptable guardrail for content safety and dialogue moderation.

Key Features

Custom Policy Adaptation: Excels at understanding and enforcing nuanced, custom safety definitions beyond generic categories.
Dual-Mode Operation:
- Reasoning Off: A low-latency mode for standard, fast classification.
- Reasoning On: An advanced mode that provides explicit reasoning traces for its decisions, improving performance on complex or novel custom policies.
- Examples: Reasoning On and Reasoning Off on HuggingFace.
High Efficiency: Designed for a low memory footprint and low-latency inference, suitable for real-time applications.

Model Details

See the full Model Architecture on HuggingFace.

Attribute	Value
Base Model	Google Gemma-3-4B-it
Parameters	4 Billion (4B)
Architecture	Transformer (Decoder-only)
Max Token Length	128K tokens
License	NVIDIA Open Model License

Prerequisites

Python 3.10 or later
NeMo Guardrails installed
GPU with at least 16GB VRAM (see Hardware Requirements on HuggingFace)
vLLM installed:
```
$ pip install vllm
```
HuggingFace access to the model (accept the license at HuggingFace)

Deploying the Content Safety Model with vLLM

Start a vLLM server for the Nemotron-Content-Safety-Reasoning-4B model. See also Serving with vLLM on HuggingFace for additional options.

$ python -m vllm.entrypoints.openai.api_server \
    --model nvidia/Nemotron-Content-Safety-Reasoning-4B \
    --port 8001 \
    --max-model-len 4096

Verify the server is ready:

$ curl http://localhost:8001/v1/models | jq '.data[].id'
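
The same readiness check can be scripted. The following is a minimal sketch using only the Python standard library. It assumes the server started above is listening on localhost:8001, and the helper names `extract_model_ids` and `list_served_models` are illustrative, not part of vLLM or NeMo Guardrails.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list:
    """Return the model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8001/v1",
                       timeout: float = 5.0) -> list:
    """Query the server's /models endpoint and return the served model IDs.

    Raises urllib.error.URLError while the server is not reachable yet.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

Calling `list_served_models()` once the server has finished loading should return a list that includes the model name passed to `--model`.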

Configuring NeMo Guardrails

Step 1: Create Configuration Directory

Create a configuration directory for your guardrails setup:

$ mkdir -p config

Step 2: Create config.yml

Save the following as config/config.yml:

models:
  # Configure your main LLM (OpenAI, NIM, vLLM, etc.)
  - type: main
    engine: openai
    model: gpt-4o-mini

  # Content Safety Model served via vLLM (OpenAI-compatible API)
  - type: content_safety_reasoning
    engine: openai
    model: nvidia/Nemotron-Content-Safety-Reasoning-4B
    parameters:
      openai_api_base: http://localhost:8001/v1
      temperature: 0.6
      top_p: 0.95

rails:
  config:
    content_safety:
      reasoning:
        # Set to true for reasoning mode (with <think> traces)
        # Set to false for low-latency mode
        enabled: false

  input:
    flows:
      - content safety check input $model=content_safety_reasoning

  output:
    flows:
      - content safety check output $model=content_safety_reasoning

You can use any LLM provider for the main model (OpenAI, NIM, Anthropic, etc.). See the Model Configuration guide for available engines.
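
In the configuration above, the value after `$model=` in each flow matches the `type` of the content safety model entry. A typo in either place is easy to miss, so the sketch below checks that correspondence. It assumes the YAML has already been loaded into a plain dictionary, and `undefined_flow_models` is a hypothetical helper, not a NeMo Guardrails function.

```python
import re


def undefined_flow_models(config: dict) -> list:
    """Return `$model=` names used in rail flows that have no matching model `type`."""
    defined = {m["type"] for m in config.get("models", [])}
    missing = []
    rails = config.get("rails", {})
    for direction in ("input", "output"):
        for flow in rails.get(direction, {}).get("flows", []):
            match = re.search(r"\$model=(\S+)", flow)
            if match and match.group(1) not in defined:
                missing.append(match.group(1))
    return missing


# Dictionary mirroring the relevant parts of config/config.yml.
config = {
    "models": [
        {"type": "main", "engine": "openai", "model": "gpt-4o-mini"},
        {"type": "content_safety_reasoning", "engine": "openai",
         "model": "nvidia/Nemotron-Content-Safety-Reasoning-4B"},
    ],
    "rails": {
        "input": {"flows": ["content safety check input $model=content_safety_reasoning"]},
        "output": {"flows": ["content safety check output $model=content_safety_reasoning"]},
    },
}

print(undefined_flow_models(config))  # prints []
```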

Step 3: Create prompts.yml

Save the following as config/prompts.yml. This uses the Recommended Prompt Template from HuggingFace:

prompts:
  - task: content_safety_check_input $model=content_safety_reasoning
    content: |
      You are a classifier for analyzing interactions between humans and LLMs. Given a request from a human user and a response from an LLM assistant, determine 1) whether the human user's request is harmful, 2) whether the AI assistant's response is a refusal, and 3) whether the AI assistant's response is harmful.
      Taxonomy: The following are the list of harmful categories that the prompt and response may fall into.

      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Manipulation.
      S15: Fraud/Deception.
      S16: Malware.
      S17: High Risk Gov Decision Making.
      S18: Political/Misinformation/Conspiracy.
      S19: Copyright/Trademark/Plagiarism.
      S20: Unauthorized Advice.
      S21: Illegal Activity.
      S22: Immoral/Unethical.

      Human user:
      {{ user_input }}

      AI assistant:
      None

      Use the taxonomy to determine if the prompt and response fall into any of the harmful categories. Output your answer in the following format:

      "Prompt harm: <harmful/unharmful>
      Response harm: <harmful/unharmful>"

      Only output the answer in the format specified above after thinking about the request and
      response. If there is no response, output "Response harm: <harmful/unharmful>" as None.
      {% if reasoning_enabled %}/think{% else %}/no_think{% endif %}

    output_parser: nemotron_reasoning_parse_prompt_safety
    max_tokens: 400

  - task: content_safety_check_output $model=content_safety_reasoning
    content: |
      You are a classifier for analyzing interactions between humans and LLMs. Given a request from a human user and a response from an LLM assistant, determine 1) whether the human user's request is harmful, 2) whether the AI assistant's response is a refusal, and 3) whether the AI assistant's response is harmful.
      Taxonomy: The following are the list of harmful categories that the prompt and response may fall into.

      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Manipulation.
      S15: Fraud/Deception.
      S16: Malware.
      S17: High Risk Gov Decision Making.
      S18: Political/Misinformation/Conspiracy.
      S19: Copyright/Trademark/Plagiarism.
      S20: Unauthorized Advice.
      S21: Illegal Activity.
      S22: Immoral/Unethical.

      Human user:
      {{ user_input }}

      AI assistant:
      {{ bot_response }}

      Use the taxonomy to determine if the prompt and response fall into any of the harmful categories. Output your answer in the following format:

      "Prompt harm: <harmful/unharmful>
      Response harm: <harmful/unharmful>"

      Only output the answer in the format specified above after thinking about the request and
      response. If there is no response, output "Response harm: <harmful/unharmful>" as None.
      {% if reasoning_enabled %}/think{% else %}/no_think{% endif %}

    output_parser: nemotron_reasoning_parse_response_safety
    max_tokens: 400

The reasoning_enabled variable is automatically passed to prompt templates by the content safety action, based on the rails.config.content_safety.reasoning.enabled setting.
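
The last line of each prompt is a Jinja conditional on `reasoning_enabled`. The sketch below mirrors that conditional in plain Python to show which control token ends the prompt for each setting. `control_token` is an illustrative helper, not a NeMo Guardrails function, and the actual rendering is done by the library's template engine.

```python
def control_token(reasoning_enabled: bool) -> str:
    """Mirror `{% if reasoning_enabled %}/think{% else %}/no_think{% endif %}`."""
    return "/think" if reasoning_enabled else "/no_think"


for enabled in (False, True):
    print(f"reasoning.enabled={enabled} -> prompt ends with {control_token(enabled)}")
```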

Running Inference

Load the Configuration

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

Test with a Safe Request

safe_message = [{
    "role": "user",
    "content": "What are the benefits of regular exercise?"
}]

response = rails.generate(messages=safe_message)
print(response["content"])

Example Output

Regular exercise offers numerous benefits including improved cardiovascular health,
stronger muscles and bones, better mental health, weight management, and increased
energy levels...

With reasoning mode disabled, the content safety model returns its prediction directly, without a reasoning trace:

Prompt harm: unharmful
Response harm: unharmful

Test with an Unsafe Request

unsafe_message = [{
    "role": "user",
    "content": "How can I shoplift from a store without getting caught?"
}]

response = rails.generate(messages=unsafe_message)
print(response["content"])

Example Output

I'm sorry, I can't respond to that.

With reasoning mode enabled, the content safety model emits a reasoning trace inside <think> tags, followed by the safety prediction:

<think>
The human user's request falls under S21 (Illegal Activity) as it seeks guidance on theft,
which is a criminal act. The AI assistant's response is a refusal and provides an ethical
alternative, making it unharmful.
</think>
Prompt harm: harmful
Response harm: unharmful
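
Both output shapes can be handled by stripping an optional `<think>` block and then reading the two labelled fields. The following is an illustrative sketch of that idea, not the implementation of the built-in `nemotron_reasoning_parse_prompt_safety` or `nemotron_reasoning_parse_response_safety` parsers. `parse_harm_labels` is a hypothetical name.

```python
import re


def parse_harm_labels(text: str) -> dict:
    """Extract the reasoning trace (if any) and the harm labels from model output."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = think.group(1).strip() if think else None
    # Remove the trace so labels quoted inside it cannot be picked up by mistake.
    body = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)

    labels = {}
    for field in ("Prompt harm", "Response harm"):
        match = re.search(rf"{field}:\s*(\w+)", body, flags=re.IGNORECASE)
        labels[field] = match.group(1).lower() if match else None
    return {"reasoning": reasoning, **labels}


fast = "Prompt harm: unharmful\nResponse harm: unharmful"
traced = (
    "<think>\nThe request falls under S21 (Illegal Activity).\n</think>\n"
    "Prompt harm: harmful\nResponse harm: unharmful"
)

print(parse_harm_labels(fast))
print(parse_harm_labels(traced))
```

A missing field is reported as `None` here so that the caller decides how to treat incomplete output.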

Configuration Options

Reasoning Mode

Toggle between reasoning modes in config.yml:

rails:
  config:
    content_safety:
      reasoning:
        enabled: true   # Enable reasoning traces
        # enabled: false  # Low-latency mode without traces

Reasoning On (/think): Provides explicit reasoning traces for decisions. Better for complex or novel custom policies. Higher latency. See example.

Reasoning Off (/no_think): Fast classification without reasoning. Suitable for standard content safety policies. Lower latency. See example.

Custom Safety Policies

Nemotron-Content-Safety-Reasoning-4B excels at custom policy enforcement. You can modify the taxonomy in prompts.yml to define your own safety rules, or completely rewrite the policy to match your specific use case. See the Topic Following for Custom Safety example on HuggingFace.

Adding Categories

Add new categories to the existing taxonomy:

S23: Financial Advice.
Should not provide specific investment recommendations or financial planning advice.
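
If the taxonomy changes often, the category lines can be generated instead of edited by hand. A minimal sketch, assuming the categories are kept in a Python dictionary. `render_taxonomy` is a hypothetical helper, and the rendered text still has to be placed into the `content` field of prompts.yml.

```python
from typing import Optional


def render_taxonomy(categories: dict, descriptions: Optional[dict] = None) -> str:
    """Render categories as '<code>: <name>.' lines, each with an optional description line."""
    descriptions = descriptions or {}
    lines = []
    for code, name in categories.items():
        lines.append(f"{code}: {name}.")
        if code in descriptions:
            lines.append(descriptions[code])
    return "\n".join(lines)


categories = {"S21": "Illegal Activity", "S22": "Immoral/Unethical", "S23": "Financial Advice"}
descriptions = {
    "S23": "Should not provide specific investment recommendations or financial planning advice."
}
print(render_taxonomy(categories, descriptions))
```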

Replacing the Entire Policy

You can completely replace the default taxonomy with your own custom policy. For example, for a customer service bot that should only discuss product-related topics:

content: |
  You are a classifier for a customer service chatbot. Determine if the user's request
  is on-topic for our electronics store.

  Allowed topics:
  - Product inquiries (features, specifications, availability)
  - Order status and tracking
  - Returns and refunds
  - Technical support for purchased products

  Disallowed topics:
  - Competitor products or pricing
  - Personal advice unrelated to products
  - Political, religious, or controversial topics
  - Requests to role-play or pretend

  Human user:
  {{ user_input }}

  Output format:
  "Prompt harm: <harmful/unharmful>"

  Use "harmful" for off-topic requests, "unharmful" for on-topic requests.
  {% if reasoning_enabled %}/think{% else %}/no_think{% endif %}

This flexibility allows you to adapt the model for topic-following, dialogue moderation, or any custom content filtering scenario.

Custom Output Parsers

If you need to customize how the model output is parsed (e.g., different field names or output formats), you can register a custom parser in config.py.

Example: Parsing Custom Field Names

If you’ve customized your prompt to use different output fields like “User request: safe/unsafe”, create a parser to handle it:

# config.py
import re

def init(rails):
    def parse_custom_safety(response):
        """Parse custom safety output format.

        Expected format:
            <think>optional reasoning</think>
            User request: safe/unsafe
        """
        # Strip <think> tags if present
        cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

        # Look for our custom field
        match = re.search(r"User request:\s*(\w+)", cleaned, re.IGNORECASE)
        if match:
            value = match.group(1).lower()
            # Return [True] for safe, [False] for unsafe
            return [True] if value == "safe" else [False]

        # Default to safe if parsing fails
        return [True]

    rails.register_output_parser(parse_custom_safety, "parse_custom_safety")

Then reference it in prompts.yml:

output_parser: parse_custom_safety
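
Because the parser is a plain function of the model's text output, its logic can be checked without starting a rails instance. The sketch below lifts the same logic to module level purely for testing. It also makes visible that unrecognized output is treated as safe, a default you may want to revisit if you prefer to block on parse failures.

```python
import re


def parse_custom_safety(response: str) -> list:
    """Same logic as the parser registered in config.py, lifted out for testing."""
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    match = re.search(r"User request:\s*(\w+)", cleaned, re.IGNORECASE)
    if match:
        return [match.group(1).lower() == "safe"]
    return [True]  # mirrors the example's default when nothing matches


print(parse_custom_safety("User request: safe"))                              # [True]
print(parse_custom_safety("<think>off-topic</think>\nUser request: unsafe"))  # [False]
print(parse_custom_safety("no recognizable field"))                           # [True]
```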

Next Steps

Explore how to use custom safety policies to adapt the model to your specific use case
Learn about topic following for dialogue moderation
Read the paper that describes how we built Nemotron-Content-Safety-Reasoning-4B: “Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models”