Content Safety
The content safety checks in NeMo Guardrails help ensure the integrity and safety of both input and output text. You can use a variety of content safety models, such as NVIDIA’s Nemotron Content Safety model, Meta’s Llama Guard 3, and Google’s ShieldGemma.
To use the content safety check, you should:
- Include the desired content safety models in the models section of the config.yml file:
  The type is a unique identifier for the model that will be passed to the input and output rails as a parameter.
  The vLLM example above uses NeMo Guardrails’ built-in OpenAI-compatible client. Because vLLM exposes an OpenAI-compatible API, engine: openai plus parameters.base_url reaches it directly with no LangChain dependency. The legacy engine: vllm_openai with parameters.openai_api_base is only needed when running under NEMOGUARDRAILS_LLM_FRAMEWORK=langchain. For background, see Migrating to 0.22.
- Include the content safety check in the input and output rails section of the config.yml file:
  You must define the models in the models section of the config.yml file before using them in the input and output flows. The content safety check input and content safety check output flows check the input and output text, respectively. The $model parameter specifies the model to use for content safety checking. Both flows return a boolean value indicating whether the input or output text is safe. Depending on the model, they may also return a set of policy violations. Please refer to the content safety example for more details.
- Specify the prompts for each content safety check flow in the prompts.yml file. Here is an example prompt for the shieldgemma model:
  If a prompt is not defined, an exception will be raised when the configuration is loaded.
- You must specify the output parser. You can register and use your own parser, or use the off-the-shelf is_content_safe output parser as shown above.
  This parser works by checking for specific keywords in the response:
- If the response includes “safe”, the content is considered safe.
- If the response includes “unsafe” or “yes”, the content is considered unsafe.
- If the response includes “no”, the content is considered safe.
If you’re using this function for a different task with a custom prompt, you’ll need to update the logic to fit the new context. In this case, “yes” means the content should be blocked, is unsafe, or breaks a policy, while “no” means the content is safe and doesn’t break any policies.
The above is an example prompt that you can use with the content safety check input $model=shieldgemma flow. The prompt has one input variable, {{ user_input }}, which contains the user input to be moderated. The completion must be “yes” if the user input is not safe and “no” otherwise. Optionally, some models may return a set of policy violations.
The content safety check input and content safety check output rails execute the content_safety_check_input and content_safety_check_output actions, respectively.
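The keyword rules listed earlier for the is_content_safe parser can be sketched in a few lines of Python. This is only an illustration of those rules, not the library's actual implementation: the function name is_content_safe_sketch is hypothetical, and the choices to match whole words and to treat an unrecognised or empty response as unsafe are assumptions.

```python
import re


def is_content_safe_sketch(response: str) -> bool:
    """Return True when the verdict reads as safe, False otherwise.

    Hypothetical sketch of the keyword rules; not the library's parser.
    """
    # Split into lowercase alphabetic words so that "no" does not match
    # inside longer words such as "not" or "know".
    words = set(re.findall(r"[a-z]+", response.lower()))

    # Check the unsafe markers first: a response that contains any
    # unsafe marker is blocked even if it also contains "safe" or "no".
    if words & {"unsafe", "yes"}:
        return False
    if words & {"safe", "no"}:
        return True

    # Assumption: no recognisable verdict (including empty output)
    # is handled as unsafe, that is, the check fails closed.
    return False
```

A custom prompt whose answers carry a different meaning (for example, one where "yes" means the content is acceptable) would need its own parser with the mapping reversed.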
Reasoning Models as Content Safety Guards
Reasoning guard models, such as Nemotron Content Safety Reasoning and OpenAI gpt-oss-safeguard, spend output tokens on internal reasoning before emitting the safety verdict. If the configured max_tokens is too small, the reasoning phase can exhaust the budget, and the model returns empty content with finish_reason="length". In that case, the content safety actions log a warning and continue with empty output, which the parser typically treats as unsafe.
To use a reasoning guard, set max_tokens on the corresponding prompt task in prompts.yml to a value that fits both the reasoning trace and the verdict:
If max_tokens is not set on the prompt task, the action falls back to a default of 1024 tokens. Adjust this value for the model’s expected reasoning trace length.
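The two behaviours described in this section, the 1024-token fallback and the empty-content symptom of an exhausted budget, can be illustrated with a small sketch. The helper names resolve_max_tokens and budget_exhausted, and the plain dict used to represent a prompt task, are hypothetical and chosen only for illustration; they are not part of the NeMo Guardrails API.

```python
from typing import Optional

# Fallback value stated in the text above.
DEFAULT_MAX_TOKENS = 1024


def resolve_max_tokens(prompt_task: dict) -> int:
    """Return the task's max_tokens, or the default when it is not set."""
    value = prompt_task.get("max_tokens")
    return DEFAULT_MAX_TOKENS if value is None else value


def budget_exhausted(content: Optional[str], finish_reason: str) -> bool:
    """Detect the symptom described above: the model stopped because it
    hit the token limit and produced no visible verdict."""
    return finish_reason == "length" and not (content or "").strip()
```

If budget_exhausted returns True for a reasoning guard in your own diagnostics, raising max_tokens on the corresponding prompt task is the first thing to try.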
Multilingual Refusal Messages
When content safety rails block unsafe content, you can configure the NeMo Guardrails library to automatically detect the user’s input language and return refusal messages in that same language. This provides a better user experience for multilingual applications.
Supported Languages
The multilingual feature supports 9 languages:
If the detected language is not in this list, English is used as the fallback.
Installation
To use multilingual refusal messages, install the NeMo Guardrails library with the multilingual extra:
Usage
To enable multilingual refusal messages, add the multilingual configuration to your config.yml:
Custom Refusal Messages
You can customize the refusal messages for each language:
If a custom message is not provided for a detected language, the built-in default message for that language is used.
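The lookup order described above (custom message, then built-in default, with English as the fallback for unsupported languages) can be sketched as follows. Everything in this block is illustrative: the function name, the two-language table, and the message wording are placeholders rather than the library's built-in values, and applying a custom English message to the fallback case is an assumption.

```python
from typing import Mapping, Optional

# Placeholder defaults for two languages only; not the library's wording.
BUILT_IN_DEFAULTS = {
    "en": "I'm sorry, I can't respond to that.",
    "es": "Lo siento, no puedo responder a eso.",
}
FALLBACK_LANGUAGE = "en"


def pick_refusal_message(
    detected_language: str,
    custom_messages: Optional[Mapping[str, str]] = None,
) -> str:
    """Choose the refusal message for a detected language code."""
    custom = custom_messages or {}

    # Languages outside the supported set fall back to English.
    if detected_language in BUILT_IN_DEFAULTS:
        language = detected_language
    else:
        language = FALLBACK_LANGUAGE

    # A custom message wins; otherwise use the built-in default.
    if language in custom:
        return custom[language]
    return BUILT_IN_DEFAULTS[language]
```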
How It Works
When multilingual.enabled is set to true:
- The detect_language action uses the fast-langdetect library to detect the language of the user’s input
- If the content safety check blocks the input, the refusal message is returned in the detected language
- Language detection adds minimal latency (~12μs per request)
Cold Start Behavior
The fast-langdetect library downloads a language detection model on first use:
Default cache location:
fast-langdetect stores its downloaded FastText model in a temporary, OS-specific cache directory at {system_temp_dir}/fasttext-langdetect/, where system_temp_dir is whatever directory your operating system uses for temporary files:
- macOS: A sandboxed temp path such as /var/folders/<random>/T/fasttext-langdetect/
- Linux: The global temp directory /tmp/fasttext-langdetect/
- Windows: The user’s temporary directory, e.g., C:\Users\<User>\AppData\Local\Temp\fasttext-langdetect\
You can override this location via the FTLANG_CACHE environment variable.
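To check where the model will be cached on a given machine, the rule above can be reproduced with the standard library. This is a sketch of the documented behaviour, not code from fast-langdetect: the function name resolve_langdetect_cache_dir is hypothetical, and treating an empty FTLANG_CACHE value as unset is an assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_langdetect_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where the language detection model is expected."""
    env = os.environ if env is None else env

    # An explicit override takes precedence over the temp directory.
    override = env.get("FTLANG_CACHE")
    if override:
        return Path(override)

    # Default: {system_temp_dir}/fasttext-langdetect/
    return Path(tempfile.gettempdir()) / "fasttext-langdetect"


if __name__ == "__main__":
    print(resolve_langdetect_cache_dir())
```

Because the default lives under the system temp directory, it is usually lost when a container is rebuilt. Pointing FTLANG_CACHE at a path baked into the image is one way to persist the model.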
Production considerations:
- First API call may take ~10-20 seconds to download and load the full model (network-dependent)
- Subsequent calls use the cached model with ~9-12μs latency
- For container/serverless environments, consider pre-warming during startup or persisting the model cache in your container image
Accuracy
Language detection accuracy was benchmarked on two datasets:
Llama Guard-based Content Moderation
The NeMo Guardrails library provides out-of-the-box support for content moderation using Meta’s Llama Guard model.
Example usage
For more details, check out the Llama-Guard Integration page.
Third-party Content Safety APIs
NeMo Guardrails integrates with a collection of third-party managed services which offer content safety guardrails. These include:
- ActiveFence
- AutoAlign
- Clavata
- GCP Text Moderation
- Guardrails AI
- Fiddler Guardrails
- Prompt Security
- Pangea (CrowdStrike) AI Guard
See the above reference pages or Third-Party APIs for more information.