NeMo Guardrails Support in NVIDIA RAG Blueprint#
This guide provides step-by-step instructions to enable NeMo Guardrails for the NVIDIA RAG Blueprint, enabling you to control and safeguard LLM interactions.
Warning
B200 GPUs are not supported for NeMo Guardrails for guardrails at input/output. For this feature, use H100 or A100 GPUs instead.
NeMo Guardrails is a framework that provides safety and security measures for LLM applications. When enabled, it provides:
Content safety filtering
Topic control to prevent off-topic conversations
Jailbreak detection to prevent prompt attacks
Prerequisites#
Follow the prerequisites for your deployment method:
Current Limitations#
Currently, Helm-based deployment is not supported for NeMo Guardrails.
Currently, the Jailbreak detection model is not available.
User queries which attempt to jailbreak the system (asking the bot to behave in a certain way) may not work as expected in the current version. These jailbreak attempts could be better addressed with the NemoGuard-Jailbreak-Detect NIM Microservice, which currently does not offer out-of-the-box support.
Both the content-safety and topic-control models are trained on single-turn datasets, meaning they don’t handle multi-turn conversations as effectively. When the bot combines multiple queries and previous context, it may inconsistently flag certain phrases as safe or unsafe.
The current version of Guardrails is tuned to provide simple safe responses, such as “I’m sorry. I can’t respond to that.”
Hardware Requirements#
You need two extra GPUs (2 x H100 or 2 x A100) for deployment, as each model must be deployed on its own dedicated GPU - one for the Content Safety model and another for the Topic Control model.
The NeMo Guardrails models have specific hardware requirements:
Llama 3.1 NemoGuard 8B Content Safety Model: Requires 48 GB of GPU memory. Refer to Support Matrix.
Llama 3.1 NemoGuard 8B Topic Control Model: Requires 48 GB of GPU memory. Refer to Support Matrix.
NVIDIA developed and tested these microservices using H100 and A100 GPUs.
Deployment Option 1: Self-Hosted Microservices (Default)#
To deploy all guardrails services on your own dedicated hardware, use the following procedure.
The RAG Server must be running before you start NeMo Guardrails services.
Note
For self-hosted deployment, the default NIM service must be up and running. If you’re unable to run the NIM service locally, you can use NVIDIA’s cloud-hosted LLM by exporting the NIM endpoint URL.
# Use NVIDIA-hosted LLM export NIM_ENDPOINT_URL=https://integrate.api.nvidia.com/v1 # Or provide your own custom NIM endpoint URL # export NIM_ENDPOINT_URL=<your-custom-nim-endpoint-url>
If the RAG server uses a custom on-prem LLM endpoint through
APP_LLM_SERVERURL, setNIM_ENDPOINT_URLfor the Guardrails microservice to the same OpenAI-compatible base URL ending in/v1.Set the environment variable to enable guardrails by running the following code.
export ENABLE_GUARDRAILS=true export DEFAULT_CONFIG=nemoguard
After you update the environment variables, you must restart the RAG server by running the following code.
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Create a directory for caching models by running the following code. Ensure that you create a different one than the one used by other models of this blueprint.
mkdir -p ~/.cache/nemoguard-model-cache
Set the model directory path by running the following code.
export MODEL_DIRECTORY=~/.cache/nemoguard-model-cache
Check your available GPUs and their IDs by running the following code.
This displays all available GPUs with their IDs, memory usage, and utilization.
nvidia-smi
Use the information in the previous step to export specific GPU IDs for the guardrails services by running the following code.
# By default, the services use GPU IDs 7 and 6 # Set these to appropriate values based on your system configuration export CONTENT_SAFETY_GPU_ID=0 # Choose GPU ID for content safety model export TOPIC_CONTROL_GPU_ID=1 # Choose GPU ID for topic control model
Note
Each model requires a dedicated GPU with at least 48GB of memory. Select GPUs with sufficient available memory.
Start the NeMo Guardrails service by running the following code.
USERID=$(id -u) docker compose -f deploy/compose/docker-compose-nemo-guardrails.yaml up -d
This command starts the following services:
NeMo Guardrails microservice
Content safety model
Topic control model
(Optional) The NemoGuard services might take several minutes to fully initialize. You can monitor their status by running the following code.
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}" | grep -E "nemoguard|guardrails"'
You should see output similar to the following. Wait until all services appear as
healthybefore you proceed to the next step.llama-3.1-nemoguard-8b-topic-control Up 5 minutes (healthy) llama-3.1-nemoguard-8b-content-safety Up 5 minutes (healthy) nemo-guardrails-microservice Up 4 minutes (healthy)
Option 2: NVIDIA-Hosted Deployment#
To deploy all guardrails services using NVIDIA-hosted models, use the following procedure.
The RAG Server must be running before you start NeMo Guardrails services.
Verify that the model names in the configuration file are correct by running the following code.
cat deploy/compose/nemoguardrails/config-store/nemoguard_cloud/config.ymlEnsure that the model names in this file match the models available in your NVIDIA API account. You might need to update these names based on the specific models that you have access to.
Enable guardrails by running the following code.
# Set configuration for cloud deployment export ENABLE_GUARDRAILS=true export DEFAULT_CONFIG=nemoguard_cloud export NIM_ENDPOINT_URL=https://integrate.api.nvidia.com/v1
Start the Guardrails microservice by running the following code.
docker compose -f deploy/compose/docker-compose-nemo-guardrails.yaml up -d --no-deps nemo-guardrails-microservice
Restart the RAG server by running the following code.
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Enable Guardrails from the UI or while sending API request#
After the services are running, you can enable guardrails from the RAG UI:
Open the RAG UI.
Click Settings.
In the Output Preferences section, toggle Guardrails to on.
If you are using notebooks or APIs to interact directly with rag-server, set enable_guardrails to True in your /generate request payload.
Troubleshooting#
stream_async() Error with Output Rails#
The RAG server streams responses by default. If NeMo Guardrails is enabled with output rails, the active Guardrails configuration must also enable streaming for output rails. Otherwise, the Guardrails microservice can log the following error:
stream_async() cannot be used when output rails are configured but rails.output.streaming.enabled is False
For self-hosted Guardrails, verify that
deploy/compose/nemoguardrails/config-store/nemoguard/config.yml includes:
rails:
output:
streaming:
enabled: true
After updating the file, restart the Guardrails microservice:
docker compose -f deploy/compose/docker-compose-nemo-guardrails.yaml up -d --no-deps nemo-guardrails-microservice
The Failed to export traces to localhost:4317 messages are OpenTelemetry
export warnings. They do not cause the stream_async() failure. Start the
observability profile or disable trace export if you want to remove those
warnings.
GPU Device ID Issues#
If you encounter GPU device errors, you can customize the GPU device IDs used by the guardrails services. By default, the services use GPUs 6 and 7, but you can set specific GPUs by setting these environment variables before starting the service:
# Specify which GPUs to use for guardrail services
export CONTENT_SAFETY_GPU_ID=0 # Default is GPU 0
export TOPIC_CONTROL_GPU_ID=1 # Default is GPU 1
This allows you to control which specific GPUs are assigned to each model in multi-GPU systems.
Service Health Check#
To verify if the guardrails services are running properly:
docker ps --format "table {{.Names}}\t{{.Status}}" | grep -E "guardrails|safety|topic"
nemo-guardrails-microservice Up 19 minutes
llama-3.1-nemoguard-8b-topic-control Up 19 minutes
llama-3.1-nemoguard-8b-content-safety Up 19 minutes