Deploy NeMo Guardrails with Docker#

Run the microservice on your local machine using Docker Compose for experimentation.

Note

The time to complete this tutorial is approximately 20 minutes.

Prerequisites#

  • Install Docker.

  • You have an NGC API key for access to the NVIDIA NGC container registry and the model endpoints on build.nvidia.com. For more information about generating a new NGC API key, refer to Generating NGC API Keys in the NVIDIA NGC Catalog documentation. Select the NGC Catalog and Public API Endpoints permissions when you generate the key.

  • Sufficient disk space for generated artifacts (recommended: 1GB if using build.nvidia.com to serve all four NIMs, 175GB if downloading all NIMs locally).

  • Download and install the NGC CLI. Refer to Getting Started with the NGC CLI. Make sure that you set up the NGC CLI as specified in the NVIDIA NGC Catalog Documentation.
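
As a quick sanity check, you can confirm that both tools are available from your shell:

$ docker --version
$ ngc --version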

Download the Guardrails Docker Compose Stack#

  1. Log in to NVIDIA NGC using your NGC API key.

    1. Set the NGC_CLI_API_KEY environment variable with your NGC API key. The NGC CLI uses this key to authenticate with the NVIDIA NGC container registry:

      $ export NGC_CLI_API_KEY="<your-ngc-api-key>"
      
    2. Log in to the registry:

      $ docker login nvcr.io --username '$oauthtoken' --password-stdin <<< $NGC_CLI_API_KEY
      
  2. Download the Docker Compose configuration from NGC:

    $ ngc registry resource download-version "nvidia/nemo-microservices/nemo-microservices-quickstart:25.11"
    $ cd nemo-microservices-quickstart_v25.11
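
  To verify the download, list the directory contents; you should see the Docker Compose file and the bundled guardrail configurations:

    $ ls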
    

Run Guardrails Microservice with NIM Microservices#

The NeMo Guardrails Docker Compose stack includes configuration for running one application LLM NIM microservice (Llama 3.3 70B) and three NemoGuard NIM microservices (JailbreakDetection, ContentSafety, and TopicControl). Choose one of the following options to run the microservice:

  • build.nvidia.com: Choose this option to run the microservice with the NIM microservices hosted on build.nvidia.com. This doesn’t require any local resources.

  • Local NIMs: Choose this option to run the microservice with the NIM microservices hosted on your local machine. This requires a local machine with four L40, A100, or H100 GPUs with 80GB of memory.
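
If you plan to run the NIM microservices locally, you can first check which GPUs are visible on the machine:

$ nvidia-smi --query-gpu=name,memory.total --format=csv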

Option 1: build.nvidia.com

  1. Set the environment variables that the NIM microservices use to authenticate with the build.nvidia.com API:

  • NIM_API_KEY is used to authenticate with your main model when running inference through the /v1/guardrail/chat/completions and /v1/guardrail/completions Guardrails endpoints.

  • NVIDIA_API_KEY is used to authenticate with the guardrail models that evaluate user input and the main model’s output against your configured policies.

Note

Both variables are normally set to the same key. You can also place them in a .env file in the same directory as the Docker Compose file; Docker Compose automatically loads .env from the working directory.

$ export NIM_API_KEY="<your-ngc-api-key>"
$ export NVIDIA_API_KEY="<your-ngc-api-key>"
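
For example, a minimal .env file that sets both variables:

NIM_API_KEY=<your-ngc-api-key>
NVIDIA_API_KEY=<your-ngc-api-key>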
  2. Start the Guardrails service with the demonstration configuration:

    $ export NEMO_MICROSERVICES_IMAGE_REGISTRY=nvcr.io/nvidia/nemo-microservices
    $ export NEMO_MICROSERVICES_IMAGE_TAG=25.11
    $ docker compose --profile guardrails up
    
Option 2: Local NIMs

  1. Set the GUARDRAILS_CONFIG_TYPE environment variable to local. Also set the NGC_API_KEY environment variable with your NGC API key. The NeMo Guardrails microservice uses this key to authenticate with the build.nvidia.com API:

    $ export GUARDRAILS_CONFIG_TYPE=local
    $ export NGC_API_KEY="<your-ngc-api-key>"
    
  2. Start the Guardrails service with the associated NIM microservices:

    $ docker compose --profile guardrails --profile guardrails-nims up
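
With either option, you can confirm from another terminal that the containers are up. Add --profile guardrails-nims to the command if you started local NIMs:

$ docker compose --profile guardrails ps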
    

Run Inference#

After the Guardrails service starts, you can send requests to the Guardrails API endpoints at http://localhost:8080.
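
Before sending chat requests, you can list the guardrail configurations that the service loaded. This assumes the configs endpoint exposed at /v1/guardrail/configs:

$ curl http://localhost:8080/v1/guardrail/configs | jq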

The following examples show how to make an inference request to the Guardrails API endpoint with the quickstart configuration. This configuration contains input rails that run a content safety check and a topic control check on the user input, and an output rail that runs a content safety check on the model output.

The following example shows a safe user input that should not be blocked by the input or output rails.

curl -X POST http://localhost:8080/v1/guardrail/chat/completions \
   -H 'Accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "model": "meta/llama-3.3-70b-instruct",
      "messages": [
         {
            "role": "user",
            "content": "What can you do for me?"
         }
      ],
      "max_tokens": 256,
      "stream": false,
      "temperature": 1,
      "top_p": 1,
      "guardrails": {
         "config_id": "quickstart"
      }
   }' | jq

Use the Python SDK

Before running the following code, install the NeMo Microservices Python SDK.

from nemo_microservices import NeMoMicroservices


nmp_client = NeMoMicroservices(base_url="http://localhost:8080")

response = nmp_client.guardrail.chat.completions.create(
   model="meta/llama-3.3-70b-instruct",
   messages=[
      {
         "role": "user",
         "content": "What can you do for me?"
      }
   ],
   max_tokens=256,
   stream=False,
   temperature=1,
   top_p=1,
   guardrails={"config_id": "quickstart"}
)

print(response.to_json())

Example Response
{
  "id": "chatcmpl-9aa8311b-e4fe-4dfc-857e-e86b76189eef",
  "object": "chat.completion",
  "created": 1761314490,
  "model": "-",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "I can be used in a variety of ways, from helping you plan a vacation to creating art. I'm here to assist you in finding the help or information you need. My strengths include answering questions, generating text and images and even just chatting with you.",
        "role": "assistant"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0,
    "completion_tokens": 0
  },
  "guardrails_data": {
    "config_ids": [
      "quickstart"
    ]
  }
}
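
To print only the assistant message from the curl output, replace the trailing | jq with a narrower filter:

| jq -r '.choices[0].message.content'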

The following example shows an unsafe user input that should be blocked by the input rails.

curl -X POST http://localhost:8080/v1/guardrail/chat/completions \
   -H 'Accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "model": "meta/llama-3.3-70b-instruct",
      "messages": [
         {
            "role": "user",
            "content": "How can I hotwire a car that uses an electronic starter?"
         }
      ],
      "max_tokens": 256,
      "stream": false,
      "temperature": 1,
      "top_p": 1,
      "guardrails": {
         "config_id": "quickstart"
      }
   }' | jq

Use the Python SDK

Before running the following code, install the NeMo Microservices Python SDK.

from nemo_microservices import NeMoMicroservices


nmp_client = NeMoMicroservices(base_url="http://localhost:8080")

response = nmp_client.guardrail.chat.completions.create(
   model="meta/llama-3.3-70b-instruct",
   messages=[
      {
         "role": "user",
         "content": "How can I hotwire a car that uses an electronic starter?"
      }
   ],
   max_tokens=256,
   stream=False,
   temperature=1,
   top_p=1,
   guardrails={"config_id": "quickstart"}
)

print(response.to_json())

Example Response
{
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "I'm sorry, I can't respond to that.",
        "role": "assistant"
      }
    }
  ],
  "model": "-",
  "usage": {
    "completion_tokens": 0,
    "prompt_tokens": 0,
    "total_tokens": 0
  },
  "id": "chatcmpl-c7bfb9ad-b1ae-45af-a88e-6ae5d03ad9dd",
  "created": 1760298041,
  "guardrails_data": {
    "config_ids": [
      "quickstart"
    ]
  },
  "object": "chat.completion"
}
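
The fields shown above include no explicit blocked flag, so one pragmatic check is to compare the returned content against the refusal message. A minimal sketch, assuming the refusal string that the quickstart configuration returned in the example above:

# The refusal text is copied from the example response; other guardrail
# configurations may return a different message.
REFUSAL = "I'm sorry, I can't respond to that."

def was_blocked(response) -> bool:
    # response is the object returned by guardrail.chat.completions.create
    return response.choices[0].message.content.strip() == REFUSAL

print(was_blocked(response))  # True for the hotwiring request above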

Pre-defined Configurations in the Docker Compose Artifact#

The Docker Compose artifact contains two configurations: default and quickstart.

  • The default configuration has only the model configuration. This is useful for verifying connectivity to the NIM microservices.

  • The quickstart configuration has the model configuration and input and output rail configurations. This is useful for running inference with guardrails applied.
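
To try the default configuration, change config_id in the request. For example, with the Python SDK client from the inference examples (a sketch; this configuration applies no input or output rails):

response = nmp_client.guardrail.chat.completions.create(
   model="meta/llama-3.3-70b-instruct",
   messages=[{"role": "user", "content": "What can you do for me?"}],
   max_tokens=256,
   guardrails={"config_id": "default"}  # model configuration only, no rails
)
print(response.to_json())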

Stop the Guardrails Service#

Run the following command to stop the Guardrails service:

$ docker compose --profile guardrails down

If you ran the microservice with local NIMs, also run:

$ docker compose --profile guardrails-nims down

Next Steps#