Chat with Guardrailed Model | NVIDIA NeMo Guardrails Library Developer Guide

Use the /v1/chat/completions endpoint to send messages and receive guarded responses from the server. The endpoint is compatible with the OpenAI Chat Completions API, with additional guardrails-specific fields nested under a guardrails object.

Basic Request

Send a POST request to the chat completions endpoint. The model field is required and specifies which LLM to use. Guardrails-specific fields such as config_id are nested under the guardrails object.

$ curl -X POST http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "meta/llama-3.1-8b-instruct",
>     "messages": [
>       {"role": "user", "content": "Hello! What can you do for me?"}
>     ],
>     "guardrails": {
>       "config_id": "content_safety"
>     }
>   }'

Response

The response follows the standard OpenAI ChatCompletion format, with an additional guardrails object containing guardrails-specific output data.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "meta/llama-3.1-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I can help you with your questions. What would you like to know?"
      },
      "finish_reason": "stop"
    }
  ],
  "guardrails": {
    "config_id": "content_safety",
    "state": null,
    "llm_output": null,
    "output_data": null,
    "log": null
  }
}

The fields of the guardrails response object are populated depending on your request options:

state: State object for continuing the conversation. Send it back in subsequent requests to resume.
llm_output: Additional LLM output data (when guardrails.options.llm_output is true).
output_data: Values for the requested context variables (when guardrails.options.output_vars is set).
log: Logging information (when guardrails.options.log is configured).
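
As a minimal sketch of reading these fields on the client side, the helper below pulls the assistant text and any populated guardrails fields out of a decoded response body. The function name summarize_response is hypothetical, and the sample dictionary only mirrors the example response shown earlier; it is not output captured from a live server.

```python
from typing import Any, Dict, Tuple


def summarize_response(body: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return the assistant text and the guardrails fields that are not null.

    Hypothetical helper: assumes `body` has the shape of the example
    response (an OpenAI-style `choices` list plus a `guardrails` object).
    """
    content = body["choices"][0]["message"]["content"]
    guardrails = body.get("guardrails") or {}
    # Keep only the fields the server actually filled in.
    populated = {key: value for key, value in guardrails.items() if value is not None}
    return content, populated


# Sample body mirroring the example response, trimmed to the fields used here.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "I can help you with your questions. What would you like to know?",
            },
            "finish_reason": "stop",
        }
    ],
    "guardrails": {
        "config_id": "content_safety",
        "state": None,
        "llm_output": None,
        "output_data": None,
        "log": None,
    },
}

text, fields = summarize_response(sample)
print(text)
print(fields)
```

With a live server, the same helper could be called on response.json() from the requests examples in this guide.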

Using the OpenAI Python SDK

Since the server is OpenAI-compatible, you can use the OpenAI Python SDK to interact with it. Pass guardrails-specific fields using the extra_body parameter.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used"  # Required by OpenAI SDK but not used by the guardrails server
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Hello! What can you do for me?"}
    ],
    extra_body={
        "guardrails": {
            "config_id": "content_safety"
        }
    }
)

print(response.choices[0].message.content)

Using Python Requests

import requests

base_url = "http://localhost:8000"

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {"role": "user", "content": "Hello! What can you do for me?"}
    ],
    "guardrails": {
        "config_id": "content_safety"
    }
})

print(response.json())

Combine Multiple Configurations

You can combine multiple guardrails configurations in a single request using config_ids inside the guardrails object. Use either config_id or config_ids, not both; the two fields are mutually exclusive.

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {"role": "user", "content": "Hello!"}
    ],
    "guardrails": {
        "config_ids": ["main", "input_checking", "output_checking"]
    }
})

The configurations combine in the order specified. If there are conflicts, the last configuration takes precedence.

All configurations must use the same model type and engine.
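
One way to avoid sending both fields by accident is to build the guardrails object through a small client-side helper. The sketch below is an illustration only: build_guardrails is a hypothetical name, and the check runs in the client before the request is sent. It does not describe how the server itself validates requests.

```python
from typing import Any, Dict, List, Optional


def build_guardrails(
    config_id: Optional[str] = None,
    config_ids: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build the `guardrails` object, rejecting config_id and config_ids together.

    Hypothetical client-side helper, not part of the NeMo Guardrails library.
    """
    if config_id is not None and config_ids is not None:
        raise ValueError("Use either config_id or config_ids, not both.")
    if config_id is not None:
        return {"config_id": config_id}
    if config_ids:
        # Preserve the caller's order: configurations combine in this order.
        return {"config_ids": list(config_ids)}
    # Neither field given: return an empty object the caller may omit.
    return {}


print(build_guardrails(config_id="content_safety"))
print(build_guardrails(config_ids=["main", "input_checking", "output_checking"]))
```

The returned dictionary can be passed as the "guardrails" value in the request body, or through extra_body when using the OpenAI Python SDK.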

Example: Atomic Configurations

Create reusable atomic configurations that you can combine as needed:

input_checking: Uses the self-check input rail
output_checking: Uses the self-check output rail
main: Uses the base LLM with no guardrails

Without input checking:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "You are stupid."}],
    "guardrails": {
        "config_id": "main"
    }
})
print(response.json()["choices"][0]["message"]["content"])
# LLM responds to the message

With input checking:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "You are stupid."}],
    "guardrails": {
        "config_ids": ["main", "input_checking"]
    }
})
print(response.json()["choices"][0]["message"]["content"])
# "I'm sorry, I can't respond to that."

The input rail blocks the inappropriate message before it reaches the LLM.

Use the Default Configuration

If the server was started with --default-config-id, you can omit the guardrails object:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {"role": "user", "content": "Hello!"}
    ]
})

Streaming Responses

Enable streaming to receive partial responses as server-sent events (SSE). Each chunk follows the OpenAI streaming format.

Using curl

$ curl -X POST http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "meta/llama-3.1-8b-instruct",
>     "messages": [{"role": "user", "content": "Tell me a story"}],
>     "stream": true,
>     "guardrails": {
>       "config_id": "content_safety"
>     }
>   }'

The server sends chunks in SSE format:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"meta/llama-3.1-8b-instruct","choices":[{"delta":{"content":"Once"},"index":0,"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"meta/llama-3.1-8b-instruct","choices":[{"delta":{"content":" upon"},"index":0,"finish_reason":null}]}
data: [DONE]

Using the OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used"
)

stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    extra_body={
        "guardrails": {
            "config_id": "content_safety"
        }
    }
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Using Python Requests

import requests

base_url = "http://localhost:8000"

# "stream": True in the JSON body asks the server for SSE chunks;
# stream=True on requests.post keeps the client from buffering the whole body.
response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
        "guardrails": {
            "config_id": "content_safety"
        }
    },
    stream=True
)

# Print each non-empty SSE line as it arrives
for line in response.iter_lines():
    if line:
        print(line.decode())
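
The requests example above prints each SSE line as raw text. To recover only the generated text, the lines can be decoded as in the sketch below. The helper name iter_content is hypothetical, and the sample lines are trimmed copies of the chunks shown in the curl section, not output captured from a live server.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text deltas from SSE lines, stopping at the [DONE] sentinel."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


# Sample lines in the SSE format shown in the curl section (fields trimmed).
sample_lines = [
    b'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Once"},"index":0,"finish_reason":null}]}',
    b"",
    b'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":" upon"},"index":0,"finish_reason":null}]}',
    b"data: [DONE]",
]

print("".join(iter_content(sample_lines)))  # prints "Once upon"
```

Against a live server, passing response.iter_lines() from the streaming request to iter_content should yield the same kind of text deltas, since iter_lines returns bytes by default.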

Conversation Threads

Use thread_id inside the guardrails object to maintain conversation history on the server. This is useful when you can only send the latest message rather than the full history.

The thread_id must be between 16 and 255 characters long.
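
A client can check the length constraint before sending a request. The sketch below is a minimal illustration: the helper names and the "thread-" prefix are hypothetical, and generating the identifier from a UUID is just one convenient way to obtain a sufficiently long unique string.

```python
import uuid

# Length bounds stated in this guide.
MIN_THREAD_ID_LEN = 16
MAX_THREAD_ID_LEN = 255


def is_valid_thread_id(thread_id: str) -> bool:
    """Return True if the thread_id length is within the documented bounds."""
    return MIN_THREAD_ID_LEN <= len(thread_id) <= MAX_THREAD_ID_LEN


def new_thread_id(prefix: str = "thread-") -> str:
    """Create a thread_id from a prefix plus 32 hex characters of a random UUID."""
    thread_id = f"{prefix}{uuid.uuid4().hex}"
    if not is_valid_thread_id(thread_id):
        raise ValueError(f"thread_id has invalid length: {len(thread_id)}")
    return thread_id


print(is_valid_thread_id("user-session-12345678"))  # True (21 characters)
print(is_valid_thread_id("short-id"))               # False (8 characters)
print(len(new_thread_id()))                         # 39
```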

# First message
response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "My name is Alice."}],
    "guardrails": {
        "config_id": "content_safety",
        "thread_id": "user-session-12345678"
    }
})

# Follow-up message (server remembers the conversation)
response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is my name?"}],
    "guardrails": {
        "config_id": "content_safety",
        "thread_id": "user-session-12345678"
    }
})
# The assistant remembers "Alice"

The thread_id field is not currently implemented in the NeMo Guardrails microservices.

Configure Thread Storage

To use threads, register a datastore in the server’s config.py:

# config.py in the root of your configurations folder
from nemoguardrails.server.api import register_datastore
from nemoguardrails.server.datastore.memory_store import MemoryStore

# For testing
register_datastore(MemoryStore())

# For production, use Redis:
# from nemoguardrails.server.datastore.redis_store import RedisStore
# register_datastore(RedisStore(redis_url="redis://localhost:6379"))

To use RedisStore, install aioredis >= 2.0.1.

Thread Limitations

Threads are not supported in streaming mode.
Threads are stored indefinitely with no automatic cleanup.
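
Because threads and streaming cannot be combined, a client may want to catch that combination before sending a request. The following sketch shows one such check. The function name check_thread_request is hypothetical, and the check is purely client-side; it makes no claim about how the server responds to such a request.

```python
from typing import Any, Dict


def check_thread_request(payload: Dict[str, Any]) -> None:
    """Raise ValueError if a request body asks for both streaming and a thread."""
    guardrails = payload.get("guardrails") or {}
    if payload.get("stream") and guardrails.get("thread_id"):
        raise ValueError("thread_id cannot be combined with stream=True.")


threaded = {
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is my name?"}],
    "guardrails": {"config_id": "content_safety", "thread_id": "user-session-12345678"},
}

check_thread_request(threaded)  # passes: no streaming requested

try:
    check_thread_request({**threaded, "stream": True})
except ValueError as err:
    print(err)
```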

Add Context

Include additional context data in your request using the context field inside the guardrails object:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is my account balance?"}],
    "guardrails": {
        "config_id": "content_safety",
        "context": {
            "user_id": "12345",
            "account_type": "premium"
        }
    }
})

Control Generation Options

Use the options field inside the guardrails object to control which rails are applied and what information is returned:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "guardrails": {
        "config_id": "content_safety",
        "options": {
            "rails": {
                "input": True,
                "output": True,
                "dialog": False
            },
            "log": {
                "activated_rails": True
            }
        }
    }
})
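
The options object can also be assembled programmatically. The sketch below is a hypothetical helper (build_options is not part of the library) that uses only the option names mentioned in this guide: rails, log.activated_rails, llm_output, and output_vars. Passing output_vars as a list of variable names is an assumption; see the API reference for the accepted types.

```python
from typing import Any, Dict, List, Optional


def build_options(
    input_rails: bool = True,
    output_rails: bool = True,
    dialog_rails: bool = True,
    output_vars: Optional[List[str]] = None,
    llm_output: bool = False,
    log_activated_rails: bool = False,
) -> Dict[str, Any]:
    """Assemble the `options` object for the `guardrails` request field."""
    options: Dict[str, Any] = {
        "rails": {
            "input": input_rails,
            "output": output_rails,
            "dialog": dialog_rails,
        }
    }
    # Optional fields are added only when requested, to keep the payload small.
    if output_vars:
        options["output_vars"] = list(output_vars)  # assumed: list of variable names
    if llm_output:
        options["llm_output"] = True
    if log_activated_rails:
        options["log"] = {"activated_rails": True}
    return options


# Reproduces the options used in the request above.
print(build_options(dialog_rails=False, log_activated_rails=True))
```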

Standard OpenAI Parameters

You can also pass standard OpenAI parameters such as temperature, max_tokens, top_p, stop, presence_penalty, and frequency_penalty at the top level:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "max_tokens": 256,
    "guardrails": {
        "config_id": "content_safety"
    }
})

For complete details on generation options, see Create Chat Completion.
