Chat with Guardrailed Model

Use the /v1/chat/completions endpoint to send messages and receive guarded responses from the server. The endpoint is compatible with the OpenAI Chat Completions API, with additional guardrails-specific fields nested under a guardrails object.

Basic Request

Send a POST request to the chat completions endpoint. The model field is required and specifies which LLM to use. Guardrails-specific fields such as config_id are nested under the guardrails object.

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Hello! What can you do for me?"}
    ],
    "guardrails": {
      "config_id": "content_safety"
    }
  }'

Response

The response follows the standard OpenAI ChatCompletion format, with an additional guardrails object containing guardrails-specific output data.

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "meta/llama-3.1-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I can help you with your questions. What would you like to know?"
      },
      "finish_reason": "stop"
    }
  ],
  "guardrails": {
    "config_id": "content_safety",
    "state": null,
    "llm_output": null,
    "output_data": null,
    "log": null
  }
}

The guardrails response object may include additional fields depending on your request options:

  • state — State object for continuing the conversation. Return this in subsequent requests to resume.
  • llm_output — Additional LLM output data (when guardrails.options.llm_output is true).
  • output_data — Values for requested context variables (when guardrails.options.output_vars is set).
  • log — Logging information (when guardrails.options.log is configured).
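As a minimal sketch of how a client might inspect these fields, the snippet below parses a trimmed copy of the sample response and keeps only the guardrails fields that are not null. The helper name summarize_guardrails is hypothetical and not part of any SDK; it only relies on the response shape shown in this section.

```python
import json

# Trimmed copy of the sample response shown earlier in this section.
SAMPLE_RESPONSE = json.loads("""
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I can help you with your questions. What would you like to know?"
      },
      "finish_reason": "stop"
    }
  ],
  "guardrails": {
    "config_id": "content_safety",
    "state": null,
    "llm_output": null,
    "output_data": null,
    "log": null
  }
}
""")


def summarize_guardrails(body: dict) -> dict:
    """Return only the guardrails fields that the server populated.

    Hypothetical helper: tolerates a missing or null guardrails object.
    """
    guardrails = body.get("guardrails") or {}
    return {key: value for key, value in guardrails.items() if value is not None}


print(summarize_guardrails(SAMPLE_RESPONSE))
# {'config_id': 'content_safety'}
```

With no options set, only config_id is populated, which matches the sample response where the other fields are null.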

Using the OpenAI Python SDK

Since the server is OpenAI-compatible, you can use the OpenAI Python SDK to interact with it. Pass guardrails-specific fields using the extra_body parameter.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # Required by OpenAI SDK but not used by the guardrails server
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Hello! What can you do for me?"}
    ],
    extra_body={
        "guardrails": {
            "config_id": "content_safety"
        }
    },
)

print(response.choices[0].message.content)

Using Python Requests

import requests

base_url = "http://localhost:8000"

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {"role": "user", "content": "Hello! What can you do for me?"}
    ],
    "guardrails": {
        "config_id": "content_safety"
    }
})

print(response.json())

Combine Multiple Configurations

You can combine multiple guardrails configurations in a single request using config_ids inside the guardrails object. Use either config_id or config_ids, not both; the two fields are mutually exclusive.

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {"role": "user", "content": "Hello!"}
    ],
    "guardrails": {
        "config_ids": ["main", "input_checking", "output_checking"]
    }
})

The configurations combine in the order specified. If there are conflicts, the last configuration takes precedence.

All configurations must use the same model type and engine.
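Because config_id and config_ids are mutually exclusive, it can help to catch the mistake on the client before sending a request. The following is a minimal sketch under that assumption; build_chat_payload is a hypothetical helper, not part of the server or any SDK, and it only assembles the JSON body described in this section.

```python
from typing import Optional, Sequence


def build_chat_payload(
    model: str,
    messages: Sequence[dict],
    config_id: Optional[str] = None,
    config_ids: Optional[Sequence[str]] = None,
) -> dict:
    """Assemble a request body, rejecting config_id and config_ids together.

    Hypothetical client-side helper; the server's own validation may differ.
    """
    if config_id is not None and config_ids is not None:
        raise ValueError("Use either config_id or config_ids, not both.")

    payload = {"model": model, "messages": list(messages)}
    if config_id is not None:
        payload["guardrails"] = {"config_id": config_id}
    elif config_ids is not None:
        # Order matters: on conflicts, the last configuration takes precedence.
        payload["guardrails"] = {"config_ids": list(config_ids)}
    # With neither field set, the guardrails object is omitted, which only
    # works if the server was started with a default configuration.
    return payload


payload = build_chat_payload(
    "meta/llama-3.1-8b-instruct",
    [{"role": "user", "content": "Hello!"}],
    config_ids=["main", "input_checking", "output_checking"],
)
print(payload["guardrails"])
# {'config_ids': ['main', 'input_checking', 'output_checking']}
```

The resulting dictionary can be passed as the json argument to requests.post.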

Example: Atomic Configurations

Create reusable atomic configurations that you can combine as needed:

  1. input_checking: Uses the self-check input rail
  2. output_checking: Uses the self-check output rail
  3. main: Uses the base LLM with no guardrails

Without input checking:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "You are stupid."}],
    "guardrails": {
        "config_id": "main"
    }
})
print(response.json()["choices"][0]["message"]["content"])
# LLM responds to the message

With input checking:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "You are stupid."}],
    "guardrails": {
        "config_ids": ["main", "input_checking"]
    }
})
print(response.json()["choices"][0]["message"]["content"])
# "I'm sorry, I can't respond to that."

The input rail blocks the inappropriate message before it reaches the LLM.

Use the Default Configuration

If the server was started with --default-config-id, you can omit the guardrails object:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {"role": "user", "content": "Hello!"}
    ]
})

Streaming Responses

Enable streaming to receive partial responses as server-sent events (SSE). Each chunk follows the OpenAI streaming format.

Using curl

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true,
    "guardrails": {
      "config_id": "content_safety"
    }
  }'

The server sends chunks in SSE format:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"meta/llama-3.1-8b-instruct","choices":[{"delta":{"content":"Once"},"index":0,"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"meta/llama-3.1-8b-instruct","choices":[{"delta":{"content":" upon"},"index":0,"finish_reason":null}]}
data: [DONE]

Using the OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",
)

stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    extra_body={
        "guardrails": {
            "config_id": "content_safety"
        }
    },
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Using Python Requests

import requests

base_url = "http://localhost:8000"

response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
        "guardrails": {
            "config_id": "content_safety"
        }
    },
    stream=True,  # Tell requests not to buffer the whole response body
)

# Each non-empty line is one raw SSE line, for example: data: {...}
for line in response.iter_lines():
    if line:
        print(line.decode())
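The loop above prints raw SSE lines. To recover only the generated text, a client can strip the data: prefix, stop at the [DONE] marker, and read delta.content from each chunk. The sketch below assumes the chunk shape shown earlier in this section and that each event fits on a single data: line; iter_content is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from SSE lines in the OpenAI streaming format.

    Hypothetical helper: assumes one JSON chunk per "data:" line.
    """
    for raw in lines:
        if not raw:
            continue  # Blank lines separate SSE events.
        line = raw.decode("utf-8")
        if not line.startswith("data:"):
            continue  # Ignore comments and other SSE fields.
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # End-of-stream marker.
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


# Sample lines modeled on the SSE output shown earlier (fields trimmed).
sample_lines = [
    b'data: {"choices":[{"delta":{"content":"Once"},"index":0,"finish_reason":null}]}',
    b"",
    b'data: {"choices":[{"delta":{"content":" upon"},"index":0,"finish_reason":null}]}',
    b"",
    b"data: [DONE]",
]

print("".join(iter_content(sample_lines)))
# Once upon
```

Against a live server, the same function can consume response.iter_lines() directly, since requests yields each line as bytes.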

Conversation Threads

Use thread_id inside the guardrails object to maintain conversation history on the server. This is useful when you can only send the latest message rather than the full history.

The thread_id must be between 16 and 255 characters long.

# First message
response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "My name is Alice."}],
    "guardrails": {
        "config_id": "content_safety",
        "thread_id": "user-session-12345678"
    }
})

# Follow-up message (server remembers the conversation)
response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is my name?"}],
    "guardrails": {
        "config_id": "content_safety",
        "thread_id": "user-session-12345678"
    }
})
# The assistant remembers "Alice"

Note: thread_id is not currently implemented in the NeMo Guardrails microservices.
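Since thread_id must be between 16 and 255 characters long, a client may want to check the length before sending a request. The sketch below is illustrative only: the helper names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
import uuid

# Length bounds stated above; inclusive bounds are an assumption.
MIN_THREAD_ID_LENGTH = 16
MAX_THREAD_ID_LENGTH = 255


def is_valid_thread_id(thread_id: str) -> bool:
    """Check that a thread ID satisfies the documented length constraint."""
    return MIN_THREAD_ID_LENGTH <= len(thread_id) <= MAX_THREAD_ID_LENGTH


def new_thread_id(prefix: str = "user-session-") -> str:
    """Generate a thread ID from a prefix plus a random 32-character hex string."""
    thread_id = f"{prefix}{uuid.uuid4().hex}"
    if not is_valid_thread_id(thread_id):
        raise ValueError("Prefix is too long; thread_id would exceed 255 characters.")
    return thread_id


print(is_valid_thread_id("user-session-12345678"))  # True (21 characters)
print(is_valid_thread_id("short-id"))               # False (8 characters)
```

Generating the ID from a random UUID keeps it above the minimum length regardless of the prefix and makes collisions between sessions unlikely.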

Configure Thread Storage

To use threads, register a datastore in the server’s config.py:

# config.py in the root of your configurations folder
from nemoguardrails.server.api import register_datastore
from nemoguardrails.server.datastore.memory_store import MemoryStore

# For testing
register_datastore(MemoryStore())

# For production, use Redis:
# from nemoguardrails.server.datastore.redis_store import RedisStore
# register_datastore(RedisStore(redis_url="redis://localhost:6379"))

To use RedisStore, install aioredis >= 2.0.1.

Thread Limitations

  • Threads are not supported in streaming mode.
  • Threads are stored indefinitely with no automatic cleanup.

Add Context

Include additional context data in your request using the context field inside the guardrails object:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "What is my account balance?"}],
    "guardrails": {
        "config_id": "content_safety",
        "context": {
            "user_id": "12345",
            "account_type": "premium"
        }
    }
})

Control Generation Options

Use the options field inside the guardrails object to control which rails are applied and what information is returned:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "guardrails": {
        "config_id": "content_safety",
        "options": {
            "rails": {
                "input": True,
                "output": True,
                "dialog": False
            },
            "log": {
                "activated_rails": True
            }
        }
    }
})

Standard OpenAI Parameters

You can also pass standard OpenAI parameters such as temperature, max_tokens, top_p, stop, presence_penalty, and frequency_penalty at the top level:

response = requests.post(f"{base_url}/v1/chat/completions", json={
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "max_tokens": 256,
    "guardrails": {
        "config_id": "content_safety"
    }
})

For complete details on generation options, see Create Chat Completion.