Chat with Guardrailed Model
Use the /v1/chat/completions endpoint to send messages and receive guarded responses from the server.
The endpoint is compatible with the OpenAI Chat Completions API,
with additional guardrails-specific fields nested under a guardrails object.
Basic Request
Send a POST request to the chat completions endpoint.
The model field is required and specifies which LLM to use.
Guardrails-specific fields such as config_id are nested under the guardrails object.
Response
The response follows the standard OpenAI ChatCompletion format, with an additional guardrails object containing guardrails-specific output data.
The guardrails response object may include additional fields depending on your request options:
state— State object for continuing the conversation. Return this in subsequent requests to resume.llm_output— Additional LLM output data (whenguardrails.options.llm_outputistrue).output_data— Values for requested context variables (whenguardrails.options.output_varsis set).log— Logging information (whenguardrails.options.logis configured).
Using the OpenAI Python SDK
Since the server is OpenAI-compatible, you can use the OpenAI Python SDK to interact with it.
Pass guardrails-specific fields using the extra_body parameter.
Using Python Requests
Combine Multiple Configurations
You can combine multiple guardrails configurations in a single request using config_ids inside the guardrails object.
Use either config_id or config_ids, but not both — they are mutually exclusive.
The configurations combine in the order specified. If there are conflicts, the last configuration takes precedence.
All configurations must use the same model type and engine.
Example: Atomic Configurations
Create reusable atomic configurations that you can combine as needed:
input_checking: Uses the self-check input railoutput_checking: Uses the self-check output railmain: Uses the base LLM with no guardrails
Without input checking:
With input checking:
The input rail blocks the inappropriate message before it reaches the LLM.
Use the Default Configuration
If the server was started with --default-config-id, you can omit the guardrails object:
Streaming Responses
Enable streaming to receive partial responses as server-sent events (SSE). Each chunk follows the OpenAI streaming format.
Using curl
The server sends chunks in SSE format:
Using the OpenAI Python SDK
Using Python Requests
Conversation Threads
Use thread_id inside the guardrails object to maintain conversation history on the server.
This is useful when you can only send the latest message rather than the full history.
The thread_id must be between 16 and 255 characters long.
The thread_id is currently not implemented in the NeMo Guardrails microservices.
Configure Thread Storage
To use threads, register a datastore in the server’s config.py:
To use RedisStore, install aioredis >= 2.0.1.
Thread Limitations
- Threads are not supported in streaming mode.
- Threads are stored indefinitely with no automatic cleanup.
Add Context
Include additional context data in your request using the context field inside the guardrails object:
Control Generation Options
Use the options field inside the guardrails object to control which rails are applied and what information is returned:
Standard OpenAI Parameters
You can also pass standard OpenAI parameters such as temperature, max_tokens, top_p, stop, presence_penalty, and frequency_penalty at the top level:
For complete details on generation options, see Create Chat Completion.