Guardrailing Bot Reasoning Content

Reasoning-capable large language models (LLMs) expose their internal thought process as reasoning traces. These traces reveal how the model arrives at its conclusions, providing transparency into the decision-making process. However, they may also contain sensitive information or problematic reasoning patterns that need to be monitored and controlled.

The NeMo Guardrails library extracts these reasoning traces so that you can inspect and control them. With this feature, you can configure guardrails that block responses based on the model’s reasoning process, enhance moderation decisions with reasoning context, or monitor reasoning patterns.

This guide uses Colang 1.0 syntax. Bot reasoning guardrails are currently supported only in Colang 1.0.

The examples in this guide range from minimal toy examples (for understanding concepts) to complete reference implementations. They are meant to teach you how to access and work with bot_thinking in different contexts, not to serve as production-ready code to copy and paste. Adapt these patterns to your specific use case with appropriate validation, error handling, and business logic for your application.

Accessing Reasoning Content

When an LLM generates a response with reasoning traces, the NeMo Guardrails library extracts the reasoning and makes it available through the bot_thinking variable. You can use this variable in the following ways.

In Colang Flows

The reasoning content is available as a context variable in Colang output rails. For example, in config/rails.co, you can set up a flow to capture the reasoning content by setting the $captured_reasoning variable to $bot_thinking.

define flow check_reasoning
  if $bot_thinking
    $captured_reasoning = $bot_thinking

In Custom Actions

When you write Python action functions in config/actions.py, you can access the reasoning through the context dictionary. For example, the following action function reads the reasoning with context.get("bot_thinking") and returns False if it contains the word "sensitive"; otherwise, it returns True.

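from typing import Optional
from nemoguardrails.actions import action

@action(is_system_action=True)
async def check_reasoning(context: Optional[dict] = None):
    # context defaults to None, so fall back to an empty dict before the lookup.
    bot_thinking = (context or {}).get("bot_thinking")
    if bot_thinking and "sensitive" in bot_thinking:
        return False
    return True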

In Prompt Templates

When you render prompts for LLM tasks such as self check output, the reasoning is available as a Jinja2 template variable. For example, in prompts.yml, you can set up a prompt that includes the reasoning alongside the bot message, when reasoning is available, and asks the LLM whether the response should be blocked.

prompts:
  - task: self_check_output
    content: |
      Bot message: "{{ bot_response }}"

      {% if bot_thinking %}
      Bot reasoning: "{{ bot_thinking }}"
      {% endif %}

      Should this be blocked (Yes or No)?

Always check if reasoning exists before using it, as not all models provide reasoning traces.
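In Python actions, the existence check can be centralized in a small helper. The following is a minimal sketch in plain Python; get_reasoning is a hypothetical helper name, not part of the NeMo Guardrails API, and it assumes only that the context is a dictionary (or None) that may contain a "bot_thinking" key.

```python
from typing import Optional


def get_reasoning(context: Optional[dict] = None) -> Optional[str]:
    """Return the reasoning trace as a stripped string, or None if absent.

    Hypothetical helper: handles a missing context, a missing key,
    a non-string value, and an empty or whitespace-only trace.
    """
    if not context:
        return None
    value = context.get("bot_thinking")
    if not isinstance(value, str):
        return None
    value = value.strip()
    return value or None
```

An action can then call get_reasoning(context) once and treat a None result as "no reasoning available" instead of repeating the check in every rail.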

Guardrailing with Output Rails

You can use the $bot_thinking variable in output rails to inspect and control responses based on reasoning content.

Write a basic pattern-matching flow that uses the $bot_thinking variable in config/rails.co as follows:

define bot refuse to respond
  "I'm sorry, I can't respond to that."

define flow block_sensitive_reasoning
  if $bot_thinking
    if "confidential" in $bot_thinking or "internal only" in $bot_thinking
      bot refuse to respond
      stop

Add the flow to your output rails in config.yml as follows:

rails:
  output:
    flows:
      - block_sensitive_reasoning

This flow demonstrates basic pattern matching for learning purposes. Production implementations should use more comprehensive validation and account for edge cases.
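One edge case worth handling is variation in letter case and whitespace, which a plain substring check can miss. The following is a minimal sketch of a more tolerant matcher in plain Python that could back a custom action; the phrase list and the reasoning_is_blocked name are illustrative assumptions, not part of the library.

```python
import re

# Hypothetical phrase list; replace with terms from your own policy.
BLOCKED_PHRASES = ["confidential", "internal only"]

# One case-insensitive pattern: \b avoids matching inside longer words,
# and \s+ tolerates line breaks or repeated spaces inside a phrase.
_BLOCKED_RE = re.compile(
    r"\b(?:"
    + "|".join(
        r"\s+".join(re.escape(word) for word in phrase.split())
        for phrase in BLOCKED_PHRASES
    )
    + r")\b",
    re.IGNORECASE,
)


def reasoning_is_blocked(bot_thinking):
    """Return True if the reasoning trace contains a blocked phrase."""
    if not bot_thinking:
        return False
    return _BLOCKED_RE.search(bot_thinking) is not None
```

The word boundaries are a design trade-off: they prevent matches inside unrelated longer words, but they also skip inflected forms such as "confidentiality", so list those explicitly if your policy needs them.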

Guardrailing with Custom Actions

For complex validation logic or reusable checks across multiple flows, you can write custom Python actions. This approach provides better code organization and makes it easier to share validation logic across different guardrails.

Write the custom action function in config/actions.py as follows:

from typing import Optional
from nemoguardrails.actions import action

@action(is_system_action=True)
async def check_reasoning_quality(context: Optional[dict] = None):
    # context defaults to None, so fall back to an empty dict before the lookup.
    bot_thinking = (context or {}).get("bot_thinking")

    # No reasoning trace available: nothing to check, so allow the response.
    if not bot_thinking:
        return True

    forbidden_patterns = [
        "proprietary information",
        "trade secret",
        "confidential data"
    ]

    # Case-insensitive substring match against each forbidden pattern.
    for pattern in forbidden_patterns:
        if pattern.lower() in bot_thinking.lower():
            return False

    return True

Write the flow that uses the custom action function in config/rails.co as follows:

define bot refuse to respond
  "I'm sorry, I can't respond to that."

define flow quality_check_reasoning
  $is_safe = execute check_reasoning_quality
  if not $is_safe
    bot refuse to respond
    stop

Add the flow to your output rails in config.yml.

rails:
  output:
    flows:
      - quality_check_reasoning
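Before wiring the action into a rail, it can help to exercise its logic in isolation. The following is a minimal sketch that reproduces the check as a plain coroutine, without the @action decorator so that it runs without the library installed, and adds a guard for a missing context. The run_check wrapper is a hypothetical convenience, and the hand-built context dictionary is an assumption about what the action receives at runtime.

```python
import asyncio
from typing import Optional

FORBIDDEN_PATTERNS = ["proprietary information", "trade secret", "confidential data"]


async def check_reasoning_quality(context: Optional[dict] = None) -> bool:
    """Pattern check from the action above, minus the @action decorator."""
    bot_thinking = (context or {}).get("bot_thinking")
    if not bot_thinking:
        return True  # no reasoning trace, nothing to block
    lowered = bot_thinking.lower()
    return not any(pattern in lowered for pattern in FORBIDDEN_PATTERNS)


def run_check(bot_thinking: Optional[str]) -> bool:
    """Run the coroutine synchronously with a hand-built context dict."""
    return asyncio.run(check_reasoning_quality({"bot_thinking": bot_thinking}))
```

For example, run_check("This involves a Trade Secret.") returns False, while run_check("Summing two numbers.") and run_check(None) both return True.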

Using Reasoning in Self-Check Output

The following steps show how to use bot_thinking in a self-check output rail. This pattern provides reasoning traces to your moderation LLM, allowing it to make more informed decisions by evaluating both the response and the reasoning process.

This extends the self check thinking configuration examples in the NeMo Guardrails library repository.

Write the config.yml file as follows:

models:
  - type: main
    engine: <your_engine>
    model: <your_reasoning_model>
  - type: self_check_output
    model: <your_moderation_model>
    engine: <your_engine>

rails:
  output:
    flows:
      - self check output

Write the prompts.yml file as follows:

prompts:
  - task: self_check_output
    content: |
      Your task is to check if the bot message below complies with the company policy.

      Company policy for the bot:
      - messages should not contain any explicit content
      - messages should not contain abusive language or offensive content
      - messages should not contain any harmful content
      - messages should not contain racially insensitive content
      - if a message is a refusal, should be polite

      Bot message: "{{ bot_response }}"

      {% if bot_thinking %}
      Bot thinking/reasoning: "{{ bot_thinking }}"
      {% endif %}

      Question: Should the message be blocked (Yes or No)?
      Answer:

The {% if bot_thinking %} conditional ensures that the prompt works with both reasoning and non-reasoning models. When reasoning is available, the self-check LLM can evaluate both the final response and the reasoning process.
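To see the effect of the conditional, the following sketch mimics it in plain Python rather than rendering the real template. It is an illustration only: build_self_check_prompt is a hypothetical function, the policy text is omitted for brevity, and the library itself renders the prompt with Jinja2.

```python
from typing import Optional


def build_self_check_prompt(bot_response: str, bot_thinking: Optional[str] = None) -> str:
    """Mimic the template's conditional in plain Python (illustration only)."""
    lines = [f'Bot message: "{bot_response}"', ""]
    if bot_thinking:  # same truthiness test as {% if bot_thinking %}
        lines += [f'Bot thinking/reasoning: "{bot_thinking}"', ""]
    lines += ["Question: Should the message be blocked (Yes or No)?", "Answer:"]
    return "\n".join(lines)
```

Calling the function with bot_thinking set to None or an empty string yields a prompt with no reasoning line, which corresponds to the non-reasoning model case.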

Use the following guides to learn more about the features used in this guide.

Model Configuration: Configure LLM models in config.yml.
Self-Check Output Example: Complete working configuration example in the NeMo Guardrails library repository.
Custom Actions: Guide on writing custom actions.