Output Rail Streaming Configuration

Use the rails.output.streaming section to configure how output rails process streamed tokens.

Configuration

rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True
      chunk_size: 200
      context_size: 50
      stream_first: True

Parameters

Parameter     Type  Default  Description
enabled       bool  False    Must be True to use stream_async() with output rails
chunk_size    int   200      Number of tokens per chunk that output rails process
context_size  int   50       Tokens carried over between chunks for continuity
stream_first  bool  True     If True, the client receives tokens before output rails run on the chunk

Tips for Setting Parameters

enabled

When you configure output rails and want to use stream_async(), set this to True.

If not enabled, you receive an error:

stream_async() cannot be used when output rails are configured but
rails.output.streaming.enabled is False. Either set
rails.output.streaming.enabled to True in your configuration, or use
generate_async() instead of stream_async().
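
The rule behind this error can be checked before any request is made. The following is a minimal sketch that inspects an already-parsed configuration dictionary; the helper name can_use_stream_async is hypothetical and not part of the nemoguardrails package, and the dictionary layout is assumed to mirror the YAML shown above.

```python
def can_use_stream_async(config: dict) -> bool:
    """Return False when output rails are configured but streaming is off.

    Hypothetical helper: `config` is assumed to be the parsed YAML as a dict.
    """
    output = (config.get("rails") or {}).get("output") or {}
    streaming = output.get("streaming") or {}
    has_output_rails = bool(output.get("flows"))
    # `enabled` defaults to False, as listed in the parameter table.
    streaming_enabled = bool(streaming.get("enabled", False))
    return (not has_output_rails) or streaming_enabled


cfg = {"rails": {"output": {"flows": ["self check output"]}}}
print(can_use_stream_async(cfg))  # streaming.enabled is missing
cfg["rails"]["output"]["streaming"] = {"enabled": True}
print(can_use_stream_async(cfg))
```

When the check fails, either enable rails.output.streaming.enabled or fall back to generate_async(), as the error message suggests.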

chunk_size

The number of tokens buffered before output rails run.

  • Larger values: Fewer rail executions, but higher latency to first output
  • Smaller values: More rail executions, but faster time-to-first-token

Default: 200 tokens
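
To get a feel for this trade-off, the sketch below estimates how many times output rails run for a response of a given length. It assumes one run per full chunk plus one run for a trailing partial chunk, which this page does not state explicitly; the helper name rail_executions is hypothetical.

```python
import math


def rail_executions(total_tokens: int, chunk_size: int) -> int:
    """Estimate how many times output rails run for one response.

    Assumption: one run per full chunk, plus one run for a trailing
    partial chunk when the stream ends.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(total_tokens / chunk_size)


# Compare a few chunk sizes for a 1000-token response.
for size in (50, 200, 300):
    print(f"chunk_size={size}: {rail_executions(1000, size)} rail runs")
```

Under these assumptions, a 1000-token response triggers 20 rail runs at chunk_size 50 but only 5 at the default of 200.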

context_size

The number of tokens from the previous chunk carried over to provide context for the next chunk.

This helps output rails make consistent decisions across chunk boundaries. For example, if a sentence spans two chunks, the context ensures the rail can evaluate the complete sentence.

Default: 50 tokens

stream_first

Controls when tokens are streamed relative to output rail processing:

  • True (default): The client receives each chunk of tokens before output rails process that chunk. This provides faster time-to-first-token, but if a rail blocks the content, the user has already received the tokens. The stream terminates with a JSON error on violation.
  • False: Output rails process each chunk before the client receives tokens. The user never sees blocked content, but time-to-first-token increases by the rail execution time per chunk.

Requirements

Output rail streaming requires calling the stream_async() method with rails.output.streaming.enabled set to True:

rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True

The top-level streaming: True field is deprecated and no longer required. Use stream_async() directly instead.


Usage Examples

Basic Output Rail Streaming

rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True
      chunk_size: 200
      context_size: 50

Parallel Output Rails With Streaming

For parallel execution of multiple output rails during streaming:

rails:
  output:
    parallel: True
    flows:
      - content_safety_check
      - pii_detection
      - hallucination_check
    streaming:
      enabled: True
      chunk_size: 200
      context_size: 50
      stream_first: True

Low-Latency Configuration

For faster time-to-first-token with smaller chunks:

rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True
      chunk_size: 50
      context_size: 20
      stream_first: True

With stream_first: True, the client receives tokens before output rails run. If a rail blocks the content, the user has already received the tokens up to that point. The stream terminates with a JSON error object when a rail detects a violation.

Safety-First Configuration

For maximum safety with rails applied before streaming:

rails:
  output:
    flows:
      - content_safety_check
    streaming:
      enabled: True
      chunk_size: 300
      context_size: 75
      stream_first: False

How It Works

  1. Token Buffering: The system buffers tokens from the LLM until chunk_size tokens accumulate.
  2. Streaming or Rail Execution (depends on stream_first):
    • stream_first: True (default): The client receives the new tokens immediately, then output rails run on the chunk (including context). If the rails block the content, the stream terminates with a JSON error, but the client has already received the tokens up to that point.
    • stream_first: False: Output rails run on the chunk first. The client receives the new tokens only if rails pass. If the rails block the content, the client never receives the tokens.
  3. Context Overlap: The system retains the last context_size tokens from the current chunk and prepends them to the next chunk’s processing context. This gives rails visibility across chunk boundaries.
  4. Blocking: If any rail blocks the content, the stream yields a JSON error object ({"error": {...}}) and terminates immediately.

stream_first: True (default)

Buffer fills to chunk_size
  → Yield new tokens to client (user sees them immediately)
  → Run output rails on [context + new tokens]
      Pass  → continue to next chunk
      Block → yield JSON error, terminate stream

stream_first: False

Buffer fills to chunk_size
  → Run output rails on [context + new tokens]
      Pass  → yield new tokens to client
      Block → yield JSON error, terminate stream (user never sees blocked content)
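
The two flows above can be sketched as a small, self-contained simulation. This is not the library's implementation: stream_with_rails, rail_passes, and the "message" field in the error payload are hypothetical names chosen for illustration, and the rail is a toy function that receives the context plus the new tokens.

```python
import itertools
import json
from typing import Callable, Iterable, Iterator, List


def stream_with_rails(
    tokens: Iterable[str],
    chunk_size: int,
    context_size: int,
    stream_first: bool,
    rail_passes: Callable[[List[str]], bool],
) -> Iterator[str]:
    """Simulate chunked output rail streaming with a toy rail."""
    source = iter(tokens)
    context: List[str] = []  # tail carried over from earlier chunks
    while True:
        # 1. Buffer up to chunk_size new tokens (fewer at end of stream).
        new_tokens = list(itertools.islice(source, chunk_size))
        if not new_tokens:
            return
        if stream_first:
            # 2a. The client sees the tokens before the rail runs.
            yield from new_tokens
        # The rail evaluates the carried-over context plus the new tokens.
        if not rail_passes(context + new_tokens):
            # 4. Blocked: emit a JSON error object and stop the stream.
            yield json.dumps({"error": {"message": "blocked by output rail"}})
            return
        if not stream_first:
            # 2b. The client sees the tokens only after the rail passes.
            yield from new_tokens
        # 3. Keep the last context_size tokens for the next chunk.
        context = (context + new_tokens)[-context_size:] if context_size > 0 else []


def no_secret(window: List[str]) -> bool:
    """Toy rail: block any window that contains the token SECRET."""
    return "SECRET" not in window


demo_tokens = ["a", "b", "c", "SECRET", "e", "f"]
print(list(stream_with_rails(demo_tokens, 2, 1, True, no_secret)))
print(list(stream_with_rails(demo_tokens, 2, 1, False, no_secret)))
```

In this toy run, the stream_first: True variant delivers the blocked token to the client before the error object arrives, while the stream_first: False variant stops after the first chunk and never delivers it.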

Buffer Overlap

The client receives only new tokens. Output rails use the context_size tokens solely for processing context:

Chunk 1: rails process  [token1 ... token200]
         user receives  [token1 ... token200]

Chunk 2: rails process  [token151 ... token200, token201 ... token400]
                         └── context_size ───┘  └──── new tokens ───┘
         user receives  [token201 ... token400]
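
The same windows can be computed for any parameter combination. The sketch below follows the model in the diagram (each chunk holds chunk_size new tokens, preceded by context_size tokens of context) and uses 1-based token numbers; chunk_windows is a hypothetical helper, not a library function.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # inclusive (first_token, last_token), 1-based


def chunk_windows(
    total_tokens: int, chunk_size: int, context_size: int
) -> List[Tuple[Window, Window]]:
    """Return (rail_window, client_window) pairs for each chunk."""
    windows: List[Tuple[Window, Window]] = []
    start = 1
    while start <= total_tokens:
        end = min(start + chunk_size - 1, total_tokens)
        # Rails also see up to context_size tokens before the new ones.
        rail_start = max(1, start - context_size)
        windows.append(((rail_start, end), (start, end)))
        start = end + 1
    return windows


for rail_window, client_window in chunk_windows(400, 200, 50):
    print("rails process", rail_window, "| user receives", client_window)
```

For 400 tokens with the default settings, this reproduces the diagram: rails process tokens 1 to 200 and then 151 to 400, while the user receives 1 to 200 and then 201 to 400.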

Python API

import asyncio

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

messages = [{"role": "user", "content": "Tell me a story"}]


async def main():
    # stream_async() automatically uses output rail streaming when configured
    async for chunk in rails.stream_async(messages=messages):
        print(chunk, end="", flush=True)


# Run the async generator to completion from a regular script
asyncio.run(main())
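
Because a blocked stream ends with a JSON error object ({"error": {...}}), a client may want to tell that object apart from ordinary text chunks. The following is a minimal sketch under the assumption that the error arrives as a single string chunk containing the whole JSON object; parse_rail_error is a hypothetical helper, and the "message" field in the sample data is invented for illustration.

```python
import json
from typing import Optional


def parse_rail_error(chunk: str) -> Optional[dict]:
    """Return the error payload if chunk is a JSON error object, else None."""
    if not chunk.lstrip().startswith("{"):
        return None  # ordinary text, skip the JSON parse
    try:
        data = json.loads(chunk)
    except json.JSONDecodeError:
        return None  # text that merely starts with a brace
    if isinstance(data, dict) and isinstance(data.get("error"), dict):
        return data["error"]
    return None


# Simulated chunks standing in for what stream_async() might yield.
simulated = ["Once ", "upon ", '{"error": {"message": "blocked"}}']
received = []
for chunk in simulated:
    error = parse_rail_error(chunk)
    if error is not None:
        print("\n[stream stopped by output rail]")
        break
    received.append(chunk)
    print(chunk, end="", flush=True)
```

The same check can be placed inside the async for loop over rails.stream_async(). With stream_first: True, the text in received has already been shown to the user, so the client decides whether to retract or replace it. Model output that is itself a JSON object with an "error" key would be misclassified by this sketch.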