Output Rail Streaming Configuration
Configure how output rails process streamed tokens under rails.output.streaming.
Configuration
Parameters
Tips for Setting Parameters
enabled
Set this to True when output rails are configured and you want to use stream_async().
If it is not enabled, calling stream_async() with output rails configured results in an error.
chunk_size
The number of tokens buffered before output rails run.
- Larger values: Fewer rail executions, but higher latency to first output
- Smaller values: More rail executions, but faster time-to-first-token
Default: 200 tokens
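To make the trade-off concrete, the following sketch estimates how many times output rails run for a response of a given length. It is an illustration only: `rail_executions` is a hypothetical helper, not part of the library, and it assumes that rails also run once on a trailing partial chunk.

```python
import math


def rail_executions(total_tokens: int, chunk_size: int = 200) -> int:
    """Approximate number of output rail runs for a response of total_tokens.

    Assumption: one run per full chunk, plus one run for a trailing partial chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(total_tokens / chunk_size)


# Compare a few chunk sizes for a 1000-token response.
for size in (50, 200, 500):
    print(f"chunk_size={size}: {rail_executions(1000, size)} rail executions")
```

Each rail execution adds processing cost, while each chunk boundary is an opportunity to deliver tokens to the client, so the right value depends on how expensive the configured rails are.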
context_size
The number of tokens from the previous chunk carried over to provide context for the next chunk.
This helps output rails make consistent decisions across chunk boundaries. For example, if a sentence spans two chunks, the carried-over context lets the rail see the part of the sentence that arrived in the previous chunk.
Default: 50 tokens
stream_first
Controls when tokens are streamed relative to output rail processing:
- True (default): The client receives each chunk of tokens before output rails process that chunk. This provides faster time-to-first-token, but if a rail blocks the content, the user has already received the tokens. The stream terminates with a JSON error on violation.
- False: Output rails process each chunk before the client receives tokens. The user never sees blocked content, but time-to-first-token increases by the rail execution time per chunk.
Requirements
Output rail streaming requires using the stream_async() method:
The top-level streaming: True field is deprecated and no longer required. Use stream_async() directly instead.
Usage Examples
Basic Output Rail Streaming
Parallel Output Rails With Streaming
For parallel execution of multiple output rails during streaming:
Low-Latency Configuration
For faster time-to-first-token with smaller chunks:
With stream_first: True, the client receives tokens before output rails run. If a rail blocks the content, the user has already received the tokens up to that point. The stream terminates with a JSON error object when it detects a violation.
Safety-First Configuration
For maximum safety with rails applied before streaming:
How It Works
- Token Buffering: The system buffers tokens from the LLM until chunk_size tokens accumulate.
- Streaming or Rail Execution (depends on stream_first):
  - stream_first: True (default): The client receives the new tokens immediately, then output rails run on the chunk (including context). If the rails block the content, the stream terminates with a JSON error, while the client receives the tokens up to that point.
  - stream_first: False: Output rails run on the chunk first. The client receives the new tokens only if rails pass. If the rails block the content, the client never receives the tokens.
- Context Overlap: The system retains the last context_size tokens from the current chunk and prepends them to the next chunk’s processing context. This gives rails visibility across chunk boundaries.
- Blocking: If any rail blocks the content, the stream yields a JSON error object ({"error": {...}}) and terminates immediately.
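The steps above can be sketched as a small, self-contained simulation. This is not the library's implementation. `stream_with_rails` and `rail_ok` are hypothetical names, a "token" is simply a list item, and the contents of the error object are a placeholder, since only its `{"error": {...}}` shape is described here.

```python
import json
from typing import Callable, Iterable, Iterator, List


def stream_with_rails(
    tokens: Iterable[str],
    rail_ok: Callable[[List[str]], bool],
    chunk_size: int = 200,
    context_size: int = 50,
    stream_first: bool = True,
) -> Iterator[str]:
    """Yield what the client would receive (simulation, not library code)."""
    pending = iter(tokens)
    context: List[str] = []  # tail of the previous chunk
    buffer: List[str] = []   # new tokens of the current chunk
    done = False
    while not done:
        # Token buffering: collect up to chunk_size new tokens.
        for token in pending:
            buffer.append(token)
            if len(buffer) == chunk_size:
                break
        else:
            done = True  # the LLM stream is exhausted
        if not buffer:
            break
        window = context + buffer  # rails see context plus new tokens
        if stream_first:
            yield from buffer       # client gets the tokens before the check
            passed = rail_ok(window)
        else:
            passed = rail_ok(window)
            if passed:
                yield from buffer   # client gets the tokens only after a pass
        if not passed:
            # Blocking: emit a JSON error object (placeholder payload) and stop.
            yield json.dumps({"error": {"message": "blocked by output rail"}})
            return
        # Context overlap: keep the tail of this chunk for the next one.
        context = buffer[-context_size:] if context_size > 0 else []
        buffer = []


tokens = list("abcdefgh")


def no_e(window: List[str]) -> bool:
    """Toy rail: block any window that contains the token 'e'."""
    return "e" not in window


print(list(stream_with_rails(tokens, no_e, chunk_size=3, context_size=1, stream_first=True)))
print(list(stream_with_rails(tokens, no_e, chunk_size=3, context_size=1, stream_first=False)))
```

With `stream_first=True`, the client receives the second chunk, including the offending token, before the error arrives. With `stream_first=False`, the client receives only the first chunk followed by the error.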
Buffer Overlap
The client receives only new tokens. Output rails use the context_size tokens solely for processing context:
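As an illustration of the overlap, the sketch below lists, for each chunk, what the rails evaluate and what the client receives. `rail_windows` is a hypothetical helper written for this example, and the small chunk_size and context_size values are chosen only to keep the output readable.

```python
from typing import List, Tuple


def rail_windows(
    tokens: List[str], chunk_size: int = 200, context_size: int = 50
) -> List[Tuple[List[str], List[str]]]:
    """Return one (context, new_tokens) pair per chunk."""
    windows = []
    context: List[str] = []
    for start in range(0, len(tokens), chunk_size):
        new_tokens = tokens[start:start + chunk_size]
        windows.append((context, new_tokens))
        # Carry the tail of this chunk over as context for the next chunk.
        context = new_tokens[-context_size:] if context_size > 0 else []
    return windows


tokens = [f"t{i}" for i in range(10)]
windows = rail_windows(tokens, chunk_size=4, context_size=2)
for context, new_tokens in windows:
    print("rails evaluate:", context + new_tokens, "| client receives:", new_tokens)
```

The context tokens appear in two consecutive rail evaluations, but the client receives each token exactly once.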
Python API
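A minimal sketch of how client code might consume a stream and react to a blocked response, under stated assumptions. `fake_stream` is a stand-in for the async iterator returned by stream_async(), `parse_rail_error` and `consume` are hypothetical helpers, and the fields inside the error object are placeholders. The only behavior taken from this page is that a violation ends the stream with a JSON object containing an `error` key.

```python
import asyncio
import json
from typing import AsyncIterator, Optional, Tuple


def parse_rail_error(chunk: str) -> Optional[dict]:
    """Return the error payload if chunk is a JSON object with an "error" key."""
    try:
        data = json.loads(chunk)
    except (json.JSONDecodeError, TypeError):
        return None  # ordinary text chunk
    if isinstance(data, dict) and "error" in data:
        return data["error"]
    return None


async def consume(stream: AsyncIterator[str]) -> Tuple[str, Optional[dict]]:
    """Collect streamed text until the stream ends or a rail error arrives."""
    received = []
    async for chunk in stream:
        error = parse_rail_error(chunk)
        if error is not None:
            return "".join(received), error
        received.append(chunk)
    return "".join(received), None


async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for a real token stream; the error payload is a placeholder."""
    for chunk in ["Hello", ", ", "world", json.dumps({"error": {"message": "blocked"}})]:
        yield chunk


text, error = asyncio.run(consume(fake_stream()))
print("received:", text)
print("error:", error)
```

In an application, the stand-in would be replaced by the stream obtained from stream_async(). With stream_first set to True, the text collected before the error has already been shown to the user, so the client may need to retract or replace it.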
Related Topics
- Global Streaming - Enable LLM streaming
- Guardrails Configuration - Configure output rail flows