Streaming LLM Responses in Real-Time | NVIDIA NeMo Guardrails Library Developer Guide

The NeMo Guardrails library supports streaming LLM responses in real time through the stream_async() method. Unless output rails are configured, no configuration is required to enable streaming: use stream_async() instead of generate_async().

Basic Usage

import asyncio

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

messages = [{"role": "user", "content": "Hello!"}]

async def main():
    # Print each chunk as soon as it arrives.
    async for chunk in rails.stream_async(messages=messages):
        print(chunk, end="", flush=True)

asyncio.run(main())
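
Callers that need the complete response as well as the live output usually accumulate chunks while they print them. The following is a minimal sketch of that pattern. `fake_stream` is a hypothetical stand-in for `rails.stream_async(messages=messages)` so that the example runs without a model, and it assumes each chunk is a plain string, as in the example above.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for rails.stream_async(messages=messages)."""
    for chunk in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect(stream: AsyncIterator[str]) -> str:
    """Print each chunk as it arrives and return the full response text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream()))
```

In real use, the call would be `collect(rails.stream_async(messages=messages))` inside an async function.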

Streaming With Output Rails

When using output rails with streaming, you must configure output rail streaming:

rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True

If output rails are configured but rails.output.streaming.enabled is not set to True, calling stream_async() raises a StreamingNotSupportedError.
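
The rule above can be checked before calling stream_async(). The sketch below applies it to a configuration that has already been parsed into a plain dictionary, for example by a YAML loader. `output_streaming_ready` is a hypothetical helper, not part of the library. It only mirrors the condition stated above and does not call into NeMo Guardrails.

```python
from typing import Any, Dict


def output_streaming_ready(config: Dict[str, Any]) -> bool:
    """Return False when output rails are configured without streaming enabled.

    Hypothetical helper that mirrors the documented rule only.
    """
    output = (config.get("rails") or {}).get("output") or {}
    has_output_flows = bool(output.get("flows"))
    streaming_enabled = (output.get("streaming") or {}).get("enabled") is True
    # Without output rails, streaming needs no extra configuration.
    return (not has_output_flows) or streaming_enabled


ok_config = {
    "rails": {
        "output": {
            "flows": ["self check output"],
            "streaming": {"enabled": True},
        }
    }
}
missing_flag = {"rails": {"output": {"flows": ["self check output"]}}}

print(output_streaming_ready(ok_config))     # True
print(output_streaming_ready(missing_flag))  # False
print(output_streaming_ready({}))            # True
```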

Streaming With Handler

For advanced use cases requiring more control over token processing, you can use a StreamingHandler with generate_async(). The preferred approach for most use cases is stream_async(), but StreamingHandler remains supported:

import asyncio

from nemoguardrails import LLMRails, RailsConfig
from nemoguardrails.streaming import StreamingHandler

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

async def main():
    streaming_handler = StreamingHandler()

    async def process_tokens():
        async for chunk in streaming_handler:
            print(chunk, end="", flush=True)

    # Consume chunks concurrently while generate_async() produces them.
    consumer = asyncio.create_task(process_tokens())

    result = await rails.generate_async(
        messages=[{"role": "user", "content": "Hello!"}],
        streaming_handler=streaming_handler,
    )
    # Wait for the consumer to finish printing before returning.
    await consumer
    return result

asyncio.run(main())
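
The handler approach is a producer/consumer arrangement: one task reads chunks while the generation call writes them. The sketch below illustrates only that shape, using the standard library. `QueueHandler` and `fake_generate` are hypothetical stand-ins, not the library's StreamingHandler or generate_async(), and the `None` end-of-stream marker is an assumption made for this example.

```python
import asyncio
from typing import List, Optional


class QueueHandler:
    """Hypothetical stand-in for StreamingHandler: an async-iterable queue."""

    def __init__(self) -> None:
        self._queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()

    async def push(self, chunk: Optional[str]) -> None:
        # None marks the end of the stream (an assumption of this sketch).
        await self._queue.put(chunk)

    def __aiter__(self) -> "QueueHandler":
        return self

    async def __anext__(self) -> str:
        chunk = await self._queue.get()
        if chunk is None:
            raise StopAsyncIteration
        return chunk


async def fake_generate(handler: QueueHandler) -> str:
    """Hypothetical producer standing in for rails.generate_async(...)."""
    tokens = ["Hi", " ", "there"]
    for token in tokens:
        await handler.push(token)
    await handler.push(None)
    return "".join(tokens)


async def main() -> List[str]:
    handler = QueueHandler()
    received: List[str] = []

    async def consume() -> None:
        async for chunk in handler:
            received.append(chunk)

    # Keep a reference to the task so it can be awaited later.
    consumer = asyncio.create_task(consume())
    result = await fake_generate(handler)
    await consumer  # every chunk has been processed once this returns
    assert "".join(received) == result
    return received


chunks = asyncio.run(main())
```

Holding on to the task returned by `asyncio.create_task()` and awaiting it keeps the consumer from being dropped before it has handled the last chunk.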

Server API

Enable streaming in the request body by setting stream to true:

{
    "config_id": "my_config",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
}
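
A client could build and send that body with the Python standard library, as sketched below. The server address and the `/v1/chat/completions` path are assumptions about the deployment, and `build_request` and `stream_chat` are hypothetical helper names. The sketch yields raw response lines without parsing them, because the wire format of the streamed response is not described here.

```python
import json
import urllib.request
from typing import Dict, Iterator, List

# Assumption: a Guardrails server reachable at this address and path.
SERVER_URL = "http://localhost:8000/v1/chat/completions"


def build_request(config_id: str, messages: List[Dict[str, str]]) -> urllib.request.Request:
    """Build a POST request whose JSON body sets "stream" to true."""
    body = {"config_id": config_id, "messages": messages, "stream": True}
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_chat(request: urllib.request.Request) -> Iterator[str]:
    """Send the request and yield the response body line by line (unparsed)."""
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            yield raw_line.decode("utf-8")


request = build_request("my_config", [{"role": "user", "content": "Hello!"}])
print(request.data.decode("utf-8"))  # the JSON body that would be sent
```

With a server running, `for piece in stream_chat(request): print(piece, end="")` would print the response as it arrives.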

CLI Usage

Use the --streaming flag with the chat command:

$ nemoguardrails chat path/to/config --streaming

Streaming Metadata

Use include_metadata=True in stream_async() to receive per-chunk metadata (token usage, finish reason). See Streaming Metadata for details.

Token Usage Tracking

Access token usage through the log generation option:

response = rails.generate(messages=messages, options={
    "log": {
        "llm_calls": True
    }
})

for llm_call in response.log.llm_calls:
    print(f"Total tokens: {llm_call.total_tokens}")
    print(f"Prompt tokens: {llm_call.prompt_tokens}")
    print(f"Completion tokens: {llm_call.completion_tokens}")
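
Because the log lists each LLM call separately, a per-request total has to be summed across the entries. The sketch below shows one way to do that. `CallUsage` is a hypothetical stand-in exposing only the three fields used above, and `sum_usage` is an illustrative helper, not a library function. Treating a missing count as zero is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class CallUsage:
    """Hypothetical stand-in for one entry of response.log.llm_calls."""
    total_tokens: Optional[int]
    prompt_tokens: Optional[int]
    completion_tokens: Optional[int]


def sum_usage(llm_calls: Iterable[CallUsage]) -> Dict[str, int]:
    """Add up token counts over all calls made for one request."""
    totals = {"total_tokens": 0, "prompt_tokens": 0, "completion_tokens": 0}
    for call in llm_calls:
        for field in totals:
            # Treat a missing (None) count as zero.
            totals[field] += getattr(call, field) or 0
    return totals


calls = [CallUsage(30, 20, 10), CallUsage(12, 7, 5), CallUsage(None, None, None)]
totals = sum_usage(calls)
print(totals)
```

In real use, `sum_usage(response.log.llm_calls)` would take the place of the stand-in list.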

HuggingFace Pipeline Streaming

For LLMs deployed using HuggingFacePipeline, additional configuration is required:

from transformers import pipeline

from nemoguardrails.integrations.langchain.providers.huggingface import AsyncTextIteratorStreamer

# `tokenizer` and `HuggingFacePipelineCompatible` are assumed to be
# loaded and imported elsewhere; they are not defined in this snippet.

# Create streamer with tokenizer
streamer = AsyncTextIteratorStreamer(tokenizer, skip_prompt=True)
params = {"temperature": 0.01, "max_new_tokens": 100, "streamer": streamer}

pipe = pipeline(
    # other parameters
    **params,
)

llm = HuggingFacePipelineCompatible(pipeline=pipe, model_kwargs=params)

This example uses NeMo Guardrails’ LangChain HuggingFace pipeline adapter, which depends on LangChain. It requires NEMOGUARDRAILS_LLM_FRAMEWORK=langchain and the corresponding LangChain HuggingFace provider package.

Output Rail Streaming - Configure streaming for output rails
Model Configuration - Configure the main LLM