Streaming LLM Responses in Real-Time

The NeMo Guardrails library supports streaming LLM responses in real time through the stream_async() method. When no output rails are configured, no additional configuration is required: call stream_async() instead of generate_async().

Basic Usage

import asyncio

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

messages = [{"role": "user", "content": "Hello!"}]


async def main():
    # Print each chunk as soon as it arrives.
    async for chunk in rails.stream_async(messages=messages):
        print(chunk, end="", flush=True)


asyncio.run(main())
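
Callers often need the complete response as well as the live output. The following is a minimal sketch of one way to do that. It assumes the stream yields plain string chunks, as in the basic usage above. The collect_stream helper and the fake_stream generator are hypothetical names introduced for illustration; fake_stream stands in for rails.stream_async(messages=messages) so the sketch runs without a model.

```python
import asyncio
from typing import AsyncIterator


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for rails.stream_async(messages=messages)."""
    for chunk in ["Hello", ", ", "world!"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


full_text = asyncio.run(collect_stream(fake_stream()))
```

With a real configuration, passing rails.stream_async(messages=messages) to collect_stream from inside a coroutine should behave the same way, provided the chunks are strings.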

Streaming With Output Rails

When using output rails with streaming, you must configure output rail streaming:

rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True

If output rails are configured but rails.output.streaming.enabled is not set to True, calling stream_async() raises a StreamingNotSupportedError.
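
The rule above can be checked before any streaming call is made. The sketch below works on a plain dictionary that mirrors the YAML layout shown earlier, not on the library's own configuration object. The function name output_streaming_ready is hypothetical, and the check encodes only the rule stated in this section.

```python
def output_streaming_ready(config: dict) -> bool:
    """Return True if stream_async() should be usable under the stated rule.

    `config` is a plain dict mirroring the YAML layout (an assumption of
    this sketch), e.g. the result of parsing config.yml.
    """
    output = (config.get("rails") or {}).get("output") or {}
    has_output_flows = bool(output.get("flows"))
    streaming_enabled = bool((output.get("streaming") or {}).get("enabled"))
    # Streaming is fine without output rails, or when output rail
    # streaming is explicitly enabled.
    return (not has_output_flows) or streaming_enabled


enabled_cfg = {
    "rails": {
        "output": {
            "flows": ["self check output"],
            "streaming": {"enabled": True},
        }
    }
}
missing_cfg = {"rails": {"output": {"flows": ["self check output"]}}}

print(output_streaming_ready(enabled_cfg))
print(output_streaming_ready(missing_cfg))
```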


Streaming With Handler

For advanced use cases requiring more control over token processing, you can use a StreamingHandler with generate_async(). The preferred approach for most use cases is stream_async(), but StreamingHandler remains supported:

import asyncio

from nemoguardrails import LLMRails, RailsConfig
from nemoguardrails.streaming import StreamingHandler

config = RailsConfig.from_path("./config")
rails = LLMRails(config)


async def process_tokens(streaming_handler):
    # Print each chunk as the handler receives it.
    async for chunk in streaming_handler:
        print(chunk, end="", flush=True)


async def main():
    streaming_handler = StreamingHandler()

    # Schedule the consumer so chunks are printed while generation runs.
    consumer = asyncio.create_task(process_tokens(streaming_handler))

    result = await rails.generate_async(
        messages=[{"role": "user", "content": "Hello!"}],
        streaming_handler=streaming_handler,
    )


asyncio.run(main())
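
The handler approach is a producer/consumer arrangement: a consumer task reads chunks while generation produces them. The sketch below illustrates only that pattern, using an asyncio.Queue as a stand-in. It does not use StreamingHandler, and the END sentinel, consume, fake_generate, and run_demo names are all hypothetical.

```python
import asyncio

END = None  # sentinel marking the end of the stream (assumption of this sketch)


async def consume(queue: asyncio.Queue, out: list) -> None:
    """Drain chunks from the queue until the sentinel arrives."""
    while True:
        chunk = await queue.get()
        if chunk is END:
            break
        out.append(chunk)


async def fake_generate(queue: asyncio.Queue) -> str:
    """Stand-in for generation: pushes chunks, then the sentinel."""
    chunks = ["Hel", "lo", "!"]
    for chunk in chunks:
        await queue.put(chunk)
    await queue.put(END)
    return "".join(chunks)


async def run_demo() -> tuple:
    queue = asyncio.Queue()
    received = []
    # Schedule the consumer, then run the producer to completion.
    consumer = asyncio.create_task(consume(queue, received))
    result = await fake_generate(queue)
    await consumer  # wait until every chunk has been processed
    return result, received


result, received = asyncio.run(run_demo())
```

Keeping a reference to the consumer task and awaiting it makes it explicit when all chunks have been handled.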

Server API

Enable streaming in the request body by setting stream to true:

{
  "config_id": "my_config",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}
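
A client can build this request body with the standard library alone. The sketch below constructs the HTTP request without sending it. The URL is an assumption for illustration (a locally running server exposing a chat completions endpoint), and build_stream_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_stream_request(url: str, config_id: str, messages: list) -> urllib.request.Request:
    """Build a POST request whose JSON body sets "stream" to true."""
    body = {"config_id": config_id, "messages": messages, "stream": True}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_stream_request(
    "http://localhost:8000/v1/chat/completions",  # assumed URL, adjust to your server
    "my_config",
    [{"role": "user", "content": "Hello!"}],
)
print(request.data.decode("utf-8"))
```

Sending it with urllib.request.urlopen(request) and reading the response incrementally would then yield the streamed output; the wire format of that response is not covered here.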

CLI Usage

Use the --streaming flag with the chat command:

$ nemoguardrails chat path/to/config --streaming

Streaming Metadata

Use include_metadata=True in stream_async() to receive per-chunk metadata (token usage, finish reason). See Streaming Metadata for details.

Token Usage Tracking

Access token usage through the log generation option:

response = rails.generate(messages=messages, options={
    "log": {
        "llm_calls": True
    }
})

for llm_call in response.log.llm_calls:
    print(f"Total tokens: {llm_call.total_tokens}")
    print(f"Prompt tokens: {llm_call.prompt_tokens}")
    print(f"Completion tokens: {llm_call.completion_tokens}")
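
A single request can involve several LLM calls, so a per-request total is often more useful than per-call figures. The sketch below sums the three fields used above. FakeLLMCall is a hypothetical stand-in for the entries of response.log.llm_calls, and treating a missing count as zero is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeLLMCall:
    """Hypothetical stand-in exposing the three fields used above."""
    total_tokens: Optional[int] = None
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None


def summarize_usage(llm_calls: Iterable) -> dict:
    """Sum token counts across calls, treating missing values as 0."""
    totals = {"total_tokens": 0, "prompt_tokens": 0, "completion_tokens": 0}
    for call in llm_calls:
        for name in totals:
            totals[name] += getattr(call, name, None) or 0
    return totals


calls = [FakeLLMCall(30, 20, 10), FakeLLMCall(12, 7, 5), FakeLLMCall()]
usage = summarize_usage(calls)
print(usage)
```

Passing response.log.llm_calls in place of calls should work the same way, since the helper only reads those three attributes.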

HuggingFace Pipeline Streaming

For LLMs deployed using HuggingFacePipeline, additional configuration is required:

from transformers import pipeline

from nemoguardrails.integrations.langchain.providers.huggingface import AsyncTextIteratorStreamer

# `tokenizer` and `HuggingFacePipelineCompatible` are assumed to be
# defined or imported elsewhere.

# Create streamer with tokenizer
streamer = AsyncTextIteratorStreamer(tokenizer, skip_prompt=True)
params = {"temperature": 0.01, "max_new_tokens": 100, "streamer": streamer}

pipe = pipeline(
    # other parameters
    **params,
)

llm = HuggingFacePipelineCompatible(pipeline=pipe, model_kwargs=params)

This example uses NeMo Guardrails’ LangChain HuggingFace pipeline adapter, which depends on LangChain. It requires NEMOGUARDRAILS_LLM_FRAMEWORK=langchain and the corresponding LangChain HuggingFace provider package.