Streaming LLM Responses in Real-Time
Streaming LLM Responses in Real-Time
The NeMo Guardrails library supports streaming LLM responses in real-time through the stream_async() method. No configuration is required to enable streaming—simply use stream_async() instead of generate_async().
Basic Usage
Streaming With Output Rails
When using output rails with streaming, you must configure output rail streaming:
If output rails are configured but rails.output.streaming.enabled is not set to True, calling stream_async() will raise an StreamingNotSupportedError.
Streaming With Handler
For advanced use cases requiring more control over token processing, you can use a StreamingHandler with generate_async(). The preferred approach for most use cases is stream_async(), but StreamingHandler remains supported:
Server API
Enable streaming in the request body by setting stream to true:
CLI Usage
Use the --streaming flag with the chat command:
Streaming Metadata
Use include_metadata=True in stream_async() to receive per-chunk metadata (token usage, finish reason). See Streaming Metadata for details.
Token Usage Tracking
Access token usage through the log generation option:
HuggingFace Pipeline Streaming
For LLMs deployed using HuggingFacePipeline, additional configuration is required:
This example uses NeMo Guardrails’ LangChain HuggingFace pipeline adapter, which depends on LangChain. It requires NEMOGUARDRAILS_LLM_FRAMEWORK=langchain and the corresponding LangChain HuggingFace provider package.
Related Topics
- Output Rail Streaming - Configure streaming for output rails
- Model Configuration - Configure the main LLM