Streaming Responses
If the application LLM supports streaming, the NeMo Guardrails library can stream tokens as well. Streaming is automatically enabled when you use the stream_async() method - no configuration is required.
For information about configuring streaming with output guardrails, refer to the following:
- For configuration, refer to Output Rail Streaming.
- For sample Python client code, refer to Tutorials.
Usage
Chat CLI
You can enable streaming when launching the NeMo Guardrails library chat CLI by using the --streaming option:
Python API
You can use the streaming directly from the python API in two ways:
- Simple: receive just the chunks (tokens).
- Full: receive both the chunks as they are generated and the full response at the end.
For the simple usage, you need to call the stream_async method on the LLMRails instance:
For the full usage, you need to provide a StreamingHandler instance to the generate_async method on the LLMRails instance:
Warning: Using
StreamingHandlerdirectly is deprecated and will be removed in a future release. Usestream_async()instead.
Using External Async Token Generators
You can also provide your own async generator that yields tokens, which is useful when:
- You want to use a different LLM provider that has its own streaming API.
- You have pre-generated responses that you want to stream through guardrails.
- You want to implement custom token generation logic.
- You want to test your output rails or its config in streaming mode on predefined responses without actually relying on an actual LLM generation.
To use an external generator, pass it to the generator parameter of stream_async:
When using an external generator:
- The internal LLM generation is completely bypassed.
- Output rails are still applied to the LLM responses returned by the external generator, if configured.
- The generator should yield string tokens.
Example with a real LLM API:
This feature enables seamless integration of the NeMo Guardrails library with any streaming LLM or token source while maintaining all the safety features of output rails.
Streaming Metadata
When using stream_async(), you can receive per-chunk metadata (e.g., token usage, finish reason) by setting include_metadata=True:
With include_metadata=True, each chunk is a dict with a mandatory "text" key. The final chunk also includes a "metadata" key containing response_metadata (finish reason, model name) and usage_metadata (token counts):
Without include_metadata, chunks are plain strings (default behavior).
The include_generation_metadata parameter is deprecated. Use include_metadata instead. It will be removed in version 0.22.0.
Token Usage Tracking
Token usage statistics are available when streaming responses, depending on provider support. When the provider does not return token usage statistics, the final chunk’s metadata will contain response_metadata and usage_metadata set to None.
Accessing Token Usage Information
You can access token usage statistics through the detailed logging capabilities of the NeMo Guardrails library. Use the log generation option to capture comprehensive information about LLM calls, including token usage:
Alternatively, you can use the explain() method to get a summary of token usage:
For more information about streaming token usage support across different providers, refer to the LangChain documentation on token usage tracking. For detailed information about accessing generation logs and token usage, see Generation Options: Detailed Logging Information and Logging.
For streaming while using the Guardrails API server, refer to Chat Completions: Streaming Responses.
Streaming for LLMs Deployed Using HuggingFacePipeline
We also support streaming for LLMs deployed using HuggingFacePipeline.
One example is provided in the HF Pipeline Dolly configuration.
To use streaming for HF Pipeline LLMs, you need to create an nemoguardrails.integrations.langchain.providers.huggingface.AsyncTextIteratorStreamer streamer object,
add it to the kwargs of the pipeline and to the model_kwargs of the HuggingFacePipelineCompatible object.