Streaming in ACE Agent#
Most bots built on ACE Agent use an LLM to generate the bot response. The LLM response may come from NVIDIA NeMo Guardrails (for example, Colang-based bots like the Stock bot) or from the Plugin server (for example, RAG or LangChain-based bots). In either of these cases, the perceived latency can be reduced by streaming the response from ACE Agent to the client as it is being generated.
ACE Agent Event Interface and Colang 2.0 bots only support sentence level streaming, while Colang 1.0 bots support token level streaming.
Streaming in Colang 2.0 Bots#
Text Streaming#
In Colang 2.0 scripts, all bot actions are communicated using UMIM events, which are picked up by the various microservices that execute them (the microservices exchange these events over Redis streams). Text responses from the bot are communicated using UtteranceBotAction events. The Chat Controller microservice needs at least a sentence or partial sentence to generate TTS, so Colang 2.0 bots do not use token level streaming for bot responses; instead, bot developers are expected to break the bot response into multiple UtteranceBotAction events, as shown in the sketch below. Flows such as llm continue interaction from the Colang llm library break the bot response into multiple actions by default.
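For illustration, here is a minimal Colang 2.0 sketch of this pattern, assuming the core library flows are imported and the flow is activated from main; the flow name and utterance text are placeholders and not part of ACE Agent or the Colang standard library.

```
flow handle user request
  user said something
  # Each bot say emits a separate UtteranceBotAction event, so the Chat Controller
  # can start TTS for the first sentence while the rest of the answer is still pending.
  bot say "Let me check that for you."
  bot say "Here is what I found in the documentation."
```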
If you are generating a text response using the Plugin server, the plugin endpoint should return a streaming response to minimize latencies.
If the Plugin server is called by the bot and returns a non-streaming response, use the InvokeFulfillmentAction action to get the response. If the response from the Plugin server is a text stream, use the InvokeStreamingFulfillmentAction action to start gathering the streaming text chunks. You can then call StreamingResponseFulfillmentAction to receive all chunks as a single response, or pass a regex pattern to break the response into sentences or even partial sentences.

```
# Invoke endpoint from plugin
$started = await InvokeStreamingFulfillmentAction(question=$transcript, endpoint="your/endpoint")
if $started
  # Get first sentence from plugin response
  $response = await StreamingResponseFulfillmentAction(endpoint="your/endpoint", pattern=r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![0-9]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s")
  while $response
    bot say $response
    # Check for next sentence
    $response = await StreamingResponseFulfillmentAction(endpoint="your/endpoint", pattern=r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![0-9]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s")
```
If the response from the Plugin server is a stream of JSON responses in the Chat Engine response schema, the JSON chunks are parsed and their Response.text attribute is used. You can use the InvokeStreamingChatAction action to start gathering the streaming JSON chunks, and call the StreamingResponseChatAction action repeatedly to receive the next sentence.

```
# Invoke /chat endpoint from plugin
$started = await InvokeStreamingChatAction(question=$transcript, endpoint="rag/chat", chat_history=True)
if $started
  # Get first sentence from RAG response
  $response = await StreamingResponseChatAction(endpoint="rag/chat")
  while $response
    bot say $response
    # Check for next sentence
    $response = await StreamingResponseChatAction(endpoint="rag/chat")
```
The pattern used for breaking sentences in these two examples is r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![0-9]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s". You can set the pattern parameter of StreamingResponseChatAction or StreamingResponseFulfillmentAction to use your own splitting pattern, as in the snippet below.
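For example, a hypothetical call that splits the streamed text on newlines instead of sentence boundaries could look like this (the endpoint name is a placeholder):

```
# Hypothetical: split the streamed plugin response on newlines rather than on sentence boundaries.
$response = await StreamingResponseFulfillmentAction(endpoint="your/endpoint", pattern=r"\n")
```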
Streaming in Colang 1.0 Bots#
Text Streaming#
The ACE Agent Chat Engine uses a streaming handler as a common interface between NeMo Guardrails and the Plugin server. If streaming is enabled, the streaming handler is attached to the NeMo Guardrails context for each query. Any chunks pushed to the streaming handler are post-processed and converted to the response schema of the /chat or /event endpoint, depending on the endpoint that received the request.
For cases in which the response is formed using NeMo Guardrails, the generate_bot_message action is called. If an LLM is used for bot response generation, the streaming endpoint of the LLM is called and the streamed chunks are added to the streaming handler.
To form a response from the Plugin server, the plugin endpoint should return a streaming response.
If the Plugin server is called by the bot and returns a non-streaming response, the response is not added to the stream.
If the response from the Plugin server is a text stream, the streamed chunks are added to the streaming handler by default.
If the response from the Plugin server is a stream of JSON responses in the Chat Engine response schema, the JSON chunks will be parsed and their Response.text attribute will be added to the streaming handler by default.
The default behavior for streaming responses from the Plugin server is to add the chunks to the streaming handler. However, this can be disabled using the streaming argument of the plugin or chat_plugin action in your Colang files.
```
define flow
  user …
  $answer = execute plugin(endpoint="/your/endpoint", streaming=False)
  …
```
The ACE Agent Chat Engine has in-built protections to handle cases when the bot uses a static response template or does not create a text response at all. If your bot uses streaming responses in some cases and non-streaming responses in others, it is still beneficial to keep streaming enabled: the Chat Engine pushes a static response to the streaming handler as a single chunk.
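As an illustration, a minimal Colang 1.0 sketch that mixes both cases might look like the following; the flow names, message definitions, and endpoint are placeholders, and the plugin endpoint is assumed to return a text stream.

```
define user express greeting
  "hello"
  "hi there"

define bot express greeting
  "Hello! How can I help you today?"

define user ask question
  "what is the latest news"

define flow greeting
  user express greeting
  # Static template response - pushed to the streaming handler as a single chunk.
  bot express greeting

define flow answer question
  user ask question
  # Streaming plugin response - chunks are pushed to the handler as they arrive.
  $answer = execute plugin(endpoint="/your/endpoint")
  bot $answer
```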
Streaming Exceptions#
There are certain cases in which streaming is disabled in ACE Agent:

- If the LLM used by the bot does not support streaming.
- If Output Rails are enabled in NeMo Guardrails.

In either of these cases, streaming will be disabled during bot initialization, even if streaming is enabled in the bot config file.
TTS Streaming#
For bots that use speech or avatars, the ACE Agent Chat Controller is responsible for interacting with ACE Agent and processing the streaming response. The ACE Agent Chat Controller uses streaming by default, but this can be overridden in the speech_config.yaml file in the bot directory.
```
dialog_manager:
  DialogManager:
    server: "http://localhost:9000"
    use_streaming: false
```
If streaming is enabled in both the bot and the Chat Controller, the Chat Controller reads the incoming stream and breaks it at every sentence boundary. Each sentence is then streamed to the client as text. If the query was a speech query, each sentence is also passed to the TTS module and the resulting TTS audio buffers are streamed to the client. In practice, for long LLM responses, the first text and TTS chunks are received after the LLM generates the first sentence instead of after the full response has been generated.