Tokkio LLM-RAG#
Introduction#
The Tokkio LLM-RAG sample application is designed to facilitate live interactions with avatars using popular Large Language Model (LLM) and Retrieval-Augmented Generation (RAG) solutions.
For example, you may connect Tokkio with LLMs and RAGs from common platforms such as NVIDIA NIM, OpenAI, NVIDIA Generative AI Examples, and NVIDIA Enterprise RAG. Users may also connect third-party LLMs or RAGs. This flexibility is detailed in the RAG Endpoint Customization section.
If you have followed the Quick Start Guide to deploy the simplest version of Tokkio, you have already deployed Tokkio LLM-RAG backed by an LLM.
Use the Default RAG Pipeline#
The default version of Tokkio LLM-RAG, deployed in the Quick Start Guide, leverages an LLM to generate responses to user queries.
To use the Tokkio LLM-RAG reference app with a RAG pipeline, the user must:
Deploy a RAG server
Ingest the desired documents into the RAG
Configure Tokkio to use the RAG endpoint
Here, we introduce how to set up and deploy a RAG server from an NVIDIA example repository.
It is important to note that Tokkio LLM-RAG requires a specific response schema from the RAG endpoint.
The default support for an external RAG server adheres to the schema from the NVIDIA Generative AI Examples.
The RAG from NVIDIA Generative AI Examples refers to a RAG server that uses NVIDIA-hosted embedding, reranking, and LLM endpoints to respond to the user with enhanced context from the ingested documents.
Please refer to NVIDIA Generative AI Examples and follow the instructions in the README to deploy the RAG server and ingest the relevant documents.
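Once the RAG server is up and the documents are ingested, you can optionally sanity-check it before wiring it into Tokkio. The snippet below is a minimal sketch: it assumes the RAG chain server is reachable at http://localhost:8081/generate and accepts the request schema shown later on this page, so adjust the URL, port, and fields to match your deployment.
import requests
# Minimal sanity check of the deployed RAG server.
# The URL and endpoint path below are assumptions -- adjust them to your deployment.
RAG_URL = "http://localhost:8081/generate"
payload = {
    "messages": [{"role": "user", "content": "What topics do the ingested documents cover?"}],
    "use_knowledge_base": True,  # answer using the ingested documents
    "max_tokens": 256,
    "stream": True,              # the Tokkio plugin also requests a streamed response
}
with requests.post(RAG_URL, json=payload, stream=True, timeout=60) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if line:
            print(line)  # raw chunks; the exact chunk format depends on the RAG server
If the server answers with content drawn from your ingested documents, it is ready to be connected to Tokkio.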
Now, follow the RAG Endpoint Customization section in Customize Reference Workflows to use the RAG server instead of an LLM in Tokkio.
Understanding the Plugin Server Resource in Tokkio#
Now that you have been introduced to the LLM-RAG reference workflow and how to set up a default RAG, we will walk you through more advanced modifications of the plugin server resource.
Tokkio contains many microservices, one of which is the plugin server microservice. This microservice leverages the plugin server resource. Please refer to the diagram at Microservices to understand how the plugin server fits into the broader Tokkio architecture.
The plugin server resource integrates seamlessly into the larger ACE architecture by providing configurable endpoints for interacting with external RAG or LLM pipelines and defining multimodal flows with the Colang coding language to handle complex queries.
These configurations and flows are utilized by the plugin server microservice to dynamically process requests, route them through the appropriate pipelines, and parse responses.
The plugin server resource contains the necessary code to communicate with the external LLM or RAG server and incorporate the output into the Tokkio avatar.
Plugin Resource Source Code#
The source code for the Tokkio LLM-RAG-specific plugin server resource can be found on NGC here.
File Breakdown#
Once you download the materials, take note of the following resources:
actions.py - contains custom actions that the Colang code can call, such as calling the plugin server.
asr_words_to_boost_conformer.txt - a default list of words for ASR word boosting.
cmudict_ipa.txt - allows you to customize the pronunciation of specific words.
/colang - all the files in the colang folder define the flows for handling various queries in the LLM bot with multimodality.
model_config.yaml - the definition of the model to use.
plugin/rag.py - contains the definition of the Python client of the external RAG or LLM. The /chat endpoint is implemented in this file.
plugin/schemas.py - contains the request/response schema definitions expected by the /chat endpoint (an illustrative sketch follows this list).
plugin_config.yaml - contains the definition of the ACE plugin server configuration.
speech_config.yaml - contains the speech configuration parameters for the application.
tokkio_rag_bot_config - contains the ACE bot definition.
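For orientation, the following is a minimal, purely illustrative sketch of the kind of Pydantic models defined in plugin/schemas.py. The field names and types here are assumptions, not the shipped schema; always consult the actual file for the exact contract of the /chat endpoint.
from typing import Optional
from pydantic import BaseModel, Field
# Illustrative only -- the real models live in plugin/schemas.py and may differ.
class ChatRequest(BaseModel):
    Query: str = Field(..., description="User utterance to answer")
    UserId: str = Field(..., description="Identifier of the user or stream the query belongs to")
class ChatResponse(BaseModel):
    Response: Optional[str] = Field(None, description="Bot answer text, possibly streamed in chunks")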
Plugin Server Resource Customization#
The LLM-RAG plugin server resource source code may be edited to customize the reference application.
Follow the steps below to incorporate a customization:
Pick a customization from the list below
Implement the customization
Refer to Plugin Resource Customization to reflect the customization in the Tokkio deployment.
The following customizations are available:
Using a Custom RAG Pipeline
Adding Filler Sentences to the Bot Response
Changing the Avatar Name and Greeting
Updating the Bot Gestures
Interruptions
Small Talk
Proactive Bot
Using a Custom RAG Pipeline#
The rag.py file inside the /plugin directory defines a FastAPI application with an API router that manages endpoints for configuring the RAG server, querying language models, and streaming chat responses.
The /chat endpoint handles user-driven chat requests and dynamically routes them through RAG or LLM services based on configuration.
To integrate with a custom RAG server, modifications are needed in the /chat API of rag.py. Specifically, the /chat endpoint, which receives RAG requests and communicates with the RAG pipeline via the stream() function, should parse the response according to the schema of the custom RAG server.
The request to the RAG endpoint is constructed within the private stream() function. In the sample application, the request is populated with specific fields; you should update these request and response schemas as needed to align with the requirements of the custom RAG pipeline:
request_json = {
"messages": chat_history + [{"role": "user", "content": question}],
"use_knowledge_base": True,
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": num_tokens,
"seed": 42,
"stop": STOP_WORDS,
"stream": True,
}
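As one illustration, suppose a custom RAG streams newline-delimited JSON chunks with the generated text under an answer key (an assumed format, not that of the NVIDIA Generative AI Examples server). The parsing logic inside stream() could then be adapted along the lines of the sketch below, which uses aiohttp purely for illustration:
import json
import aiohttp
# Sketch only: adapt the response parsing in stream() to a custom RAG that streams
# newline-delimited JSON chunks such as {"answer": "..."} (an assumed schema).
async def stream_custom_rag(url: str, request_json: dict):
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=request_json) as resp:
            resp.raise_for_status()
            async for raw_line in resp.content:  # iterate over streamed lines
                line = raw_line.decode("utf-8").strip()
                if not line:
                    continue
                chunk = json.loads(line)
                text = chunk.get("answer", "")  # adjust the key to your RAG's schema
                if text:
                    yield text
The yielded text chunks can then be forwarded to the avatar in the same way as the default implementation.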
Adding Filler Sentences to the Bot Response#
Filler words or sentences can be added to a bot workflow to mask the latency of fetching the response. This can be done by updating the Colang files to add the flow demonstrated below, which lets the bot say one of the filler sentences before replying with the response from the API (RAG/LLM pipeline).
If the RAG response takes longer than the filler sentence, there will still be some silence after the filler sentence finishes. If the RAG response arrives before the filler sentence finishes, it will be spoken only once the filler sentence completes.
The code snippet below can be added as part of colang/main.co or in a separate Colang file, for example filler_words.co. The flow needs to be activated in order to take effect; refer to the active flows in main.co of the bot for more information.
# Call this flow before fetching a response expected to have high latency
start bot say filler sentence as $filler
# <RAG or LLM call that is expected to be of high latency>
match $filler.Finished()
# Sample implementation of filler sentence
flow bot say filler sentence
bot say "Let me think about it"
or bot say "That's a great question! Let me find the data for you"
or bot say "Hmmm"
Changing the Avatar Name and Greeting#
The name and greetings for the bots are configured in colang/events.co of the plugin source. One can customize these flows for a greeting or name of choice. Note that in the default implementation, these flows are triggered when the user enters/leaves the camera view.
@meta(bot_intent=True)
flow bot express greeting
(bot express "Hi, I am Taylor. How can I help you?"
or bot express "Welcome! My name is Taylor. Ask away!"
or bot express "Hello! How can I help you today?")
and bot gesture "Wave with one hand"
@meta(bot_intent=True)
flow bot express goodbye
(bot say "Bye" or bot say "Goodbye" or bot say "Until next time!") and bot gesture "wave"
Updating the Bot Gestures#
One can generate gestures matching the responses spoken by the bot. A Colang flow for enabling this is already provided in colang/main.co, but it is not activated by default. Similarly, any other flows to generate gestures can be added to the Colang files.
@loop("generate_gestures")
flow attempt to infer gesture $question $response
Interruptions#
The bot supports being interrupted by the user. This is especially useful in a RAG context, where responses to certain queries can be long. The bot supports two types of interruptions:
The user can interrupt the response of the avatar by saying “stop”, “stop talking” or “cancel”. This will stop the RAG response and acknowledge the user’s request. Please note that the identification of these short requests is handled based on a regular expression that you can update in the colang/main.co file.
The avatar can also be stopped by asking a new question. In that case, the current RAG response will be stopped and the RAG will be queried with the new user request.
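For illustration only, the kind of pattern used to detect these short stop requests resembles the following standalone sketch; the actual expression lives in colang/main.co and may differ.
import re
# Standalone illustration of a "stop request" pattern -- not the exact expression
# shipped in colang/main.co.
STOP_REQUEST = re.compile(r"^\s*(stop( talking)?|cancel)[.!]?\s*$", re.IGNORECASE)
for utterance in ["Stop talking", "cancel", "Tell me more about warranties"]:
    print(utterance, "->", bool(STOP_REQUEST.match(utterance)))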
Small Talk#
Small talk question-answer pairs can be added to the Colang script. Users can add their own small talk examples in a dedicated Colang file. You can find more details in the Colang Language Reference on how to handle user intents.
Proactive Bot#
By default, the bot is configured to be proactive if the user does not respond within a certain time. In that case, the RAG is queried to provide an encouraging response that re-engages the user with the interaction. This behavior can be customized in the colang/main.co file:
orwhen user didnt respond 20.0  # Change what happens if the user does not respond
TTS Pronunciation#
The TTS pronunciation of the bot is configured in the cmudict_ipa.txt file. For more information on customizing TTS pronunciation, please refer to Customizing TTS Pronunciation using IPA.
Limitations of Tokkio LLM-RAG#
By default, the RAG pipeline is designed to answer questions specifically about the uploaded documents. This ensures that the avatar remains focused on the intended topic of discussion. While this behavior can be beneficial in maintaining relevance, it also means that the avatar will not engage in small talk or general discussions.
However, this behavior can be customized to suit different use cases. You can update the prompt used to query RAG or LLM to achieve the desired outcome.
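For example, one minimal sketch of such a prompt tweak, assuming your RAG or LLM endpoint accepts a system-style message (the default request built in rag.py does not include one), is to prepend it to the message list constructed in stream():
# Sketch only: prepend a system-style instruction to the messages built in stream().
# Verify that your RAG or LLM endpoint actually honors a "system" role message.
system_prompt = {
    "role": "system",
    "content": (
        "Answer questions using the uploaded documents when relevant, "
        "but also respond politely to greetings and brief small talk."
    ),
}
# chat_history and question come from the surrounding stream() code shown earlier.
request_json["messages"] = [system_prompt] + chat_history + [{"role": "user", "content": question}]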
API Reference#
See LLM RAG Plugin Server APIs for more details.
Tokkio Rendering Options#
Reference Helm charts are made available for various rendering options for the LLM-RAG bot.