Tokkio LLM-RAG#
Introduction#
The Tokkio LLM-RAG sample application is designed to facilitate live avatar interactions using popular Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) solutions. It integrates out of the box with platforms such as Nvidia NIM, OpenAI, and Nvidia Enterprise RAG, as well as the Nvidia Generative AI examples.
Key Features
Avatar Interaction: Tokkio manages the avatar interaction, while the LLM or RAG handles the conversational content. These components are distinct but communicate via a REST API.
Customization and Integration: Users can customize the application to meet specific requirements and connect it to third-party LLMs or RAGs. This flexibility is detailed in the “Customize the Pipeline” section.
Architectural Enhancements: The application showcases improvements such as easy switching between LLM models, custom moderation policies, and tuning of the plugin server to user needs. Users can integrate their own RAG with the Tokkio pipeline with little effort.
Benefits
Separation of Concerns: By separating avatar interaction from text generation, Tokkio allows for modular development and easier maintenance.
Ease of Use: The application supports straightforward integration and customization, enabling users to tailor the system to specific use cases without extensive technical overhead.
Advanced Capabilities: With features like gesture generation and response streaming, Tokkio enhances user interaction quality and system responsiveness.
Source#
The source for this resource can be found on NGC.
Implementation details#
The Tokkio LLM-RAG ACE bot contains the following files:
- tokkio_rag_bot_config - contains the ACE bot definition
- speech_config.yaml - contains the speech configuration parameters for the application
- plugin_config.yaml - contains the definition of the ACE plugin server configuration
- action.py - contains custom actions that the Colang code can call, like the plugin server
- plugin/rag.py - contains the definition of the Python client of the external RAG or LLM; the /chat endpoint is implemented in this file
- plugin/schemas.py - contains the schema definitions for the request/response schemas expected by the /chat endpoint (see the sketch below)
- model_config.yaml - the definition of the model to use; this can be overridden in the UCS app
- cmudict_ipa.txt - allows customizing the pronunciation of specific words
- asr_words_to_boost_conformer.txt - a default list of words for ASR word boosting

All the files in the colang folder define the flows for handling various queries in the LLM bot with multimodality.
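To make the role of plugin/schemas.py concrete, here is a minimal sketch of what request and response models for the /chat endpoint could look like. The field names and types below are illustrative assumptions for this sketch, not the actual Tokkio schema; consult plugin/schemas.py in the bot resource for the authoritative definitions.

from typing import Optional

from pydantic import BaseModel


class ChatRequest(BaseModel):
    # Illustrative request model; field names are assumptions for this
    # sketch, not the actual Tokkio schema.
    question: str                  # the user's transcribed utterance
    user_id: Optional[str] = None  # identifies the conversation/stream


class ChatResponse(BaseModel):
    # Illustrative response model; the reference app streams its answer
    # back, so the real schema may differ.
    answer: str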
RAG pipeline#
To use the Tokkio LLM-RAG reference app with a RAG pipeline, the user needs to deploy a RAG server, ingest the desired documents into the RAG, and configure Tokkio to use the RAG endpoint.
The RAG supported out of the box adheres to the schema from the Nvidia Generative AI examples.
Please refer to Nvidia Generative AI examples for information regarding the deployment and document ingestion.
Using a custom RAG pipeline#
To use a custom RAG pipeline, modify the /chat API of rag.py. The /chat endpoint receives the RAG requests and, in the reference application, communicates with the RAG pipeline via stream() to obtain a response. Ensure that the response is parsed according to the response schema of the custom RAG server.
The request to the RAG endpoint is populated in the private API stream() of rag.py. In the sample application, the request is populated with the following fields; update the request or response schemas as needed by the custom RAG pipeline:
request_json = {
"messages": chat_history + [{"role": "user", "content": question}],
"use_knowledge_base": True,
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": num_tokens,
"seed": 42,
"stop": STOP_WORDS,
"stream": True,
}
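As an illustration, the sketch below shows how the parsing logic in stream() could be adapted to a custom RAG server. Everything specific to the custom server is an assumption for this example: the endpoint URL, the use of aiohttp as the HTTP client, the newline-delimited JSON streaming format, and the choices[0].message.content response field.

import json

import aiohttp  # assumed HTTP client for this sketch

# Hypothetical endpoint of the custom RAG server
CUSTOM_RAG_URL = "http://localhost:8081/generate"


async def stream_custom_rag(request_json: dict):
    """Yield answer chunks from a hypothetical custom RAG server.

    Assumes the server streams newline-delimited JSON objects shaped like
    {"choices": [{"message": {"content": "..."}}]}; adapt the parsing to
    the actual response schema of your RAG pipeline.
    """
    async with aiohttp.ClientSession() as session:
        async with session.post(CUSTOM_RAG_URL, json=request_json) as resp:
            resp.raise_for_status()
            # aiohttp exposes the response body as a line iterator
            async for raw_line in resp.content:
                line = raw_line.decode("utf-8").strip()
                if not line:
                    continue
                if line.startswith("data: "):  # strip optional SSE prefix
                    line = line[len("data: "):]
                chunk = json.loads(line)
                content = chunk["choices"][0]["message"]["content"]
                if content:
                    yield content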
Adding filler sentences to bot response#
Filler words or sentences can be added to a bot workflow in order to mask the latency of fetching the response. This can be done by updating the Colang files to add the flow demonstrated below, which lets the bot say one of the filler sentences before replying with the response from the API (RAG/LLM pipeline). If the RAG response takes longer than the filler sentence, there will still be some silence after the filler finishes. If the RAG response arrives before the filler sentence has finished, it is only spoken once the filler sentence completes.
# Call this flow before fetching a response expected to have high latency
start bot say filler sentence as $filler
# <RAG or LLM call that is expected to be of high latency>
match $filler.Finished()
# Sample implementation of filler sentence
flow bot say filler sentence
bot say "Let me think about it"
or bot say "That's a great question! Let me find the data for you"
or bot say "Hmmm"
Changing the avatar name and greeting#
The name and greetings for the bot are configured in colang/events.co of the plugin source. You can customize these flows to use a greeting or name of your choice. Note that in the default implementation, these flows are triggered when the user enters or leaves the camera view.
@meta(bot_intent=True)
flow bot express greeting
(bot express "Hi, I am Taylor. How can I help you?"
or bot express "Welcome! My name is Taylor. Ask away!"
or bot express "Hello! How can I help you today?")
and bot gesture "Wave with one hand"
@meta(bot_intent=True)
flow bot express goodbye
(bot say "Bye" or bot say "Goodbye" or bot say "Until next time!") and bot gesture "wave"
Updating bot gestures#
You can generate gestures matching the responses spoken by the bot. A Colang flow enabling this is already provided, but it is not activated by default. Similarly, other flows that generate gestures can be added to the Colang files.
@loop("generate_gestures")
flow attempt to infer gesture $question $response
Interruptions#
The bot supports being interrupted by the user. This is especially useful in a RAG context where responses to certain queries can be long. The bot supports two types of interruptions:
- The user can interrupt the avatar's response by saying “stop”, “stop talking”, or “cancel”. This stops the RAG response and acknowledges the user’s request. Note that these short requests are identified with a regular expression that you can update in the colang/main.co file, as illustrated below.
- The avatar can also be stopped by asking a new question. In that case, the current RAG response is stopped and the RAG is queried with the new user request.
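For illustration only, such a pattern could look like the following (shown here in Python syntax; the actual regular expression lives in colang/main.co and may differ):

import re

# Hypothetical pattern for the short stop requests; the actual expression
# in colang/main.co may differ.
STOP_PATTERN = re.compile(r"^\s*(stop( talking)?|cancel)[.!]?\s*$", re.IGNORECASE)

assert STOP_PATTERN.match("Stop talking")
assert not STOP_PATTERN.match("stop by the store tomorrow")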
Small Talk#
Small talk question-answer pairs can be added to the Colang script. Users can add their own small-talk examples in a dedicated Colang file. See the Colang Language Reference for more details on how to handle user intents.
Proactive bot#
By default, the bot is configured to be proactive if the user does not respond within a certain time. In that case, the RAG is queried to provide an encouraging response that re-engages the user with the interaction. This behavior can be customized in the colang/main.co file:
orwhen user didnt respond 20.0  # Change what happens if the user does not respond
Limitations#
By default, the RAG pipeline is designed to answer questions specifically about the uploaded documents. This ensures that the avatar remains focused on the intended topic of discussion. While this behavior can be beneficial in maintaining relevance, it also means that the avatar will not engage in small talk or general discussions.
However, this behavior can be customized to suit different use cases. You can update the prompt used to query the RAG or LLM to achieve the desired outcome.
Publish the bot#
Once the ACE bot is customized, it can be pushed to NGC using the following command, where BOT_FOLDER_NAME refers to the location of the folder to be uploaded to NGC:
$ ngc registry resource upload-version --source BOT_FOLDER_NAME targeted_ngc_path:version
The published bot can then be used for customized application deployment; see Plugin Resource Customization for details.
API Reference#
See LLM RAG Plugin Server APIs for more details.
Tokkio LLM-RAG rendering options#
Reference Helm charts are made available for the various rendering options of the LLM-RAG bot:
- Tokkio LLM-RAG - Omniverse Renderer
- Tokkio LLM-RAG - Unreal Engine
- Tokkio LLM-RAG - A2F-2D