RAG Bot
This is an example chatbot that showcases retrieval-augmented generation (RAG). The bot interacts with the RAG chain server's /generate endpoint through the Plugin server to answer questions based on the ingested documents.
The RAG bot showcases the following ACE Agent features:
- Integrating a RAG example from NVIDIA's Generative AI Examples
- Low latency using two-pass ASR End of Utterance (EOU)
- Always-on Barge-In support
- Handling conversation history in plugins
- Streaming the JSON response from the Plugin server (both plugin patterns are sketched after this list)
- Support for deployment using the Event Architecture
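The last two plugin features can be illustrated with a short sketch. This is a minimal, hypothetical example assuming a FastAPI-style Plugin server; the route, session handling, and JSON chunk fields are illustrative, not the actual ACE Agent plugin API (see `./samples/rag_bot/plugins/rag.py` for the real implementation).

```python
# Hypothetical sketch: per-session conversation history plus a streamed JSON
# response. Route and field names are illustrative, not ACE Agent's API.
import json
from collections import defaultdict

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# Maps a session ID to its list of chat messages.
chat_history: dict[str, list[dict]] = defaultdict(list)

@app.post("/chat")
async def chat(session_id: str, question: str) -> StreamingResponse:
    history = chat_history[session_id]
    history.append({"role": "user", "content": question})

    async def token_stream():
        answer = []
        # Placeholder tokens; a real plugin would relay chunks from the
        # RAG chain server's /generate endpoint here.
        for token in ["Paris ", "is ", "the ", "capital ", "of ", "France."]:
            answer.append(token)
            yield json.dumps({"Response": {"Text": token, "IsFinal": False}}) + "\n"
        yield json.dumps({"Response": {"Text": "", "IsFinal": True}}) + "\n"
        history.append({"role": "assistant", "content": "".join(answer)})

    return StreamingResponse(token_stream(), media_type="application/json")
```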
Check the tutorial Building a Low Latency Speech RAG Bot for an in-depth understanding of the bot implementation.
The RAG sample bot is present in the quickstart directory at ./samples/rag_bot/.
RAG Chain server deployment
This sample bot requires deploying one of the RAG examples from NVIDIA's Generative AI Examples as a prerequisite. If you want to use a custom RAG solution, you will need to update `./samples/rag_bot/plugins/rag.py`. You can use any one of the following three options to try out the sample bot.
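For orientation, here is a minimal sketch of the kind of request the plugin sends to the chain server. It assumes the /generate schema used by the Generative AI Examples chain server (a `messages` list plus a `use_knowledge_base` flag, streamed back as `data:` lines of JSON); verify the request and response fields against your RAG solution before adapting it.

```python
# Hedged sketch of calling the RAG chain server's /generate endpoint.
# The payload fields and streamed chunk layout are assumptions based on the
# Generative AI Examples chain server; confirm them for your deployment.
import json
import requests

RAG_SERVER_URL = "http://localhost:8081"  # adjust to where your chain server runs

def generate_answer(question: str, history: list[dict]) -> str:
    payload = {
        "messages": history + [{"role": "user", "content": question}],
        "use_knowledge_base": True,
    }
    chunks = []
    with requests.post(
        f"{RAG_SERVER_URL}/generate", json=payload, stream=True, timeout=60
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # Each streamed line looks like:
            # data: {"choices": [{"message": {"content": "..."}}]}
            if line.startswith("data: ") and line != "data: [DONE]":
                data = json.loads(line[len("data: "):])
                chunks.append(data["choices"][0]["message"]["content"])
    return "".join(chunks)
```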
Option 1: Deploy RAG with NIM-hosted `meta/llama3-8b-instruct` LLM and embedding model endpoints.

1. Get your NVIDIA API key.
   1. Go to the NVIDIA API Catalog.
   2. Select any model.
   3. Click Get API Key.

2. Export the required environment variables.

   ```bash
   export DOCKER_VOLUME_DIRECTORY=$(pwd)
   export MODEL_DIRECTORY=$(pwd)
   export NVIDIA_API_KEY=...
   ```

3. Update `samples/rag_bot/RAG/basic_rag/docker-compose.yaml` to not use local LLM and embedding endpoints. To use NIM-hosted models, set `APP_LLM_SERVERURL` and `APP_EMBEDDINGS_SERVERURL` to a blank string in the `chain-server` service environment variables.

   ```yaml
   APP_LLM_SERVERURL: ""
   APP_EMBEDDINGS_SERVERURL: ""
   ```

4. Deploy the `rag-application` and `vector-db` containers.

   ```bash
   docker compose -f samples/rag_bot/RAG/basic_rag/docker-compose.yaml up -d --build
   ```
Option 2: Deploy RAG with locally hosted `meta/llama3-8b-instruct` LLM and embedding model endpoints. Local deployment of the LLM and embedding models requires a separate A100 or H100 GPU.

1. Export the required environment variables.

   ```bash
   export DOCKER_VOLUME_DIRECTORY=$(pwd)
   export MODEL_DIRECTORY=$(pwd)
   export NGC_CLI_API_KEY=...
   ```

2. Update `samples/rag_bot/RAG/basic_rag/docker-compose.yaml` to use the locally deployed LLM and embedding endpoints. Set `APP_LLM_SERVERURL` and `APP_EMBEDDINGS_SERVERURL` to `nemollm-inference:8000` and `nemollm-embedding:8000`, respectively, in the `chain-server` service environment variables.

   ```yaml
   APP_LLM_SERVERURL: "nemollm-inference:8000"
   APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8000"
   ```

3. Deploy the `rag-application` and `vector-db` containers along with the NIM LLM and embedding microservices.

   ```bash
   docker compose -f samples/rag_bot/RAG/basic_rag/docker-compose.yaml --profile local-nim --profile milvus up -d
   ```
Option 3: Deploy RAG with the NIM-hosted small language model `nvidia/nemotron-mini-4b-instruct` and embedding model endpoints.

1. Get your NVIDIA API key.
   1. Go to the NVIDIA API Catalog.
   2. Select any model.
   3. Click Get API Key.

2. Export the required environment variables.

   ```bash
   export DOCKER_VOLUME_DIRECTORY=$(pwd)
   export NVIDIA_API_KEY=...
   ```

3. Deploy the `rag-application` and `vector-db` containers.

   ```bash
   docker compose -f samples/rag_bot/RAG/slm_rag/docker-compose.yaml up -d --build
   ```
For more information about the deployment steps, refer to NVIDIA's Generative AI Examples.
Ingest documents as required for your use case by visiting http://<your-ip>:3001/kb.
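If you prefer scripting ingestion over the web UI, the chain server also accepts uploads over HTTP. The sketch below is hedged: it assumes a multipart upload endpoint at /documents on the chain server's port (8081 by default in Generative AI Examples), and the file name is hypothetical; confirm the route for your deployment.

```python
# Hedged sketch: upload a document to the chain server for ingestion.
# Endpoint path, port, and file name are assumptions; verify them first.
import requests

CHAIN_SERVER = "http://localhost:8081"  # chain server, not the :3001 web UI

def ingest(file_path: str) -> None:
    with open(file_path, "rb") as f:
        resp = requests.post(
            f"{CHAIN_SERVER}/documents", files={"file": f}, timeout=120
        )
    resp.raise_for_status()
    print(f"Ingested {file_path} (HTTP {resp.status_code})")

ingest("product_manual.pdf")  # hypothetical document
```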
Refer to the Troubleshooting section if you encounter any issues or errors.
Docker-based bot deployment
1. Prepare the environment for the Docker compose commands.

   ```bash
   export BOT_PATH=./samples/rag_bot/
   source deploy/docker/docker_init.sh
   ```
2. Deploy the Speech models.

   ```bash
   docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
   ```
3. If you have deployed the RAG Chain Server on a different machine, update `RAG_SERVER_URL` in `./samples/rag_bot/plugin_config.yaml`.

4. Deploy the ACE Agent microservices: the Chat Controller, Chat Engine, Plugin server, and WebUI.

   ```bash
   docker compose -f deploy/docker/docker-compose.yml up speech-event-bot -d
   ```
5. Wait a few minutes for all services to be ready. You can check the Docker logs for individual microservices to confirm; the Chat Controller container logs `Server listening on 0.0.0.0:50055` when it is ready.

6. Interact with the bot at `http://<workstation IP>:7006/`. To access the microphone in the browser, either convert the `http` endpoint to `https` by adding SSL validation, or update `chrome://flags/` or `edge://flags/` to allow `http://<workstation IP>:7006` as a secure endpoint.
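If you want to script the readiness check instead of tailing logs, a small poll of the Chat Controller's port works. This is an optional, hedged helper: it only checks that the gRPC port accepts TCP connections, and the host and port are the defaults mentioned above.

```python
# Optional helper: wait until the Chat Controller's port (50055 by default)
# accepts TCP connections, instead of watching Docker logs by hand.
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 300.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(5)
    return False

if wait_for_port("localhost", 50055):
    print("Chat Controller is listening; open the WebUI on port 7006.")
else:
    print("Timed out; check the container logs.")
```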
Note
ACE Agent sends an early trigger to the RAG server and might need to retrigger if the user pauses for more than 240 ms between words. On average, this results in about two extra RAG calls per user query, which adds compute cost when deploying at scale.