RAG Bot
This is an example chatbot that showcases retrieval augmented generation (RAG). The bot answers questions based on the ingested documents by interacting with the RAG chain server's /generate endpoint through the Plugin server.
The RAG bot showcases the following ACE Agent features:

- Integrating a RAG example from NVIDIA's Generative AI Examples
- Low latency using ASR two-pass End of Utterance (EOU)
- Always-on barge-in support
- Handling conversation history in plugins
- Streaming the JSON response from the Plugin server
- Support for deployment using the Event Architecture
Check the tutorial Building a Low Latency Speech RAG Bot for an in-depth understanding of the bot implementation.
The RAG sample bot is present in the quickstart directory at ./samples/rag_bot/.
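To make the bot's core interaction concrete, the following is a minimal sketch of a handler that forwards a user query, along with the conversation history, to the RAG chain server's /generate endpoint and streams the answer back. The request and response shapes follow the OpenAI-style schema used by the Generative AI Examples chain server; the function name, base URL, and field handling here are illustrative assumptions, not the exact code in ./samples/rag_bot/plugins/rag.py.

```python
# Hypothetical sketch of a Plugin server handler that proxies a user query to
# the RAG chain server's /generate endpoint and streams the answer back.
# The request/response schema follows the OpenAI-style API used by the
# GenerativeAIExamples chain server; verify it against ./samples/rag_bot/plugins/rag.py.
import json
import requests

RAG_SERVER_URL = "http://localhost:8081"  # chain server base URL (assumption)


def stream_rag_answer(question: str, chat_history: list):
    """Yield answer fragments for `question`, passing prior turns as context."""
    payload = {
        # conversation history plus the current user turn, OpenAI-style
        "messages": chat_history + [{"role": "user", "content": question}],
        "use_knowledge_base": True,  # ground the answer in the ingested documents
        "stream": True,
    }
    with requests.post(f"{RAG_SERVER_URL}/generate", json=payload,
                       stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # the server streams server-sent events of the form "data: {...}"
            if not line or not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data.strip() == "[DONE]":
                break
            chunk = json.loads(data)
            choices = chunk.get("choices", [])
            if choices:
                yield choices[0].get("message", {}).get("content", "")
```

Yielding fragments as they arrive, rather than waiting for the complete answer, is what the "Streaming the JSON response" feature above refers to; see the tutorial for the real plugin implementation.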
RAG Chain server deployment
This sample bot requires deployment of one of the RAG examples from NVIDIA's Generative AI Examples as a prerequisite. If you want to use a custom RAG solution, you will need to update ./samples/rag_bot/plugins/rag.py (the endpoint URL and request/response handling sketched above are what you would adapt). You can use any one of the following three options to try out the sample bot.
Option 1: Deploy RAG with NIM-hosted meta/llama3-8b-instruct LLM and embedding model endpoints.

Get your NVIDIA API key:

1. Go to the NVIDIA API Catalog.
2. Select any model.
3. Click Get API Key.

Export the required environment variables:

```
export DOCKER_VOLUME_DIRECTORY=$(pwd)
export MODEL_DIRECTORY=$(pwd)
export NVIDIA_API_KEY=...
```
Update samples/rag_bot/RAG/basic_rag/docker-compose.yaml to not use local LLM and embedding endpoints. To use NIM-hosted models, set APP_LLM_SERVERURL and APP_EMBEDDINGS_SERVERURL to a blank string in the chain-server service environment variables:

```
APP_LLM_SERVERURL: ""
APP_EMBEDDINGS_SERVERURL: ""
```
Deploy the rag-application and vector-db containers:

```
docker compose -f samples/rag_bot/RAG/basic_rag/docker-compose.yaml up -d --build
```
Option 2: Deploy RAG with locally hosted meta/llama3-8b-instruct LLM and embedding model endpoints. Local deployment of the LLM and embedding models requires a separate A100 or H100 GPU.
Export the required environment variables:

```
export DOCKER_VOLUME_DIRECTORY=$(pwd)
export MODEL_DIRECTORY=$(pwd)
export NGC_CLI_API_KEY=...
```
Update samples/rag_bot/RAG/basic_rag/docker-compose.yaml to use the locally hosted LLM and embedding endpoints. Set APP_LLM_SERVERURL and APP_EMBEDDINGS_SERVERURL to nemollm-inference:8000 and nemollm-embedding:8000, respectively, in the chain-server service environment variables:

```
APP_LLM_SERVERURL: "nemollm-inference:8000"
APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8000"
```
Deploy the rag-application and vector-db containers along with the NIM LLM and embedding microservices:

```
docker compose -f samples/rag_bot/RAG/basic_rag/docker-compose.yaml --profile local-nim --profile milvus up -d
```
Option 3: Deploy RAG with the NIM-hosted small language model nvidia/nemotron-mini-4b-instruct and embedding model endpoints.

Get your NVIDIA API key:

1. Go to the NVIDIA API Catalog.
2. Select any model.
3. Click Get API Key.

Export the required environment variables:

```
export DOCKER_VOLUME_DIRECTORY=$(pwd)
export NVIDIA_API_KEY=...
```
Deploy the rag-application and vector-db containers:

```
docker compose -f samples/rag_bot/RAG/slm_rag/docker-compose.yaml up -d --build
```
For more information about the deployment steps, refer to GenerativeAIExamples.
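Before ingesting documents, it can help to confirm that the chain server is reachable. The following is a minimal sketch, assuming the chain server listens on port 8081 and exposes a /health endpoint as in the Generative AI Examples chain server; verify both assumptions for your deployment.

```python
import requests

# Assumption: GenerativeAIExamples chain server on port 8081 with a /health endpoint.
resp = requests.get("http://localhost:8081/health", timeout=10)
resp.raise_for_status()
print("Chain server is up:", resp.status_code)
```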
Ingest documents as required for your use case by visiting http://<your-ip>:3001/kb.
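If you prefer to script ingestion rather than use the web UI, the Generative AI Examples chain server also accepts document uploads over its API. A sketch, assuming a multipart POST /documents endpoint on port 8081; confirm the endpoint and port against your chain server's API documentation.

```python
import requests

# Assumption: chain server exposes POST /documents for multipart file upload.
with open("my_document.pdf", "rb") as f:
    resp = requests.post("http://localhost:8081/documents",
                         files={"file": f}, timeout=120)
resp.raise_for_status()
```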
Refer to the Troubleshooting section if you encounter any issues or errors.
Docker-based bot deployment
Prepare the environment for the Docker compose commands:

```
export BOT_PATH=./samples/rag_bot/
source deploy/docker/docker_init.sh
```
Deploy the Speech models:

```
docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
```
If you have deployed the RAG Chain Server on a different machine, update RAG_SERVER_URL in ./samples/rag_bot/plugin_config.yaml.

Deploy the ACE Agent microservices (Chat Controller, Chat Engine, Plugin server, and WebUI):

```
docker compose -f deploy/docker/docker-compose.yml up speech-event-bot -d
```
Wait a few minutes for all services to become ready. You can check the Docker logs of the individual microservices to confirm; the Chat Controller container's logs will show Server listening on 0.0.0.0:50055 once it is ready.

Interact with the bot using the URL http://<workstation IP>:7006/.

To access the microphone in the browser, you must either convert the http endpoint to https by adding SSL validation, or update chrome://flags/ or edge://flags/ to allow http://<workstation IP>:7006 as a secure endpoint.
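If you prefer to script the readiness check rather than tail logs manually, you can poll the Chat Controller's gRPC port. A minimal sketch; the port comes from the log line above, while the host name is an assumption for a single-machine deployment.

```python
import socket
import time

# Poll until the Chat Controller's gRPC port (50055, per the log line above)
# accepts connections; "localhost" assumes a single-machine deployment.
while True:
    try:
        with socket.create_connection(("localhost", 50055), timeout=2):
            print("Chat Controller is listening on 50055")
            break
    except OSError:
        time.sleep(5)
```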
Note
The bot sends an early trigger to the RAG server and might retrigger if the user pauses for more than 240 ms between words. On average, this can add about two extra RAG calls per user query, which increases the compute cost of deploying at scale.