RAG Bot
This is an example chatbot that showcases retrieval-augmented generation (RAG). The bot interacts with the RAG chain server's /generate endpoint through the Plugin server to answer questions based on the ingested documents.
The RAG bot showcases the following ACE Agent features:
- Integrating a RAG example from NVIDIA's Generative AI Examples
- Low latency using two-pass ASR End of Utterance (EOU)
- Always-on Barge-In support
- Handling conversation history in plugins
- Streaming the JSON response from the Plugin server (both plugin patterns are sketched after this list)
- Support for deployment using the Event Architecture
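The last two plugin features can be illustrated with a short sketch. This is a minimal, hypothetical example assuming a FastAPI-style Plugin server; the route, session handling, and JSON chunk fields are illustrative, not the actual ACE Agent plugin API (see `./samples/rag_bot/plugins/rag.py` for the real implementation).

```python
# Hypothetical sketch: per-session conversation history plus a streamed JSON
# response. Route and field names are illustrative, not ACE Agent's API.
import json
from collections import defaultdict

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# Maps a session ID to its list of chat messages.
chat_history: dict[str, list[dict]] = defaultdict(list)

@app.post("/chat")
async def chat(session_id: str, question: str) -> StreamingResponse:
    history = chat_history[session_id]
    history.append({"role": "user", "content": question})

    async def token_stream():
        answer = []
        # Placeholder tokens; a real plugin would relay chunks from the
        # RAG chain server's /generate endpoint here.
        for token in ["Paris ", "is ", "the ", "capital ", "of ", "France."]:
            answer.append(token)
            yield json.dumps({"Response": {"Text": token, "IsFinal": False}}) + "\n"
        yield json.dumps({"Response": {"Text": "", "IsFinal": True}}) + "\n"
        history.append({"role": "assistant", "content": "".join(answer)})

    return StreamingResponse(token_stream(), media_type="application/json")
```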
Check the tutorial Building a Low Latency Speech RAG Bot for an in-depth understanding of the bot implementation.
The RAG sample bot is present in the quickstart directory at ./samples/rag_bot/.
RAG Chain server deployment
This sample bot requires deploying one of the RAG examples from NVIDIA's Generative AI Examples as a prerequisite. If you want to use a custom RAG solution, you will need to update `./samples/rag_bot/plugins/rag.py`. You can use any one of the following three options to try out the sample bot.
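For orientation, here is a minimal sketch of the kind of request the plugin sends to the chain server. It assumes the /generate schema used by the Generative AI Examples chain server (a `messages` list plus a `use_knowledge_base` flag, streamed back as `data:` lines of JSON); verify the request and response fields against your RAG solution before adapting it.

```python
# Hedged sketch of calling the RAG chain server's /generate endpoint.
# The payload fields and streamed chunk layout are assumptions based on the
# Generative AI Examples chain server; confirm them for your deployment.
import json
import requests

RAG_SERVER_URL = "http://localhost:8081"  # adjust to where your chain server runs

def generate_answer(question: str, history: list[dict]) -> str:
    payload = {
        "messages": history + [{"role": "user", "content": question}],
        "use_knowledge_base": True,
    }
    chunks = []
    with requests.post(
        f"{RAG_SERVER_URL}/generate", json=payload, stream=True, timeout=60
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # Each streamed line looks like:
            # data: {"choices": [{"message": {"content": "..."}}]}
            if line.startswith("data: ") and line != "data: [DONE]":
                data = json.loads(line[len("data: "):])
                chunks.append(data["choices"][0]["message"]["content"])
    return "".join(chunks)
```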
Option 1: Deploy RAG with NIM-hosted `meta/llama3-8b-instruct` LLM and embedding model endpoints.

1. Get your NVIDIA API key.
   1. Go to the NVIDIA API Catalog.
   2. Select any model.
   3. Click Get API Key.

2. Export the required environment variables.

   ```bash
   export DOCKER_VOLUME_DIRECTORY=$(pwd)
   export MODEL_DIRECTORY=$(pwd)
   export NVIDIA_API_KEY=...
   ```

3. Update `samples/rag_bot/RAG/basic_rag/docker-compose.yaml` to not use local LLM and embedding endpoints. To use NIM-hosted models, set `APP_LLM_SERVERURL` and `APP_EMBEDDINGS_SERVERURL` to a blank string in the `chain-server` service environment variables.

   ```yaml
   APP_LLM_SERVERURL: ""
   APP_EMBEDDINGS_SERVERURL: ""
   ```

4. Deploy the `rag-application` and `vector-db` containers.

   ```bash
   docker compose -f samples/rag_bot/RAG/basic_rag/docker-compose.yaml up -d --build
   ```
Option 2: Deploy RAG with locally hosted `meta/llama3-8b-instruct` LLM and embedding model endpoints. Local deployment of the LLM and embedding models requires a separate A100 or H100 GPU.

1. Export the required environment variables.

   ```bash
   export DOCKER_VOLUME_DIRECTORY=$(pwd)
   export MODEL_DIRECTORY=$(pwd)
   export NGC_CLI_API_KEY=...
   ```

2. Update `samples/rag_bot/RAG/basic_rag/docker-compose.yaml` to use the locally deployed LLM and embedding endpoints. Set `APP_LLM_SERVERURL` and `APP_EMBEDDINGS_SERVERURL` to `nemollm-inference:8000` and `nemollm-embedding:8000`, respectively, in the `chain-server` service environment variables.

   ```yaml
   APP_LLM_SERVERURL: "nemollm-inference:8000"
   APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8000"
   ```

3. Deploy the `rag-application` and `vector-db` containers along with the NIM LLM and embedding microservices.

   ```bash
   docker compose -f samples/rag_bot/RAG/basic_rag/docker-compose.yaml --profile local-nim --profile milvus up -d
   ```
Option 3: Deploy RAG with the NIM-hosted small language model `nvidia/nemotron-mini-4b-instruct` and embedding model endpoints.

1. Get your NVIDIA API key.
   1. Go to the NVIDIA API Catalog.
   2. Select any model.
   3. Click Get API Key.

2. Export the required environment variables.

   ```bash
   export DOCKER_VOLUME_DIRECTORY=$(pwd)
   export NVIDIA_API_KEY=...
   ```

3. Deploy the `rag-application` and `vector-db` containers.

   ```bash
   docker compose -f samples/rag_bot/RAG/slm_rag/docker-compose.yaml up -d --build
   ```
For more information about the deployment steps, refer to NVIDIA's Generative AI Examples.
Ingest documents as required for your use case by visiting http://<your-ip>:3001/kb.
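If you prefer scripting ingestion over the web UI, the chain server also accepts uploads over HTTP. The sketch below is hedged: it assumes a multipart upload endpoint at /documents on the chain server's port (8081 by default in Generative AI Examples), and the file name is hypothetical; confirm the route for your deployment.

```python
# Hedged sketch: upload a document to the chain server for ingestion.
# Endpoint path, port, and file name are assumptions; verify them first.
import requests

CHAIN_SERVER = "http://localhost:8081"  # chain server, not the :3001 web UI

def ingest(file_path: str) -> None:
    with open(file_path, "rb") as f:
        resp = requests.post(
            f"{CHAIN_SERVER}/documents", files={"file": f}, timeout=120
        )
    resp.raise_for_status()
    print(f"Ingested {file_path} (HTTP {resp.status_code})")

ingest("product_manual.pdf")  # hypothetical document
```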
Refer to the Troubleshooting section if you encounter any issues or errors.
Docker-based bot deployment
1. Prepare the environment for the Docker compose commands.

   ```bash
   export BOT_PATH=./samples/rag_bot/
   source deploy/docker/docker_init.sh
   ```
2. Deploy the Speech models.

   ```bash
   docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
   ```
3. If you have deployed the RAG Chain Server on a different machine, update `RAG_SERVER_URL` in `./samples/rag_bot/plugin_config.yaml`.

4. Deploy the ACE Agent microservices: the Chat Controller, Chat Engine, Plugin server, and WebUI.

   ```bash
   docker compose -f deploy/docker/docker-compose.yml up speech-event-bot -d
   ```
5. Wait a few minutes for all services to be ready. You can check the Docker logs for individual microservices to confirm; the Chat Controller container logs `Server listening on 0.0.0.0:50055` when it is ready.

6. Interact with the bot at `http://<workstation IP>:7006/`. To access the microphone in the browser, either convert the `http` endpoint to `https` by adding SSL validation, or update `chrome://flags/` or `edge://flags/` to allow `http://<workstation IP>:7006` as a secure endpoint.
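If you want to script the readiness check instead of tailing logs, a small poll of the Chat Controller's port works. This is an optional, hedged helper: it only checks that the gRPC port accepts TCP connections, and the host and port are the defaults mentioned above.

```python
# Optional helper: wait until the Chat Controller's port (50055 by default)
# accepts TCP connections, instead of watching Docker logs by hand.
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 300.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(5)
    return False

if wait_for_port("localhost", 50055):
    print("Chat Controller is listening; open the WebUI on port 7006.")
else:
    print("Timed out; check the container logs.")
```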
Note
ACE Agent sends an early trigger to the RAG server and might need to retrigger if the user pauses for more than 240 ms between words. On average, this results in about two extra RAG calls per user query, which adds compute cost when deploying at scale.