RAG Chatbot
This guide will help jumpstart you in building RAG solutions with a Llama2 model deployed on TRT-LLM.
Please refer to the In-Depth Developer Guide for more detailed information pertaining to the technical components used within the AI workflow.
Before continuing with this guide, ensure the following prerequisites are met:
At least one NVIDIA GPU. For this guide, we used an A100 data center GPU.
NVIDIA driver version 535 or newer. To check the driver version, run:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
If you are running multiple GPUs, they must all be set to the same mode (i.e., Compute vs. Display). One way to check the current settings is shown below.
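As a quick sanity check, the following command (a minimal sketch using standard nvidia-smi query fields) reports the driver version and the current display and compute mode of each GPU, so you can confirm that all GPUs match:

# report driver version and per-GPU display/compute mode; all GPUs should report the same mode
nvidia-smi --query-gpu=index,name,driver_version,display_mode,compute_mode --format=csv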
Set up the following:
Docker and Docker-Compose
Please refer to installation instructions.
Note: Do not use the Docker version packaged with Ubuntu; a newer version of Docker is required for proper Docker Compose support.
Make sure your user account is able to execute Docker commands (a quick sanity check is sketched after this prerequisites list).
NVIDIA Container Toolkit
Please refer to installation instructions.
NGC Account and API Key
Please refer to instructions.
Llama2 Chat Model Weights
Download from Meta or from HuggingFace (a download sketch is shown after this prerequisites list).
Note: In this workflow, we will be leveraging a Llama2 (13B parameter) chat model, which requires 50 GB of GPU memory. If you prefer to leverage the 7B parameter model, it will require 38 GB of GPU memory. The 70B parameter model initially requires 240 GB of GPU memory.
IMPORTANT: For this initial version of the workflow, an A100 GPU is supported.
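Before moving on, it can help to verify that Docker, your user's Docker permissions, NVIDIA Container Toolkit, and NGC access are all working. The commands below are a minimal sketch; the CUDA base image tag is only an example and can be swapped for any image available to you:

# allow your user to run Docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER

# confirm Docker Compose v2 is available
docker compose version

# confirm containers can see the GPU via the NVIDIA Container Toolkit
# (the image tag below is only an example)
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# log in to NGC so images from the Enterprise Catalog can be pulled
# (use $oauthtoken as the username and your NGC API key as the password)
docker login nvcr.io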
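If you choose the HuggingFace route for the model weights, the following is a minimal sketch, assuming you have been granted access to the gated meta-llama repository and have the huggingface_hub CLI installed:

# log in with a HuggingFace access token that has been granted Llama2 access
huggingface-cli login

# download the 13B chat checkpoint to the directory referenced later in compose.env
huggingface-cli download meta-llama/Llama-2-13b-chat-hf --local-dir "$HOME/src/Llama-2-13b-chat-hf"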
Once the prerequisites above are met, you can run the AI workflow.
Step 1: Download the Docker Compose file
The Docker Compose files for this workflow have been published to NGC. They can be downloaded directly using your browser, or pulled with the NGC CLI using the following command:
ngc registry resource download-version "nvaie/rag-workflow-docker-compose:1"
To access the Docker Compose files, you must be logged into NGC and have access to the Enterprise Catalog.
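If the NGC CLI has not been configured on this machine yet, the following is a minimal sketch of setting it up with your API key before running the download command above:

# interactively store your NGC API key, org, and team
ngc config set

# then pull the Docker Compose files
ngc registry resource download-version "nvaie/rag-workflow-docker-compose:1"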
Step 2: Set Environment Variables
Modify compose.env in the root directory to set your environment variables. The following variables are required.

# full path to the local copy of the model weights
export MODEL_DIRECTORY="$HOME/src/Llama-2-13b-chat-hf"

# the architecture of the model. eg: llama
export MODEL_ARCHITECTURE="llama"

# the name of the model being used - only for displaying on frontend
export MODEL_NAME="llama-2-13b-chat"
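Before starting the containers, it can be worth confirming that MODEL_DIRECTORY actually points at the downloaded weights. This is only a sanity-check sketch, not part of the workflow itself:

# source the environment and list the model directory;
# you should see the tokenizer and weight files downloaded earlier
source compose.env
ls "$MODEL_DIRECTORY"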
Step 3: Build and Start Containers
Run the following command to build and start containers.
source compose.env; docker compose up -d
Note: It will take a few minutes for the containers to come up, and it may take up to 5 minutes for the Triton server to be ready. Adding the -d flag runs the services in the background.
Run docker ps -a. When the containers are ready, the output should look similar to the image below.
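If you would rather watch the startup from the terminal, the following is a minimal sketch; the readiness endpoint's port mapping is an assumption about this particular Compose file, so adjust it to match your docker ps output:

# follow the logs of all services until the LLM server reports it is ready
source compose.env; docker compose logs -f

# optionally, poll Triton's standard readiness endpoint
# (assumes the server's HTTP port is published on 8000)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready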
Step 4: Experiment with RAG in JupyterLab
This AI Workflow includes Jupyter notebooks which allow you to experiment with RAG.
Using a web browser, navigate to the following URL to open JupyterLab:
http://host-ip:8888
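If the workflow is running on a remote machine, one option (a sketch; the username and host are placeholders) is to forward the port over SSH and then browse to http://localhost:8888 locally:

# forward the remote JupyterLab port (8888) to your local machine
ssh -L 8888:localhost:8888 <user>@<host-ip>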
Locate the LLM Streaming Client notebook
01-llm-streaming-client.ipynb
which demonstrates how to stream responses from the LLM. Proceed with the next four notebooks:
Document Question-Answering with LangChain
02_langchain_simple.ipynb
Document Question-Answering with LlamaIndex
03_llama_index_simple.ipynb
Advanced Document Question-Answering with LlamaIndex
04_llamaindex_hier_node_parser.ipynb
Interact with REST FastAPI Server
05_dataloader.ipynb
Step 5: Run the Sample Web Application
A sample chatbot web application is provided in the workflow. Requests to the chat system are wrapped in FastAPI calls.
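For reference, the chat server can also be exercised directly with curl. The endpoint path, port, and payload fields below are hypothetical placeholders (check the 05_dataloader.ipynb notebook for the actual routes exposed by this workflow); this is only a sketch of the pattern:

# hypothetical example of posting a question to the FastAPI chat server;
# replace the port, route, and JSON fields with the ones shown in 05_dataloader.ipynb
curl -X POST "http://host-ip:8081/generate" \
     -H "Content-Type: application/json" \
     -d '{"question": "How many cores are on the Nvidia Grace superchip?", "use_knowledge_base": false}'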
Open the web application at
http://host-ip:8090
Type in the following question without using a knowledge base: “How many cores are on the Nvidia Grace superchip?”
Note that the chatbot mentions the chip doesn’t exist.
To use a knowledge base:
Click the Knowledge Base tab and upload the file nvlink.pdf.
Return to the Converse tab and check [X] Use knowledge base.
Retype the question: “How many cores are on the Nvidia Grace superchip?”
(Figures: uploading a file and selecting to use the knowledge base.)