RAG Chatbot
This document provides more detail about the components used to build RAG solutions with a Llama2 model deployed on TRT-LLM. Use this guide to understand how the components from the quickstart work together, and to gain a deeper understanding of how you can adapt the pipeline and its components to your own enterprise chatbot needs.
Before continuing with this guide, ensure the following prerequisites are met:
At least one NVIDIA GPU. For this guide, we used an A100 data center GPU.
NVIDIA driver version 535 or newer. To check the driver version, run:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
If you are running multiple GPUs, they must all be set to the same mode (i.e., Compute vs. Display).
Set up the following:
Docker and Docker-Compose
Please refer to installation instructions.
Note: Do not use the Docker packaged with Ubuntu, as a newer version of Docker is required for proper Docker Compose support.
Make sure your user account is able to execute Docker commands.
NVIDIA Container Toolkit
Please refer to installation instructions.
NGC Account and API Key
Please refer to instructions.
Llama2 Chat Model Weights
Download from Meta or from HuggingFace.
Note: In this workflow, we will be leveraging a Llama2 (13B parameters) chat model, which requires 54 GB of GPU memory. If you prefer to leverage the 7B parameter model, it requires 38 GB of memory. The 70B parameter model initially requires 240 GB of memory.
IMPORTANT: For this initial version of the workflow, an A100 GPU is supported.
This reference workflow uses a variety of components and services to customize and deploy the RAG-based chatbot. The sections below describe how they work together.
Triton Model Server
The Triton Inference Server uses models stored in a locally available model repository to serve inference requests. Once the models are available in Triton, inference requests are sent from a client application. Python and C++ client libraries provide APIs to simplify communication; clients can also send requests directly to Triton using the HTTP/REST or gRPC protocols.
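For example, the Triton Python client library can be used to confirm that the server and a deployed model are ready before sending requests. This is a minimal sketch; the URL assumes Triton's default gRPC port (8001), and the model name matches the ensemble model referenced later in this guide:

import tritonclient.grpc as grpcclient

# Connect to Triton's gRPC endpoint (default port 8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Verify the server is up and the ensemble model is loaded.
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("ensemble"))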
Within this workflow, the Llama2 LLM was optimized using NVIDIA TensorRT-LLM (TRT-LLM), which accelerates and maximizes inference performance on the latest LLMs. This step is automated in the workflow with the container command shown below.
To convert Llama2 to TensorRT and host it on Triton Inference Server, modify and run the command below:
docker run --net host --ipc host --gpus 1 -v /path/to/llama/weights/llama-2-13b-chat/:/model --rm nvcr.io/nvaie/genai-model-server:latest llama
Be sure to update the values in the -v flag so that your mount point is set appropriately: the Llama2 weights path ends in a pattern like /llama-2-13b-chat, and you must mount the weights to /model.
Vector DB
When content from the knowledge base is passed to an embedding model (e5-large-v2), it converts the content to vectors (referred to as “embeddings”). These embeddings are stored in a vector database. The vector DB used in this workflow is Milvus, an open-source vector database capable of NVIDIA GPU-accelerated vector search.
If needed, see Milvus’s documentation for how a Docker Compose file can be configured for Milvus.
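As an illustration, the snippet below embeds a query with e5-large-v2 and searches Milvus for the nearest stored vectors. This is a minimal sketch, not the workflow's actual ingestion code: the collection name ("documents"), field names ("embedding", "text"), and distance metric are assumptions, and e5-large-v2 is loaded here via sentence-transformers under its Hugging Face id intfloat/e5-large-v2:

from pymilvus import connections, Collection
from sentence_transformers import SentenceTransformer

# Connect to the Milvus instance (default port 19530).
connections.connect(host="localhost", port="19530")

# e5-large-v2 expects a "query: " prefix on queries ("passage: " on documents).
model = SentenceTransformer("intfloat/e5-large-v2")
query_vec = model.encode(["query: How is Llama2 deployed on Triton?"])

# Collection and field names are assumptions; the real schema is created
# by the workflow when documents are ingested.
collection = Collection("documents")
collection.load()
hits = collection.search(
    data=query_vec.tolist(),
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=4,
    output_fields=["text"],
)
for hit in hits[0]:
    print(hit.distance, hit.entity.get("text"))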
API Server
A sample chatbot web application is provided in the workflow so that you can test the chat system in an interactive manner. Requests to the chat system are wrapped in API calls, so these can be abstracted to other applications.
This API endpoint allows for several actions:
Uploading file(s)
Answer generation
Document search
If the API endpoint needs to be stood up manually, run the command below:
docker run -d \
--name query-router \
-p 8081:8081 \
--expose 8081 \
--shm-size 5g \
-e APP_MILVUS_URL="http://milvus:19530" \
-e APP_TRITON_SERVERURL="triton:8001" \
-e APP_TRITON_MODELNAME=ensemble \
nvcr.io/nvaie/chain-server:latest \
--port 8081 --host 0.0.0.0
The API server's documentation can be viewed at host-ip:8081/docs. The following sections describe the API endpoint actions further.
Upload File Endpoint
Summary: Upload a file.
Endpoint: /uploadDocument
HTTP Method: POST
Request:
Content-Type: multipart/form-data
Schema:
Body_upload_file_uploadDocument_post
Required: Yes
Request Body Parameters:
file (Type: File) - The file to be uploaded.
Responses:
200 - Successful Response
Description: The file was successfully uploaded.
Response Body: Empty
422 - Validation Error
Description: There was a validation error with the request.
Response Body: Details of the validation error.
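As an example, a file can be uploaded from Python with the requests library. This is a minimal sketch; host-ip is a placeholder for the machine running the API server, and the file name is arbitrary:

import requests

# POST the file as multipart/form-data under the "file" field.
with open("my_document.pdf", "rb") as f:
    response = requests.post(
        "http://host-ip:8081/uploadDocument",
        files={"file": f},
    )
print(response.status_code)  # 200 on success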
Answer Generation Endpoint
Summary: Generate an answer to a question.
Endpoint: /generate
HTTP Method: POST
Operation ID: generate_answer_generate_post
Request:
Content-Type: application/json
Schema:
Prompt
Required: Yes
Request Body Parameters:
question (Type: string) - The question you want to ask.
context (Type: string) - Additional context for the question (optional).
use_knowledge_base (Type: boolean, Default: true) - Whether to use a knowledge base.
model_name (Type: string, Default: “llama2-7B-chat”) - The name of the language model to use.
num_tokens (Type: integer, Default: 500) - The maximum number of tokens in the response.
Responses:
200 - Successful Response
Description: The answer was successfully generated.
Response Body: An object containing the generated answer.
422 - Validation Error
Description: There was a validation error with the request.
Response Body: Details of the validation error.
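As an example, an answer can be requested from Python as shown below. This is a minimal sketch; host-ip is a placeholder, and because the chat frontend streams responses, the sketch reads the body incrementally (whether /generate streams in your deployment is an assumption worth verifying):

import requests

payload = {
    "question": "How does Triton serve inference requests?",
    "context": "",
    "use_knowledge_base": True,
    "model_name": "llama2-7B-chat",
    "num_tokens": 500,
}
# stream=True lets us print a streamed response as it arrives.
with requests.post("http://host-ip:8081/generate", json=payload, stream=True) as r:
    for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="")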
Document Search Endpoint
Summary: Search for documents based on content.
Endpoint: /documentSearch
HTTP Method: POST
Operation ID: document_search_documentSearch_post
Request:
Content-Type: application/json
Schema:
DocumentSearch
Required: Yes
Request Body Parameters:
content (Type: string) - The content or keywords to search for within documents.
num_docs (Type: integer, Default: 4) - The maximum number of documents to return in the response.
Responses:
200 - Successful Response
Description: Documents matching the search criteria were found.
Response Body: An object containing the search results.
422 - Validation Error
Description: There was a validation error with the request.
Response Body: Details of the validation error.
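As an example, a document search can be issued from Python as shown below (a minimal sketch; host-ip is a placeholder, and the exact shape of the returned object depends on the server):

import requests

payload = {
    "content": "TensorRT-LLM optimization",  # keywords to search for
    "num_docs": 4,                           # maximum documents to return
}
response = requests.post("http://host-ip:8081/documentSearch", json=payload)
print(response.json())  # object containing the search results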
The web frontend provides a UI on top of the APIs. Users can chat with the LLM and see responses streamed back. By selecting “Use knowledge base,” the chatbot returns responses augmented with the data that’s been stored in the vector database. To store content in the vector database, change the window to “Knowledge Base” in the upper right corner and upload documents.
If the web frontend needs to be stood up manually, run the following:
docker run -d \
--name llm-playground \
-p 8090:8090 \
--expose 8090 \
-e APP_SERVERURL=http://query \
-e APP_SERVERPORT=8081 \
-e APP_MODELNAME=${MODEL_NAME:-${MODEL_ARCHITECTURE}} \
nvcr.io/nvaie/genai-llm-playground:latest \
--port 8090
For development and experimentation purposes, Jupyter notebooks are provided that offer guidance for building knowledge-augmented chatbots.
If a JupyterLab server needs to be stood up manually, run the following:
docker run -d \
--name jupyter-notebook-server \
-p 8888:8888 \
--expose 8888 \
--gpus all \
nvcr.io/nvaie/genai-notebook-server:latest
The following Jupyter notebooks are provided with the AI workflow:
LLM Streaming Client
01-llm-streaming-client.ipynb
This notebook demonstrates how to use a client to stream responses from an LLM deployed to NVIDIA Triton Inference Server with NVIDIA TensorRT-LLM (TRT-LLM). This deployment format optimizes the model for low latency and high throughput inference.
Document Question-Answering with LangChain
02_langchain_simple.ipynb
This notebook demonstrates how to use LangChain to build a chatbot that references a custom knowledge base. LangChain provides a simple framework for connecting LLMs to your own data sources. It shows how to integrate TensorRT-LLM with LangChain using a custom wrapper.
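To illustrate the idea of a custom wrapper, the sketch below subclasses LangChain's LLM base class and forwards prompts to Triton. This is not the notebook's actual wrapper: the import path assumes a classic LangChain release, and the Triton tensor names (text_input, max_tokens, text_output) are assumptions based on the common TRT-LLM ensemble interface, so check your deployed model's config:

from typing import Any, List, Optional

import numpy as np
import tritonclient.grpc as grpcclient
from langchain.llms.base import LLM

class TritonLlama(LLM):
    server_url: str = "localhost:8001"
    model_name: str = "ensemble"

    @property
    def _llm_type(self) -> str:
        return "triton-trt-llm"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        client = grpcclient.InferenceServerClient(url=self.server_url)
        # Tensor names and shapes are assumptions; verify them against
        # the deployed ensemble's model configuration.
        text = grpcclient.InferInput("text_input", [1], "BYTES")
        text.set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))
        tokens = grpcclient.InferInput("max_tokens", [1], "INT32")
        tokens.set_data_from_numpy(np.array([256], dtype=np.int32))
        result = client.infer(self.model_name, inputs=[text, tokens])
        return result.as_numpy("text_output")[0].decode("utf-8")

A wrapper like this lets the Triton-hosted model be dropped into any LangChain chain in place of a stock LLM.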
Document Question-Answering with LlamaIndex
03_llama_index_simple.ipynb
This notebook demonstrates how to use LlamaIndex to build a chatbot that references a custom knowledge base. It contains the same functionality as the previous notebook, but uses some LlamaIndex components instead of LangChain components. It also shows how the two frameworks can be used together.
Advanced Document Question-Answering with LlamaIndex
04_llamaindex_hier_node_parser.ipynb
This notebook demonstrates how to use LlamaIndex to build a more complex retrieval for a chatbot. The retrieval method shown in this notebook works well for code documentation; it retrieves more contiguous document blocks that preserve both code snippets and explanations of code.
Interact with REST FastAPI Server
05_dataloader.ipynb
This notebook demonstrates how to use the REST FastAPI server to upload the knowledge base and then ask a question, first without and then with the knowledge base.