In-Depth Developer Guide

RAG Chatbot

This document provides more detail about the components used to build RAG solutions with a Llama2 model deployed on TRT-LLM. Use this guide if you want to understand how the components used in the quickstart work together, or if you want a deeper understanding of how to adapt the pipeline and its components to your own enterprise chatbot needs.

Before continuing with this guide, ensure the following prerequisites are met:

  • At least one NVIDIA GPU. For this guide, we used an A100 data center GPU.

    • NVIDIA driver version 535 or newer. To check the driver version run: nvidia-smi --query-gpu=driver_version --format=csv,noheader.

    • If you are running multiple GPUs, they must all be set to the same mode (i.e., Compute vs. Display).

Set up the following:

  • Docker and Docker Compose

    Note

    Please do not use the Docker version packaged with Ubuntu; a newer version of Docker is required for proper Docker Compose support.

    Make sure your user account is able to execute Docker commands.


  • NVIDIA Container Toolkit

  • NGC Account and API Key

  • Llama2 Chat Model Weights

    Note

    In this workflow, we will be leveraging a Llama2 (13B parameter) chat model, which requires 54 GB of GPU memory. If you prefer to leverage the 7B parameter model, it requires 38 GB of GPU memory; the 70B parameter model initially requires 240 GB of GPU memory. A quick command for checking the memory available on your GPUs is shown after this list.

    IMPORTANT: For this initial version of the workflow, an A100 GPU is supported.
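
To confirm that your GPUs have enough memory for the model size you plan to use, you can query them with nvidia-smi. This is a convenience check only and not part of the workflow itself:

nvidia-smi --query-gpu=name,memory.total --format=csv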


This reference workflow uses a variety of components and services to customize and deploy the RAG-based chatbot. The following diagram illustrates how they work together:

graph LR
    subgraph Demo Application
        frontend[Web Frontend] --> api[API Server]
        hosting[Triton Model Server]
        vector[Vector DB]
        api --> hosting
        api --> vector
        data_source[(Data Sources)] --> vector
    end
    jupyter[Jupyter Server]

Triton Model Server

The Triton Inference Server serves inference requests from models stored in a locally available model repository. Once the models are available in Triton, client applications send inference requests to the server. Python and C++ client libraries provide APIs that simplify this communication, and clients can also send requests directly to Triton using the HTTP/REST or gRPC protocols.
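
For example, once the model server container is running, you can verify that Triton is ready to serve requests with a simple readiness probe. This is a minimal sketch that assumes Triton's HTTP endpoint is exposed on its default port 8000 on the local host; adjust the host and port for your deployment:

# Prints 200 when the server is live and ready to accept inference requests
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready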

Within this workflow, the Llama2 LLM was optimized using NVIDIA TensorRT-LLM (TRT-LLM), which accelerates and maximizes inference performance on the latest LLMs. This step is automated in this workflow with the container shown below.

To convert Llama2 to TensorRT and host it on Triton Inference Server, modify and run the command below:


docker run --net host --ipc host --gpus 1 -v /path/to/llama/weights/llama-2-13b-chat/:/model --rm nvcr.io/nvaie/genai-model-server:latest llama

Note

Be sure to update the values in the -v flag so that your mount point is set appropriately. The Llama2 weights path ends in a pattern like /llama-2-13b-chat, and the weights must be mounted to /model inside the container.
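
After the conversion completes, you can check that the optimized model has been loaded by Triton. A minimal sketch, assuming the default HTTP port 8000 and the ensemble model name referenced later in this workflow:

# Prints 200 once the "ensemble" model is loaded and ready for inference
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/models/ensemble/ready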

Vector DB

When content from the knowledge base is passed to an embedding model (e5-large-v2), it converts the content to vectors (referred to as “embeddings”). These embeddings are stored in a vector database. The vector DB used in this workflow is Milvus. Milvus is an open-source vector database capable of NVIDIA GPU accelerated vector searches.

Note

If needed, see Milvus’s documentation for how a Docker Compose file can be configured for Milvus.
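
Once Milvus is running, a quick liveness check can confirm it is reachable before the rest of the workflow is wired to it. This sketch assumes a standalone Milvus deployment with default ports, where the health probe is served on port 9091 while the gRPC service itself listens on 19530 (the address referenced by APP_MILVUS_URL below); adjust for your deployment:

# Returns "OK" when the Milvus standalone instance is healthy
curl -s http://localhost:9091/healthz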

API Server

A sample chatbot web application is provided in the workflow so that you can test the chat system in an interactive manner. Requests to the chat system are wrapped in API calls, so these can be abstracted to other applications.

This API endpoint allows for several actions:

  • Uploading file(s)

  • Answer generation

  • Document search

If the API endpoint needs to be stood up manually, run the command below:


docker run -d \
    --name query-router \
    -p 8081:8081 \
    --expose 8081 \
    --shm-size 5g \
    -e APP_MILVUS_URL="http://milvus:19530" \
    -e APP_TRITON_SERVERURL="triton:8001" \
    -e APP_TRITON_MODELNAME=ensemble \
    nvcr.io/nvaie/chain-server:latest \
    --port 8081 --host 0.0.0.0

The API server's interactive documentation can be viewed at host-ip:8081/docs. The following sections describe the API endpoint actions further.

Upload File Endpoint

Summary: Upload a file.

Endpoint: /uploadDocument

HTTP Method: POST

Request:

  • Content-Type: multipart/form-data

  • Schema: Body_upload_file_uploadDocument_post

  • Required: Yes

Request Body Parameters:

  • file (Type: File) - The file to be uploaded.

Responses:

  • 200 - Successful Response

    • Description: The file was successfully uploaded.

    • Response Body: Empty

  • 422 - Validation Error

    • Description: There was a validation error with the request.

    • Response Body: Details of the validation error.
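
As an illustration, a file can be uploaded to this endpoint with curl. This is a minimal sketch assuming the API server is reachable at host-ip:8081 as described above; the form field name file comes from the request body parameters, and the document path is a placeholder:

# Upload a document to be ingested into the knowledge base
curl -X POST "http://host-ip:8081/uploadDocument" \
    -H "Accept: application/json" \
    -F "file=@/path/to/your-document.pdf"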

Answer Generation Endpoint

Summary: Generate an answer to a question.

Endpoint: /generate

HTTP Method: POST

Operation ID: generate_answer_generate_post

Request:

  • Content-Type: application/json

  • Schema: Prompt

  • Required: Yes

Request Body Parameters:

  • question (Type: string) - The question you want to ask.

  • context (Type: string) - Additional context for the question (optional).

  • use_knowledge_base (Type: boolean, Default: true) - Whether to use a knowledge base.

  • model_name (Type: string, Default: “llama2-7B-chat”) - The name of the language model to use.

  • num_tokens (Type: integer, Default: 500) - The maximum number of tokens in the response.

Responses:

  • 200 - Successful Response

    • Description: The answer was successfully generated.

    • Response Body: An object containing the generated answer.

  • 422 - Validation Error

    • Description: There was a validation error with the request.

    • Response Body: Details of the validation error.
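
For example, an answer can be requested with curl. This is a minimal sketch, again assuming the API server is reachable at host-ip:8081; the question text is a placeholder and the remaining fields mirror the parameters listed above:

# Ask a question, augmenting the answer with the knowledge base
curl -X POST "http://host-ip:8081/generate" \
    -H "Content-Type: application/json" \
    -d '{
          "question": "What GPUs does this workflow support?",
          "context": "",
          "use_knowledge_base": true,
          "num_tokens": 500
        }'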

Document Search Endpoint

Summary: Search for documents based on content.

Endpoint: /documentSearch

HTTP Method: POST

Operation ID: document_search_documentSearch_post

Request:

  • Content-Type: application/json

  • Schema: DocumentSearch

  • Required: Yes

Request Body Parameters:

  • content (Type: string) - The content or keywords to search for within documents.

  • num_docs (Type: integer, Default: 4) - The maximum number of documents to return in the response.

Responses:

  • 200 - Successful Response

    • Description: Documents matching the search criteria were found.

    • Response Body: An object containing the search results.

  • 422 - Validation Error

    • Description: There was a validation error with the request.

    • Response Body: Details of the validation error.
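
Similarly, a document search can be issued with curl. A minimal sketch assuming the API server at host-ip:8081; the search string is a placeholder:

# Retrieve up to 4 documents most relevant to the given content
curl -X POST "http://host-ip:8081/documentSearch" \
    -H "Content-Type: application/json" \
    -d '{"content": "GPU memory requirements", "num_docs": 4}'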

The web frontend provides a UI on top of the APIs. Users can chat with the LLM and see responses streamed back. By selecting “Use knowledge base,” the chatbot returns responses augmented with the data that’s been stored in the vector database. To store content in the vector database, change the window to “Knowledge Base” in the upper right corner and upload documents.

If the web frontend needs to be stood up manually, run the following:


docker run -d \
    --name llm-playground \
    -p 8090:8090 \
    --expose 8090 \
    -e APP_SERVERURL=http://query \
    -e APP_SERVERPORT=8081 \
    -e APP_MODELNAME=${MODEL_NAME:-${MODEL_ARCHITECTURE}} \
    nvcr.io/nvaie/genai-llm-playground:latest \
    --port 8090

For development and experimentation purposes, the Jupyter notebooks provide guidance to building knowledge augmented chatbots.

If a JupyterLab server needs to be stood up manually, run the following:


docker run -d \
    --name jupyter-notebook-server \
    -p 8888:8888 \
    --expose 8888 \
    --gpus all \
    nvcr.io/nvaie/genai-notebook-server:latest

The following Jupyter notebooks are provided with the AI workflow:

  1. LLM Streaming Client 01-llm-streaming-client.ipynb

This notebook demonstrates how to use a client to stream responses from an LLM deployed to NVIDIA Triton Inference Server with NVIDIA TensorRT-LLM (TRT-LLM). This deployment format optimizes the model for low latency and high throughput inference.

  2. Document Question-Answering with LangChain 02_langchain_simple.ipynb

This notebook demonstrates how to use LangChain to build a chatbot that references a custom knowledge-base. LangChain provides a simple framework for connecting LLMs to your own data sources. It shows how to integrate a TensorRT-LLM model into LangChain using a custom wrapper.

  3. Document Question-Answering with LlamaIndex 03_llama_index_simple.ipynb

This notebook demonstrates how to use LlamaIndex to build a chatbot that references a custom knowledge-base. It provides the same functionality as the previous notebook, but uses LlamaIndex components in place of some LangChain components. It also shows how the two frameworks can be used together.

  4. Advanced Document Question-Answering with LlamaIndex 04_llamaindex_hier_node_parser.ipynb

This notebook demonstrates how to use LlamaIndex to build a more complex retrieval for a chatbot. The retrieval method shown in this notebook works well for code documentation; it retrieves more contiguous document blocks that preserve both code snippets and explanations of code.

  5. Interact with REST FastAPI Server 05_dataloader.ipynb

This notebook demonstrates how to use the REST FastAPI server to upload documents to the knowledge base and then ask a question both with and without the knowledge base.

© Copyright 2022-2023, NVIDIA. Last updated on Nov 20, 2023.