Quickstart Developer Guide

RAG Chatbot

This guide will help you jumpstart building RAG solutions with a Llama2 model deployed on TRT-LLM.

Please refer to the In-Depth Developer Guide for more detailed information pertaining to the technical components used within the AI workflow.

Before continuing with this guide, ensure the following prerequisites are met:

  • At least one NVIDIA GPU. For this guide, we used an A100 data center GPU.

    • NVIDIA driver version 535 or newer. To check the driver version, run: nvidia-smi --query-gpu=driver_version --format=csv,noheader.

    • If you are running multiple GPUs, they must all be set to the same mode (i.e., Compute vs. Display).

Set up the following:

  • Docker and Docker-Compose

    Note

    Do not use the Docker version packaged with Ubuntu; a newer version of Docker is required for proper Docker Compose support.

    Make sure your user account is able to execute Docker commands; one way to set this up is shown below.
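
    A minimal sketch, assuming a standard Docker Engine installation that created the docker group:

    # Add the current user to the docker group (group is created by the Docker install)
    sudo usermod -aG docker $USER
    # Start a new shell with the updated group membership (or log out and back in)
    newgrp docker
    # Verify that Docker and the Compose plugin work without sudo
    docker compose version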


  • NVIDIA Container Toolkit
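
    To confirm the toolkit can expose GPUs to containers, a quick smoke test is to run nvidia-smi inside a CUDA container (the image tag below is only an example; any CUDA base image works):

    # The container should print the same GPU table as nvidia-smi on the host
    docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi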

  • NGC Account and API Key
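
    With an API key generated from the NGC website, configure the NGC CLI and log Docker in to the NGC registry. These commands assume the NGC CLI is already installed; the username for nvcr.io is literally $oauthtoken, and the password is your API key:

    # Interactive prompt for your API key, org, and team
    ngc config set
    # Authenticate Docker against NGC's container registry
    docker login nvcr.io --username '$oauthtoken'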

  • Llama2 Chat Model Weights

    Note

    In this workflow, we will be leveraging a Llama2 chat model with 13B parameters, which requires 50 GB of GPU memory. If you prefer to use the 7B parameter model, it requires 38 GB of GPU memory. The 70B parameter model initially requires 240 GB of GPU memory.

    IMPORTANT: For this initial version of the workflow, only the A100 GPU is supported.
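
    One way to obtain the weights, assuming you have been granted access to the gated meta-llama repositories on Hugging Face and have git-lfs installed, is to clone the model repository into the directory used later in compose.env:

    # Clone the 13B chat weights (gated repo; prompts for your Hugging Face credentials)
    git lfs install
    git clone https://huggingface.co/meta-llama/Llama-2-13b-chat-hf "$HOME/src/Llama-2-13b-chat-hf"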


Once the prerequisites above are met, you can run the AI workflow.

Step 1: Download the Docker Compose file

The Docker Compose files for this workflow have been published to NGC and can be downloaded directly using your browser.

The following command pulls the Docker Compose files with the NGC CLI:


ngc registry resource download-version "nvaie/rag-workflow-docker-compose:1"

Note

To access the Docker Compose files, you must be logged into NGC and have access to the Enterprise Catalog.
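
The NGC CLI downloads the resource into a version-suffixed directory. The directory name below follows the usual pattern but is an assumption; check your working directory after the download completes:

# Change into the downloaded resource directory (name may differ) and list its contents
cd rag-workflow-docker-compose_v1
ls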

Step 2: Set Environment Variables

  1. Modify compose.env in the root directory to set your environment variables. The following variables are required.

    # full path to the local copy of the model weights
    export MODEL_DIRECTORY="$HOME/src/Llama-2-13b-chat-hf"

    # the architecture of the model. eg: llama
    export MODEL_ARCHITECTURE="llama"

    # the name of the model being used - only for displaying on frontend
    export MODEL_NAME="llama-2-13b-chat"
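
    To sanity-check the file before starting the stack, you can source it and print the values back (a trivial verification, assuming compose.env is in the current directory):

    # Confirm the three required variables are set
    source compose.env
    echo "$MODEL_DIRECTORY $MODEL_ARCHITECTURE $MODEL_NAME"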


Step 3: Build and Start Containers

  1. Run the following command to build and start containers.


    source compose.env; docker compose up -d

    Note

    It will take a few minutes for the containers to come up, and up to 5 minutes for the Triton server to be ready. The -d flag runs the services in the background.
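
    While you wait, you can follow the aggregated service logs from the directory containing the compose file:

    # Stream logs from all services; press Ctrl+C to stop following
    docker compose logs -f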


  2. Run docker ps -a. When the containers are ready, the output should look similar to the image below.

    [Image: docker-output.png, example docker ps -a output showing the workflow containers]


Step 4: Experiment with RAG in JupyterLab

This AI Workflow includes Jupyter notebooks which allow you to experiment with RAG.

  1. Using a web browser, open Jupyter at the following URL, replacing host-ip with the IP address of the machine running the containers:

    http://host-ip:8888
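
    If you are unsure of the host's IP address, you can list the host's addresses (assuming a Linux host) and use the primary one:

    # Print the host's IP addresses; substitute one for host-ip in the URL above
    hostname -I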

  2. Locate the LLM Streaming Client notebook, 01-llm-streaming-client.ipynb, which demonstrates how to stream responses from the LLM.

  3. Proceed with the next 4 notebooks:

    • Document Question-Answering with LangChain 02_langchain_simple.ipynb

    • Document Question-Answering with LlamaIndex 03_llama_index_simple.ipynb

    • Advanced Document Question-Answering with LlamaIndex 04_llamaindex_hier_node_parser.ipynb

    • Interact with REST FastAPI Server 05_dataloader.ipynb (an example request follows this list)
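
    If you prefer to exercise the REST server from a shell rather than the notebook, a request might look like the sketch below. The port and the /uploadDocument route are assumptions here, so confirm the exact endpoint in 05_dataloader.ipynb:

    # Upload a document to the retrieval service (endpoint and port are assumptions)
    curl -X POST "http://host-ip:8081/uploadDocument" -F "file=@nvlink.pdf"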

Step 5: Run the Sample Web Application

A sample chatbot web application is provided in the workflow. Requests to the chat system are wrapped in FastAPI calls.

  1. Open the web application at http://host-ip:8090.

  2. Type in the following question without using a knowledge base: “How many cores are on the Nvidia Grace superchip?”

    Note that the chatbot mentions the chip doesn’t exist.

  3. To use a knowledge base, click the Knowledge Base tab and upload the file nvlink.pdf.

  4. Return to the Converse tab and check [X] Use knowledge base.

  5. Retype the question: “How many cores are on the Nvidia Grace superchip?”

    Uploading a file: [Animation: upload_file.gif]

    Selecting to use the knowledge base: [Animation: use_kb.gif]


© Copyright 2022-2023, NVIDIA. Last updated on Nov 20, 2023.