Generative AI enables users to quickly generate new content based on a variety of inputs and is a powerful tool for streamlining the workflow of creatives, engineers, researchers, scientists, and more. The use cases and possibilities span all industries and individuals. Generative AI models can produce novel content like stories, emails, music, images, and videos.
Here at NVIDIA, we like to use our own products to make our lives easier, so we have used generative AI to create an NVIDIA chatbot enhanced with retrieval augmented generation. This chatbot answers questions for employees based on our bug filing and tracking system. Our development and deployment of that chatbot served as the basis for this reference generative AI workflow.
Generative AI starts with foundational models trained on vast quantities of unlabeled data. Large language models (LLMs) are trained on an extensive range of textual data online. These LLMs can understand prompts and generate novel, human-like responses. Businesses can build applications that leverage this capability of LLMs; for example, creative writing assistants for marketing, document summarization for legal teams, and code generation for software development.
To create true business value from LLMs, these foundational models need to be tailored to your enterprise use case.
In this workflow, we apply several techniques to Meta’s open-source Llama 2 model to achieve this: prompt-learning, retrieval augmented generation (RAG), and prompt-templating. Adapting an existing AI foundational model provides an advanced starting point and a low-cost solution that enterprises can leverage to generate accurate and relevant responses for their domain-specific use cases.
This RAG-based reference chatbot workflow contains:
- NVIDIA NeMo framework for prompt-tuning to generate tailored responses
- NVIDIA TensorRT LLM (TRT-LLM) for low-latency, high-throughput LLM inference
- LangChain for combining language model components and easily constructing question-answering pipelines over a company’s data
- Cloud-native deployable bundle packaged as Helm charts
This RAG chatbot workflow provides a reference for you to build your own enterprise AI solution with minimal effort. It includes enterprise-ready implementation best practices such as secrets management, monitoring, reporting, and load balancing, helping you achieve the desired AI outcome more quickly while still allowing a path for you to deviate.
The components and instructions in the workflow are intended as examples for integration and may not be production-ready on their own. The workflow should be used as a reference, then customized and integrated into your own environment.
This AI workflow was designed to be deployed on a cloud-native NVIDIA AI Enterprise-supported Kubernetes-based platform, which can be deployed on-prem or using a cloud service provider (CSP).
This reference workflow uses a variety of NVIDIA AI components to customize and deploy the RAG-based chatbot. These components are used to build and deploy training and inference pipelines, integrated together with the additional components as indicated in the diagram below:
- NVIDIA NeMo Framework
- NVIDIA TensorRT LLM
- NVIDIA Triton Inference Server
- NVIDIA Cloud Native Add-On Pack
The following sections describe these NVIDIA AI components further.
NVIDIA NeMo Framework
As discussed, using a foundation model out of the box can be challenging since models are trained to handle a wide variety of tasks but may not contain domain- or enterprise-specific knowledge. NVIDIA NeMo framework helps solve this; it is an end-to-end, cloud-native framework to build, customize, and deploy generative AI models anywhere. The framework includes training and inferencing software, guardrailing, and data curation tools, and supports state-of-the-art community and NVIDIA pretrained LLMs, offering enterprises an easy, cost-effective, and fast way to adopt generative AI.
NVIDIA TensorRT LLM (TRT-LLM)
Once the LLM has been customized, it can be optimized using NVIDIA TRT-LLM. NVIDIA NeMo uses TensorRT LLM (TRT-LLM) for deployment, which accelerates and maximizes inference performance on the latest LLMs.
NVIDIA Triton Inference Server
With NVIDIA Triton Inference Server, the optimized LLM can be deployed for high-performance, cost-effective, and low-latency inference. NVIDIA Triton Inference Server, supported by NVIDIA AI Enterprise, is an inference-serving software that streamlines AI inferencing.
Cloud Native Add-On Pack
The NVIDIA Cloud Native Service Add-on Pack is also used within this workflow. This add-on pack is a set of packaged components, designed for AI workflows, that provides the basic functionality required for enterprise deployments of AI applications on Kubernetes-based infrastructure. The AI framework specific to language generative AI is also included as an OCI-compliant base container image. The following additional components are included within the NVIDIA Knowledge Base Chatbot Generative AI Workflow and are used within this lab:
- Prometheus/Grafana for inference pipeline monitoring and dashboards
- cert-manager for managing TLS certificates
- HAProxy Kubernetes Ingress
In this workflow, we demonstrate prompt learning, a customization technique. Not all foundation models need to be customized; whether customization is worthwhile depends on both the foundation model and the use case. Therefore, this part of the AI workflow is considered optional.
There are additional methods of customization that the NVIDIA NeMo framework supports that are not shown in this workflow.
In this workflow, we will be leveraging a Llama 2 (13B parameter) base model. We prompt-tune the base model on the Databricks Dolly 15k dataset, an open-source dataset of instruction-following records. This better equips the base model for closed-book question-answering, a valuable task for question-answering chatbots.
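Prompt-tuning datasets are typically prepared as JSONL records that pair an input prompt with the expected completion. The sketch below converts Dolly 15k-style records (which use `instruction`, `context`, and `response` fields) into such a layout; the output field names (`taskname`, `input`, `output`) are assumptions and should be aligned with the prompt template defined in your NeMo prompt-learning config.

```python
import json

def dolly_to_prompt_learning(records, taskname="dolly"):
    """Convert Dolly 15k-style records into a JSONL-ready layout for
    prompt-tuning. Field names on the output side are illustrative
    assumptions -- match them to your NeMo training config."""
    rows = []
    for rec in records:
        prompt = rec["instruction"]
        if rec.get("context"):
            # Dolly records optionally carry supporting context.
            prompt += "\n" + rec["context"]
        rows.append({"taskname": taskname, "input": prompt, "output": rec["response"]})
    return rows

def write_jsonl(rows, path):
    """Write one JSON object per line, the usual prompt-tuning format."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

A converter like this would run once over the downloaded dataset before launching the prompt-tuning job.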
Llama 2 models are considered very accurate and provide great results in most situations. However, if you need to tailor them to a specific task, prompt-tuning can be leveraged.
Once the LLM is customized within the workflow, we convert it to an optimized TensorRT engine using NVIDIA TRT-LLM. This conversion accelerates and optimizes the LLM for inference. We use this prompt-tuned and converted LLM later, during the inference portion of the workflow.
To get started with the inferencing pipeline, we will first connect the customized LLM to a sample proprietary data source. This knowledge can come in many forms: product specifications, HR documents, or finance spreadsheets. Enhancing the model’s capabilities with this knowledge can be done with retrieval augmented generation (RAG).
Since foundational LLMs are not trained on your proprietary enterprise data and are only trained up to a fixed point in time, they need to be augmented with additional data. RAG consists of two processes: first, retrieval of data from sources such as document repositories, databases, or APIs, all outside the foundational model’s knowledge; second, generation of responses. Within this workflow, we will use LangChain, which provides a simple framework for connecting LLMs to data sources. The example used within this workflow is an internal bug reporting/tracking system accessed via APIs.
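The two RAG processes can be sketched as a pair of functions: a retriever that selects relevant documents, and a generator that hands them to the LLM as context. This is a minimal stdlib sketch, not the workflow's implementation: relevance here is naive word overlap (the workflow uses embeddings and a vector database), and the "generation" step only assembles the prompt a Triton-served Llama 2 model would receive.

```python
def retrieve(query, knowledge_base, top_k=2):
    """Step 1: pull the documents most relevant to the query.
    Naive word-overlap scoring stands in for embedding search."""
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(set(query.lower().split()) & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(query, context_docs):
    """Step 2: combine retrieved context with the query for the LLM.
    A real deployment sends this prompt to the Triton-served model."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

For example, `generate("why does login crash", retrieve("why does login crash", kb))` yields a context-augmented prompt rather than the bare question.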
Document Retrieval and Ingestion
RAG begins with a knowledge base of relevant up-to-date information. Since data within an enterprise is frequently updated, the ingestion of documents into a knowledge base should be a recurring process and scheduled as a job. Next, content from the knowledge base is passed to an embedding model (SentenceTransformers, in the case of this workflow), which converts the content to vectors (referred to as “embeddings”). These embeddings are stored in a vector database (Milvus, in the case of this workflow).
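The ingestion step above amounts to: embed each document, then store the (vector, document) pair. The sketch below uses a toy hashed bag-of-words embedding and an in-memory list as stand-ins for SentenceTransformers and Milvus; it illustrates the data flow, not the production components.

```python
import math
from collections import Counter

def embed(text, dim=64):
    """Toy embedding: hashed bag-of-words, L2-normalized.
    Stands in for a real model such as SentenceTransformers."""
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """In-memory stand-in for a vector database like Milvus."""
    def __init__(self):
        self.entries = []  # (embedding, document) pairs

    def ingest(self, documents):
        # The recurring ingestion job: embed and store each document.
        for doc in documents:
            self.entries.append((embed(doc), doc))
```

In the actual workflow, the `ingest` step would be a scheduled job that pulls fresh content from the knowledge base and upserts embeddings into Milvus.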
User Query and Response Generation
When a user query is sent to the inference server, it is converted to an embedding using the embedding model. This is the same embedding model used to convert the documents in the knowledge base (SentenceTransformers, in the case of this workflow). The database performs a similarity/semantic search to find the vectors that most closely resemble the user’s intent and provides them to the LLM as enhanced context. Lastly, the LLM is used to generate a full answer that’s streamed to the user. This is all done with ease via LangChain.
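The query path described above is: embed the query, rank the stored vectors by similarity, and prepend the top matches to the LLM prompt. A minimal sketch of the similarity search and context assembly, assuming document embeddings were produced at ingestion time (the real search happens inside Milvus, and the final prompt goes to the Triton-served LLM):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def build_context(query_vec, indexed_docs, top_k=2):
    """indexed_docs: (embedding, text) pairs from ingestion.
    Returns the enhanced context the LLM would receive; a real
    deployment streams the LLM's generated answer back instead."""
    ranked = sorted(indexed_docs, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    context = "\n".join(text for _, text in ranked[:top_k])
    return f"Use the context below to answer the question.\nContext:\n{context}\n"
```

The same embedding model must be used for queries and documents, otherwise the similarity scores are meaningless.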
LangChain allows you to write LLM wrappers for your own custom LLMs, so we have provided a sample wrapper for streaming responses from a TRT-LLM Llama 2 model running on Triton Inference Server (TIS). This wrapper allows us to leverage LangChain’s standard interface for interacting with LLMs while still achieving vast performance speedup from TRT-LLM and scalable and flexible inference from TIS.
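The wrapper pattern is simple: adapt a streaming inference client to the text-in/text-out interface LangChain expects. The workflow's actual wrapper subclasses LangChain's LLM base class and talks to Triton; this stdlib sketch shows only the shape, with `stream_fn` as a hypothetical stand-in for the Triton streaming call.

```python
class TritonLlamaWrapper:
    """Sketch of an LLM wrapper around a streaming inference client.
    `stream_fn` is a hypothetical stand-in: a callable that takes a
    prompt and returns an iterator of generated tokens."""

    def __init__(self, stream_fn):
        self.stream_fn = stream_fn

    def stream(self, prompt):
        # Yield tokens as the server produces them, for responsive UIs.
        yield from self.stream_fn(prompt)

    def __call__(self, prompt):
        # Blocking LangChain-style call: join the streamed tokens.
        return "".join(self.stream(prompt))
```

Keeping both a streaming and a blocking entry point lets the same wrapper serve the chat UI (token by token) and batch-style LangChain chains (full string).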
A sample chatbot web application is provided in the workflow so that you can test the chat system in an interactive manner. Requests to the chat system are wrapped in API calls, so these can be abstracted to other applications.
An additional method of customization in the AI Workflow inference pipeline is via a prompt template. A prompt template is a pre-defined recipe for generating prompts for language models. It may contain instructions, few-shot examples, and context appropriate for a given task. In our example, we prompt our model to generate safe and polite responses.
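In code, a prompt template is just a parameterized string with slots for the retrieved context and the user's question. The wording below is illustrative, not the workflow's actual template:

```python
# Illustrative prompt template: an instruction, a slot for retrieved
# context, and a slot for the user's question.
PROMPT_TEMPLATE = """You are a helpful assistant. Answer politely, and
say "I don't know" if the context is insufficient.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context, question):
    """Fill the template slots before sending the prompt to the LLM."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Instructions like "answer politely" or "say 'I don't know'" are how the template steers the model toward safe, grounded responses.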
The following diagram illustrates the retrieval of documents and generation of responses.
NVIDIA NeMo Guardrails, not shown in this workflow but open-source and available to use, can take prompt-templating to the next level by guiding an entire conversation in more nuanced and complex ways.
NVIDIA NGC is used as model storage in this workflow, but you are free to choose different model storage solutions like MLFlow or AWS SageMaker.
Milvus is an open-source vector database built to power embedding similarity search and AI applications. It makes unstructured data from API calls, PDFs, and other documents more accessible. It’s a cloud-native vector database with storage and computation separated by design.
Prometheus is an open-source monitoring and alerting solution. In this workflow, it stores pipeline performance metrics from Triton, which enables system administrators to understand the health and throughput of the system. While the metrics are available in plain text, Grafana is also used for visualization of the metrics via a dashboard. Some of the available metrics are shown below; depending on the usage metrics, the Triton pods scale automatically.
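Triton exposes its metrics in the Prometheus text exposition format (by default on an HTTP metrics endpoint). A minimal sketch of scraping that plain-text format; the `SAMPLE` payload is illustrative of Triton's `nv_inference_*` metric family, not a real scrape:

```python
# Illustrative sample of Prometheus text-format metrics, in the shape
# of Triton's nv_inference_* counters (values are made up).
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="llama2",version="1"} 512
nv_inference_queue_duration_us{model="llama2",version="1"} 74000
"""

def parse_metrics(text):
    """Parse Prometheus text-format metrics into a dict keyed by
    metric name plus labels."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name_labels, value = line.rsplit(" ", 1)
        metrics[name_labels] = float(value)
    return metrics
```

In practice, Prometheus scrapes this endpoint directly and Grafana visualizes the stored series; a parser like this is only useful for ad-hoc inspection.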