Generative AI enables users to quickly generate new content based on a variety of inputs and is a powerful tool for streamlining the workflow of creatives, engineers, researchers, scientists, and more. The use cases and possibilities span all industries and individuals. Generative AI models can produce novel content like stories, emails, music, images, and videos.
Here at NVIDIA, we like to use our own products to make our lives easier, so we have used generative AI to create an NVIDIA chatbot enhanced with retrieval augmented generation. This chatbot answers questions for employees based on our bug filing and tracking system. Our development and deployment of that chatbot served as the basis for this reference generative AI workflow.
Generative AI starts with foundational models trained on vast quantities of unlabeled data. Large language models (LLMs) are trained on an extensive range of textual data online. These LLMs can understand prompts and generate novel, human-like responses. Businesses can build applications that leverage this capability of LLMs; for example, creative writing assistants for marketing, document summarization for legal teams, and code generation for software development.
To create true business value from LLMs, these foundational models need to be tailored to your enterprise use case.
In this workflow, we apply several techniques to Meta’s open-source Llama 2 model to achieve this: prompt-learning, retrieval augmented generation (RAG), and prompt-templating. Adapting an existing AI foundational model provides an advanced starting point and a low-cost solution that enterprises can leverage to generate accurate and relevant responses for their domain-specific use cases.
This RAG-based reference chatbot workflow contains:
- NVIDIA NeMo framework for prompt-tuning to generate tailored responses
- NVIDIA TensorRT LLM (TRT-LLM) for low-latency, high-throughput LLM inference
- LangChain for combining language model components and easily constructing question-answering pipelines over a company’s data
- Cloud-native deployable bundle packaged as Helm charts
This RAG chatbot workflow provides a reference for you to build your own enterprise AI solution with minimal effort. It includes enterprise-ready implementation best practices such as secrets management, monitoring, reporting, and load balancing, helping you achieve the desired AI outcome more quickly while still allowing a path for you to deviate.
The components and instructions in the workflow are intended as examples for integration and may not be production-ready on their own. The workflow should be used as a reference, then customized and integrated into your own environment.
This AI workflow was designed to be deployed on a cloud-native NVIDIA AI Enterprise-supported Kubernetes-based platform, which can be deployed on-prem or using a cloud service provider (CSP).
This reference workflow uses a variety of NVIDIA AI components to customize and deploy the RAG-based chatbot. These components are used to build and deploy training and inference pipelines, integrated together with the additional components as indicated in the diagram below:
- NVIDIA NeMo Framework
- NVIDIA TensorRT LLM
- NVIDIA Triton Inference Server
- NVIDIA Cloud Native Add-On Pack
The following sections describe these NVIDIA AI components further.
NVIDIA NeMo Framework
As discussed, using a foundation model out of the box can be challenging since models are trained to handle a wide variety of tasks but may not contain domain- or enterprise-specific knowledge. NVIDIA NeMo framework helps solve this; it is an end-to-end, cloud-native framework to build, customize, and deploy generative AI models anywhere. The framework includes training and inferencing software, guardrailing, and data curation tools, and supports state-of-the-art community and NVIDIA pretrained LLMs, offering enterprises an easy, cost-effective, and fast way to adopt generative AI.
NVIDIA TensorRT LLM (TRT-LLM)
Once the LLM has been customized, it can be optimized using NVIDIA TRT-LLM. NVIDIA NeMo uses TensorRT LLM (TRT-LLM) for deployment, which accelerates and maximizes inference performance on the latest LLMs.
NVIDIA Triton Inference Server
With NVIDIA Triton Inference Server, the optimized LLM can be deployed for high-performance, cost-effective, and low-latency inference. NVIDIA Triton Inference Server, supported by NVIDIA AI Enterprise, is an inference-serving software that streamlines AI inferencing.
Cloud Native Add-On Pack
The NVIDIA Cloud Native Service Add-on Pack is also used within this workflow. This add-on pack is a set of packaged components, designed for AI workflows, that provides the basic functionality required for enterprise deployments of AI applications on Kubernetes-based infrastructure. The AI framework specific to language generative AI is also included as an OCI-compliant base container image. The following additional components are included within the NVIDIA Knowledge Base Chatbot Generative AI Workflow and are used within this lab:
- Prometheus/Grafana for inference pipeline monitoring and dashboards
- cert-manager for managing TLS certificates
- HAProxy Kubernetes Ingress
In this workflow, we demonstrate prompt learning, a customization technique. Not all foundation models need to be customized; whether customization is worthwhile depends on both the foundation model and the use case. Therefore, this part of the AI workflow is considered optional.
There are additional methods of customization that the NVIDIA NeMo framework supports that are not shown in this workflow.
In this workflow, we will be leveraging a Llama 2 (13B parameter) base model. We prompt-tune the base model on the Databricks Dolly 15k dataset, an open-source dataset of instruction-following records. This better equips the base model for closed-book question-answering, a valuable task for question-answering chatbots.
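Prompt-tuning datasets are typically prepared as JSONL records that pair an input prompt with the expected completion. The sketch below converts Dolly 15k-style records (which use `instruction`, `context`, and `response` fields) into such a layout; the output field names (`taskname`, `input`, `output`) are assumptions and should be aligned with the prompt template defined in your NeMo prompt-learning config.

```python
import json

def dolly_to_prompt_learning(records, taskname="dolly"):
    """Convert Dolly 15k-style records into a JSONL-ready layout for
    prompt-tuning. Field names on the output side are illustrative
    assumptions -- match them to your NeMo training config."""
    rows = []
    for rec in records:
        prompt = rec["instruction"]
        if rec.get("context"):
            # Dolly records optionally carry supporting context.
            prompt += "\n" + rec["context"]
        rows.append({"taskname": taskname, "input": prompt, "output": rec["response"]})
    return rows

def write_jsonl(rows, path):
    """Write one JSON object per line, the usual prompt-tuning format."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

A converter like this would run once over the downloaded dataset before launching the prompt-tuning job.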
Llama 2 models are considered very accurate and provide great results in most situations. However, if you need to tailor them to a specific task, prompt-tuning can be leveraged.
Once the LLM is customized within the workflow, we convert it to an optimized TensorRT engine using NVIDIA TRT-LLM. This conversion accelerates and optimizes the LLM for inference. We use this prompt-tuned and converted LLM later, during the inference portion of the workflow.
To get started with the inferencing pipeline, we will first connect the customized LLM to a sample proprietary data source. This knowledge can come in many forms: product specifications, HR documents, or finance spreadsheets. Enhancing the model’s capabilities with this knowledge can be done with retrieval augmented generation (RAG).
Since foundational LLMs are not trained on your proprietary enterprise data and are only trained up to a fixed point in time, they need to be augmented with additional data. RAG consists of two processes: first, retrieval of data from sources such as document repositories, databases, or APIs, all outside the foundational model’s knowledge; second, generation of responses. Within this workflow, we will use LangChain, which provides a simple framework for connecting LLMs to data sources. The example used within this workflow is an internal bug reporting/tracking system accessed via APIs.
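The two RAG processes can be sketched as a pair of functions: a retriever that selects relevant documents, and a generator that hands them to the LLM as context. This is a minimal stdlib sketch, not the workflow's implementation: relevance here is naive word overlap (the workflow uses embeddings and a vector database), and the "generation" step only assembles the prompt a Triton-served Llama 2 model would receive.

```python
def retrieve(query, knowledge_base, top_k=2):
    """Step 1: pull the documents most relevant to the query.
    Naive word-overlap scoring stands in for embedding search."""
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(set(query.lower().split()) & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(query, context_docs):
    """Step 2: combine retrieved context with the query for the LLM.
    A real deployment sends this prompt to the Triton-served model."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

For example, `generate("why does login crash", retrieve("why does login crash", kb))` yields a context-augmented prompt rather than the bare question.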
Document Retrieval and Ingestion
RAG begins with a knowledge base of relevant up-to-date information. Since data within an enterprise is frequently updated, the ingestion of documents into a knowledge base should be a recurring process and scheduled as a job. Next, content from the knowledge base is passed to an embedding model (SentenceTransformers, in the case of this workflow), which converts the content to vectors (referred to as “embeddings”). These embeddings are stored in a vector database (Milvus, in the case of this workflow).
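The ingestion step above amounts to: embed each document, then store the (vector, document) pair. The sketch below uses a toy hashed bag-of-words embedding and an in-memory list as stand-ins for SentenceTransformers and Milvus; it illustrates the data flow, not the production components.

```python
import math
from collections import Counter

def embed(text, dim=64):
    """Toy embedding: hashed bag-of-words, L2-normalized.
    Stands in for a real model such as SentenceTransformers."""
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """In-memory stand-in for a vector database like Milvus."""
    def __init__(self):
        self.entries = []  # (embedding, document) pairs

    def ingest(self, documents):
        # The recurring ingestion job: embed and store each document.
        for doc in documents:
            self.entries.append((embed(doc), doc))
```

In the actual workflow, the `ingest` step would be a scheduled job that pulls fresh content from the knowledge base and upserts embeddings into Milvus.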
User Query and Response Generation
When a user query is sent to the inference server, it is converted to an embedding using the embedding model. This is the same embedding model used to convert the documents in the knowledge base (SentenceTransformers, in the case of this workflow). The database performs a similarity/semantic search to find the vectors that most closely resemble the user’s intent and provides them to the LLM as enhanced context. Lastly, the LLM is used to generate a full answer that’s streamed to the user. This is all done with ease via LangChain.
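The query path described above is: embed the query, rank the stored vectors by similarity, and prepend the top matches to the LLM prompt. A minimal sketch of the similarity search and context assembly, assuming document embeddings were produced at ingestion time (the real search happens inside Milvus, and the final prompt goes to the Triton-served LLM):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def build_context(query_vec, indexed_docs, top_k=2):
    """indexed_docs: (embedding, text) pairs from ingestion.
    Returns the enhanced context the LLM would receive; a real
    deployment streams the LLM's generated answer back instead."""
    ranked = sorted(indexed_docs, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    context = "\n".join(text for _, text in ranked[:top_k])
    return f"Use the context below to answer the question.\nContext:\n{context}\n"
```

The same embedding model must be used for queries and documents, otherwise the similarity scores are meaningless.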
LangChain allows you to write LLM wrappers for your own custom LLMs, so we have provided a sample wrapper for streaming responses from a TRT-LLM Llama 2 model running on Triton Inference Server (TIS). This wrapper allows us to leverage LangChain’s standard interface for interacting with LLMs while still achieving vast performance speedup from TRT-LLM and scalable and flexible inference from TIS.
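The wrapper pattern is simple: adapt a streaming inference client to the text-in/text-out interface LangChain expects. The workflow's actual wrapper subclasses LangChain's LLM base class and talks to Triton; this stdlib sketch shows only the shape, with `stream_fn` as a hypothetical stand-in for the Triton streaming call.

```python
class TritonLlamaWrapper:
    """Sketch of an LLM wrapper around a streaming inference client.
    `stream_fn` is a hypothetical stand-in: a callable that takes a
    prompt and returns an iterator of generated tokens."""

    def __init__(self, stream_fn):
        self.stream_fn = stream_fn

    def stream(self, prompt):
        # Yield tokens as the server produces them, for responsive UIs.
        yield from self.stream_fn(prompt)

    def __call__(self, prompt):
        # Blocking LangChain-style call: join the streamed tokens.
        return "".join(self.stream(prompt))
```

Keeping both a streaming and a blocking entry point lets the same wrapper serve the chat UI (token by token) and batch-style LangChain chains (full string).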
A sample chatbot web application is provided in the workflow so that you can test the chat system in an interactive manner. Requests to the chat system are wrapped in API calls, so these can be abstracted to other applications.
An additional method of customization in the AI Workflow inference pipeline is via a prompt template. A prompt template is a pre-defined recipe for generating prompts for language models. It may contain instructions, few-shot examples, and context appropriate for a given task. In our example, we prompt our model to generate safe and polite responses.
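In code, a prompt template is just a parameterized string with slots for the retrieved context and the user's question. The wording below is illustrative, not the workflow's actual template:

```python
# Illustrative prompt template: an instruction, a slot for retrieved
# context, and a slot for the user's question.
PROMPT_TEMPLATE = """You are a helpful assistant. Answer politely, and
say "I don't know" if the context is insufficient.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context, question):
    """Fill the template slots before sending the prompt to the LLM."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Instructions like "answer politely" or "say 'I don't know'" are how the template steers the model toward safe, grounded responses.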
The following diagram illustrates the retrieval of documents and generation of responses.
NVIDIA NeMo Guardrails, not shown in this workflow but open-source and available to use, can take prompt-templating to the next level by guiding an entire conversation in more nuanced and complex ways.
NVIDIA NGC is used as model storage in this workflow, but you are free to choose different model storage solutions like MLFlow or AWS SageMaker.
Milvus is an open-source vector database built to power embedding similarity search and AI applications. It makes unstructured data from API calls, PDFs, and other documents more accessible. It’s a cloud-native vector database with storage and computation separated by design.
Prometheus is an open-source monitoring and alerting solution. In this workflow, it stores pipeline performance metrics from Triton, which enables system administrators to understand the health and throughput of the system. While the metrics are available in plain text, Grafana is also used for visualization of the metrics via a dashboard. Some of the available metrics are shown below; depending on the usage metrics, the Triton pods scale automatically.
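Triton exposes its metrics in the Prometheus text exposition format (by default on an HTTP metrics endpoint). A minimal sketch of scraping that plain-text format; the `SAMPLE` payload is illustrative of Triton's `nv_inference_*` metric family, not a real scrape:

```python
# Illustrative sample of Prometheus text-format metrics, in the shape
# of Triton's nv_inference_* counters (values are made up).
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="llama2",version="1"} 512
nv_inference_queue_duration_us{model="llama2",version="1"} 74000
"""

def parse_metrics(text):
    """Parse Prometheus text-format metrics into a dict keyed by
    metric name plus labels."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name_labels, value = line.rsplit(" ", 1)
        metrics[name_labels] = float(value)
    return metrics
```

In practice, Prometheus scrapes this endpoint directly and Grafana visualizes the stored series; a parser like this is only useful for ad-hoc inspection.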