LangChain Playbook
In LLM and retrieval-augmented generation (RAG) workflows, embeddings transform text into vectors that capture semantic meaning. This enables efficient search for contextually relevant documents based on a user’s query. These documents are then provided as additional context to the LLM, enhancing its ability to generate accurate responses.
This playbook goes over how to use the NeMo Retriever Text Embedding NIM (Text Embedding NIM) with LangChain for a RAG workflow using the NVIDIAEmbeddings class. First, it shows how to generate embeddings from a user query. Then, it uses this approach to embed a document, store the embeddings in a vector store, and finally use the embeddings in a LangChain Expression Language (LCEL) chain to help the LLM answer a question about the NVIDIA H200.
Prerequisites
Access to a supported GPU
Docker version 26.1.4 or later
NVIDIA NIM for LLMs or NVIDIA API Catalog
Python version 3.11.9 or later
Jupyter Notebook (optional)
Use the following bash script to install the required packages and dependencies for this playbook.
cat > requirements.txt << "EOF"
faiss_cpu==1.8.0
fastapi==0.104.1
langchain==0.1.11
langchain-community==0.0.25
langchain-core==0.1.29
langchain-nvidia-ai-endpoints==0.2.0
numpy==1.26.4
sentence-transformers==2.2.2
unstructured==0.11.8
EOF
pip install -r requirements.txt
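To confirm that the pinned versions installed correctly, you can query the installed package metadata from Python. This is an optional sanity check, not part of the playbook itself; the package names below match the entries in requirements.txt.
from importlib.metadata import version
# Print the installed versions of the key packages pinned in requirements.txt
for package in ["langchain", "langchain-nvidia-ai-endpoints", "faiss-cpu"]:
    print(package, version(package))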
Next, initialize the LLM for this playbook. There are two options: the NVIDIA API Catalog and NVIDIA NIM for LLMs. For both methods, you access the chat models using the ChatNVIDIA class from the langchain-nvidia-ai-endpoints package, which contains LangChain integrations for building applications with models on NVIDIA NIM for large language models (LLMs). For more information, see the ChatNVIDIA documentation.
Option 1: NVIDIA API Catalog
To use the NVIDIA API Catalog, set the NVIDIA_API_KEY environment variable. See NGC Authentication for information about generating and using an API key.
import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA
os.environ["NVIDIA_API_KEY"] = "nvapi-***"
llm = ChatNVIDIA(model="meta/llama3-8b-instruct")
Option 2: NVIDIA NIM for LLMs
To use NVIDIA NIM for LLMs, follow the instructions in Getting Started. After you have deployed the NIM on your infrastructure, you can access it using the ChatNVIDIA class, as shown in the following example.
from langchain_nvidia_ai_endpoints import ChatNVIDIA
# connect to a LLM NIM running at localhost:8000, specifying a specific model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")
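As an optional sanity check (assuming the LLM NIM is reachable at http://localhost:8000/v1), you can invoke the model directly. The invoke method returns an AIMessage, and its content attribute holds the reply text.
# Send a single-turn request to the deployed NIM and print the reply text
print(llm.invoke("Briefly introduce yourself.").content)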
After the LLM is ready, you can use it with LangChain's ChatPromptTemplate, which is a class for structuring multi-turn conversations and formatting inputs for the language model.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])
chain = prompt | llm | StrOutputParser()
To interact with the LLM in the LangChain Expression Language (LCEL) chain, use the invoke method, as shown in the following example.
print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"}))
This query should produce output similar to the following:
A GPU, or Graphics Processing Unit, is a specialized type of processor designed to quickly render and manipulate graphics and images. A CPU, or Central Processing Unit, is the primary processing component of a computer that performs most of the processing inside the computer. While CPUs are still important for general processing tasks, GPUs are better suited for parallel processing and are often used for tasks involving graphics, gaming, and machine learning.
Use the following example to try another question about NVIDIA’s A100 GPU:
print(chain.invoke({"question": "What does the A in the NVIDIA A100 stand for?"}))
I’m glad to assist you! The “A” in NVIDIA A100 stands for “Ampere,” which is the architecture generation of this GPU model.
Next, ask a question about the NVIDIA H200 GPU. Since the knowledge cutoff for many LLMs is late 2022 or early 2023, the model might not have access to any information after that timeframe.
print(chain.invoke({"question": "How much memory does the NVIDIA H200 have?"}))
I’m sorry, I don’t have that specific information. The NVIDIA H200 doesn’t appear to be a recognized product in NVIDIA’s GPU lineup.
To answer the previous question, build a simple retrieval-augmented generation (RAG) pipeline.
The following example demonstrates how to use LangChain to interact with the Text Embedding NIM using the NVIDIAEmbeddings Python class from the same langchain-nvidia-ai-endpoints package as the first example. First, make sure the Text Embedding NIM is running. This example uses the nvidia/nv-embedqa-e5-v5 Text Embedding NIM; update model accordingly if you are using a different Text Embedding NIM.
Generate embeddings from a user query with the following command:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
# Initialize and connect to a NeMo Retriever Text Embedding NIM (nvidia/nv-embedqa-e5-v5) running at localhost:8000
embedding_model = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5",
                                   base_url="http://localhost:8000/v1")
# Create vector embeddings of the query
embedding_model.embed_query("How much memory does the NVIDIA H200 have?")[:10]
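The previous command prints only the first 10 elements of the embedding. To see the full dimensionality of the vector, you can check its length; this is a small optional check, and the exact dimension depends on the embedding model you deployed.
# Embed the query and inspect the size of the resulting vector
query_embedding = embedding_model.embed_query("How much memory does the NVIDIA H200 have?")
print("Embedding dimension:", len(query_embedding))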
Next, load a PDF of the NVIDIA H200 Datasheet. This document becomes the knowledge base that the LLM uses to retrieve relevant information to answer questions.
LangChain provides a variety of document loaders that load various types of documents (HTML, PDF, code) from many different sources and locations (private S3 buckets, public websites). This example uses the LangChain PyPDFLoader to load the datasheet about the NVIDIA H200 Tensor Core GPU.
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")
document = loader.load()
document[0]
Once documents have been loaded, they are often transformed. One method of transformation is known as chunking, which breaks down large pieces of text, such as the text from a long document, into smaller segments. This technique is valuable because it helps optimize the relevance of the content returned from the vector database.
LangChain provides a variety of document transformers, such as text splitters. This example uses a RecursiveCharacterTextSplitter, which divides a large body of text into smaller chunks based on a specified chunk size. It employs recursion as its core mechanism for splitting text, utilizing a predefined set of characters, such as "\n\n", "\n", " ", and "", to determine where splits should occur. The process begins by attempting to split the text using the first character in the set. If the resulting chunks are still larger than the desired chunk size, it proceeds to the next character in the set and attempts to split again. This process continues until all chunks adhere to the specified maximum chunk size.
There are some nuanced complexities to text splitting since semantically related text, in theory, should be kept together.
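To see this fallback behavior in isolation, you can run the splitter on a short string with a deliberately small chunk size. The following snippet is illustrative only; the chunk_size of 30 and the sample text are arbitrary choices.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Split a short text with a tiny chunk size to observe how the splitter
# falls back from "\n\n" to "\n" to " " when a piece is still too large
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=0)
print(demo_splitter.split_text("First paragraph.\n\nSecond paragraph with several more words in it."))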
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)
document_chunks = text_splitter.split_documents(document)
print("Number of chunks from the document:", len(document_chunks))
The following code snippet demonstrates how to create vector embeddings for the document chunks. This step is not necessary for the RAG pipeline, but it is included here for demonstration purposes. The example uses the embedding model to convert the text chunks into vectors, and displays only the first 10 elements of the vector for the first document chunk to give a glimpse of what these embeddings look like.
# Extract text (page content) from the document chunks
page_contents = [doc.page_content for doc in document_chunks]
# Create vector embeddings from the document
embedding_model.embed_documents(page_contents)[0][:10]
Once the document embeddings are generated, they are stored in a vector store. When a user query is received, you can:
- Embed the query
- Perform a similarity search in the vector store to retrieve the most relevant document embeddings
- Use the retrieved documents to generate a response to the user's query
A vector store takes care of storing the embedded data and performing a vector search. LangChain supports a variety of vector stores; this example uses FAISS. After creating the vector store in the following example, you can also query it directly with a similarity search, as shown below.
from langchain_community.vectorstores import FAISS
vector_store = FAISS.from_documents(document_chunks, embedding=embedding_model)
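Before wiring the retriever into a chain, you can query the vector store directly. The similarity_search method embeds the query and returns the most similar document chunks; the choice of k=3 below is arbitrary.
# Retrieve the three chunks most similar to the question
results = vector_store.similarity_search("How much memory does the NVIDIA H200 have?", k=3)
for doc in results:
    print(doc.page_content[:200], "\n---")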
The next example integrates the vector database with the LLM. A LangChain Expression Language (LCEL) chain combines these components. It then formulates the prompt placeholders (context and question) and pipes them to the LLM connector to answer the original question from the first example (How much memory does the NVIDIA H200 have?) with embeddings from the NVIDIA H200 datasheet document.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_messages([
    ("system",
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Do not hallucinate. Say you don't know if you don't have this information."
        # "Answer the question using only the context"
        "\n\nQuestion:{question}\n\nContext:{context}"
    ),
    ("user", "{question}")
])
chain = (
    {
        "context": vector_store.as_retriever(),
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
print(chain.invoke("How much memory does the NVIDIA H200 have?"))
This query should produce output similar to the following:
The NVIDIA H200 Tensor Core GPU offers 141 gigabytes (GB) of HBM3e memory.
You can also try another question that requires retrieval of information from the NVIDIA H200 datasheet, as shown in the following example.
print(chain.invoke("Is the NVIDIA H200 PCIe or SXM based?"))
Based on the provided document, the NVIDIA H200 can be both PCIe and SXM based.