LangChain Playbook for NeMo Retriever Text Embedding NIM#

This playbook demonstrates how to use NeMo Retriever Text Embedding NIM with LangChain for a RAG workflow and by using the NVIDIAEmbeddings class. First it generates embeddings from a user query. Then it embeds a document and stores the embeddings in a vector store. Finally it uses the embeddings in a LangChain Expression Language chain to help the LLM answer a question.

In LLM and retrieval-augmented generation (RAG) workflows, embeddings transform text into vectors that capture semantic meaning. This enables efficient search for contextually relevant documents based on a user’s query. These documents are then provided as additional context to the LLM, enhancing its ability to generate accurate responses.

Notebook Requirements#

Access to a supported GPU.
Docker version 26.1.4 or later.
NVIDIA NIM for LLMs or NVIDIA API Catalog.
Text Embedding NIM deployed on your infrastructure.
Python version 3.10.12 or later.
Jupyter Notebook (optional).

Setup#

Use the following bash script to install the required packages and dependencies necessary for this playbook.

cat > requirements.txt << "EOF" 
faiss_cpu==1.8.0
fastapi==0.115.6
langchain==0.3.13
langchain-community==0.3.12
langchain-core==0.3.27
langchain-nvidia-ai-endpoints==0.3.7
numpy==1.26.4
sentence-transformers==3.3.1
unstructured==0.16.11
EOF

pip install -r requirements.txt

Use NVIDIA API Catalog or NVIDIA NIM for LLMs#

Next, initialize the LLM for this playbook. This playbook provides two examples for this. One using the NVIDIA API Catalog and one using NVIDIA NIM for LLMs. You can access the chat models for both methods using the ChatNVIDIA class from the langchain-nvidia-ai-endpoints package, which contains LangChain integrations for building applications with models on NVIDIA NIM for large language models (LLMs). For more information, see the ChatNVIDIA documentation.

Option 1: NVIDIA API Catalog#

To use the NVIDIA API Catalog, set the NVIDIA_API_KEY as an environmental variable. See NGC Authentication for information about generating and using an API Key.

import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA

os.environ["NVIDIA_API_KEY"] = "nvapi-***" 
llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")

Option 2: NVIDIA NIM for LLMs#

To use NVIDIA NIM for LLMs, follow the instructions in Getting Started. After you have deployed the NIM on your infrastructure, you can access it using the ChatNVIDIA class, as shown in the following example.

from langchain_nvidia_ai_endpoints import ChatNVIDIA

# connect to a LLM NIM running at localhost:8000, specifying a specific model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct")

After the LLM is ready, you can use it with LangChain’s ChatPromptTemplate, which is a class for structuring multi-turn conversations and formatting inputs for the language model.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a helpful and friendly AI!"
        "Your responses should be concise and no longer than two sentences."
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

To interact with the LLM in the LangChain Expression Language (LCEL) chain, use the invoke method, as shown in the following example.

print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"}))

This query should produce output similar to the following:

A GPU, or Graphics Processing Unit, is a specialized type of processor designed to quickly render and manipulate graphics and images. A CPU, or Central Processing Unit, is the primary processing component of a computer that performs most of the processing inside the computer. While CPUs are still important for general processing tasks, GPUs are better suited for parallel processing and are often used for tasks involving graphics, gaming, and machine learning.

Use the following example to try another question about NVIDIA’s A100 GPU:

print(chain.invoke({"question": "What does the A in the NVIDIA A100 stand for?"}))

I’m glad to assist you! The “A” in NVIDIA A100 stands for “Ampere,” which is the architecture generation of this GPU model.

Next, ask a question about the NVIDIA H200 GPU. Since the knowledge cutoff for many LLMs is late 2022 or early 2023, the model might not have access to any information after that timeframe.

print(chain.invoke({"question": "How much memory does the NVIDIA H200 have?"}))

I’m sorry, I don’t have that specific information. The NVIDIA H200 doesn’t appear to be a recognized product in NVIDIA’s GPU lineup.

Generate Embeddings with Text Embedding NIM#

To answer the previous question, build a simple retrieval-augmented generation (RAG) pipeline.

The following example uses LangChain to interact with Text Embedding NIM by using the NVIDIAEmbeddings Python class from the same langchain-nvidia-ai-endpoints package as the previous example. Before you use this example, verify that Text Embedding NIM is running. This example uses nvidia/llama-3.2-nv-embedqa-1b-v2. Update model if you use a different Text Embedding NIM.

Generate embeddings from a user query with the following command:

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Initialize and connect to a NeMo Retriever Text Embedding NIM (nvidia/llama-3.2-nv-embedqa-1b-v2) running at localhost:8001
embedding_model = NVIDIAEmbeddings(model="nvidia/llama-3.2-nv-embedqa-1b-v2",
                                   base_url="http://localhost:8001/v1")

# Create vector embeddings of the query
embedding_model.embed_query("How much memory does the NVIDIA H200 have?")[:10]

Next, load a PDF of the NVIDIA H200 Datasheet. This document becomes the knowledge base that the LLM uses to retrieve relevant information to answer questions.

LangChain provides a variety of document loaders that load various types of documents (HTML, PDF, code) from many different sources and locations (private s3 buckets, public websites). This example uses the LangChain PyPDFLoader to load the datasheet about the NVIDIA H200 Tensor Core GPU.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")

document = loader.load()
document[0]

Once documents have been loaded, they are often transformed. One method of transformation is known as chunking, which breaks down large pieces of text, such as the text from a long document, into smaller segments. This technique is valuable because it helps optimize the relevance of the content returned from the vector database.

LangChain provides a variety of document transformers, such as text splitters. In this example, we use a RecursiveCharacterTextSplitter. The RecursiveCharacterTextSplitter is divides a large body of text into smaller chunks based on a specified chunk size. It employs recursion as its core mechanism for splitting text, utilizing a predefined set of characters, such as “\n\n”, “\n”, “ “, and “”, to determine where splits should occur. The process begins by attempting to split the text using the first character in the set. If the resulting chunks are still larger than the desired chunk size, it proceeds to the next character in the set and attempts to split again. This process continues until all chunks adhere to the specified maximum chunk size.

There are some nuanced complexities to text splitting since semantically related text, in theory, should be kept together.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

document_chunks = text_splitter.split_documents(document)
print("Number of chunks from the document:", len(document_chunks))

The following code snippet demonstrates how to create vector embeddings for a single document. This step is not necessary for the RAG pipeline, but included here for demonstration purposes. The example uses the embedding model to convert the text chunks into a vectors. It displays only the first 10 elements of this vector from the first document chunk to get a glimpse of what these embeddings look like.

# Extract text (page content) from the document chunks
page_contents = [doc.page_content for doc in document_chunks]

# Create vector embeddings from the document
embedding_model.embed_documents(page_contents)[0][:10]

Store Document Embeddings in the Vector Store#

Once the document embeddings are generated, they are stored in a vector store. When a user query is received, you can:

Embed the query.
Perform a similarity search in the vector store to retrieve the most relevant document embeddings.
Use the retrieved documents to generate a response to the user's query.

A vector store takes care of storing the embedded data and performing a vector search. LangChain provides support for a variety of vector stores, we’ll be using FAISS for this example.

from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(document_chunks, embedding=embedding_model)

Use Text Embedding NIM with LCEL#

The next example integrates the vector database with the LLM. A LangChain Expression Language (LCEL) combines these components together. It then formulates the prompt placeholders (context and question) and pipes them to our LLM connector to answer the original question from the first example (How much memory does the NVIDIA H200 have?) with embeddings from the NVIDIA H200 datasheet document.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", 
        "You are a helpful and friendly AI!"
        "Your responses should be concise and no longer than two sentences."
        "Do not hallucinate. Say you don't know if you don't have this information."
        # "Answer the question using only the context"
        "\n\nQuestion: {question}\n\nContext: {context}"
    ),
    ("user", "{question}")
])

chain = (
    {
        "context": vector_store.as_retriever(),
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("How much memory does the NVIDIA H200 have?"))

This query should produce output similar to the following:

The NVIDIA H200 has 141 gigabytes (GB) of HBM3e memory.