LangChain Playbook#
Reranking is crucial for achieving high accuracy and efficiency in retrieval pipelines. It plays a vital role, particularly when the pipeline incorporates citations from diverse datastores, where each datastore may employ its own unique similarity scoring algorithm. Reranking serves two primary purposes:
- Improving accuracy for individual citations within each datastore.
- Integrating results from multiple datastores to provide a cohesive and relevant set of citations.
This playbook goes over how to use the NeMo Retriever Text Reranking NIM (Text Reranking NIM) with LangChain for document compression and retrieval via the NVIDIARerank
class.
Notebook Requirements#
- Access to a supported GPU
- Docker version 26.1.4 or later
- NVIDIA NIM for LLMs or NVIDIA API Catalog
- Python version 3.10.12 or later
- Jupyter Notebook (optional)
Setup#
Use the following bash script to install the packages and dependencies required for this playbook.
cat > requirements.txt << "EOF"
faiss_cpu==1.8.0
fastapi==0.115.6
langchain==0.3.13
langchain-community==0.3.12
langchain-core==0.3.27
langchain-nvidia-ai-endpoints==0.3.7
numpy==1.26.4
sentence-transformers==3.3.1
unstructured==0.16.11
EOF
pip install -r requirements.txt
Use NVIDIA API Catalog or NVIDIA NIM for LLMs#
Use one of the following examples to initialize the LLM for this playbook. The first example uses the NVIDIA API Catalog; the second uses NVIDIA NIM for LLMs. You can access the chat model for either example using the ChatNVIDIA
class from the langchain-nvidia-ai-endpoints
package, which contains LangChain integrations for building applications with models on NVIDIA NIM for LLMs. For more information, see the ChatNVIDIA documentation.
Option 1: NVIDIA API Catalog#
To use the NVIDIA API Catalog, you’ll need to set the NVIDIA_API_KEY
as an environment variable. See NGC Authentication for information about generating and using an API Key.
import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA
os.environ["NVIDIA_API_KEY"] = "nvapi-***"
llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
Option 2: NVIDIA NIM for LLMs#
To use NVIDIA NIM for LLMs, follow the instructions in Getting Started. After you have deployed the NIM on your infrastructure, use the Python ChatNVIDIA
class to access the NIM, as shown in the following example.
from langchain_nvidia_ai_endpoints import ChatNVIDIA
# Connect to an LLM NIM running at localhost:8000, specifying the model to use
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct")
After the LLM is ready, use LangChain’s ChatPromptTemplate
class to structure multi-turn conversations and format inputs for the language model, as shown in the following example.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])
chain = prompt | llm | StrOutputParser()
To interact with the LLM in the LangChain Expression Language (LCEL) chain, use the invoke
method, as shown in the following example.
print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"}))
This query should produce output similar to the following:
A GPU, or Graphics Processing Unit, is a specialized type of processor designed to quickly render and manipulate graphics and images. A CPU, or Central Processing Unit, is the primary processing component of a computer that performs most of the processing inside the computer. While CPUs are still important for general processing tasks, GPUs are better suited for parallel processing and are often used for tasks involving graphics, gaming, and machine learning.
Next, ask the following question about the NVIDIA H200 GPU. Because the knowledge cutoff for many LLMs is late 2022 or early 2023, the model might not have information from after that timeframe.
print(chain.invoke({"question": "What is the H in the NVIDIA H200?"}))
This query should produce output similar to the following:
I’m sorry, at the moment I don’t have information on what the ‘H’ in the NVIDIA H200 stands for. It could possibly be a model-specific identifier or code. You might want to check NVIDIA’s official documentation or contact them directly for clarification.
Reranking with Text Reranking NIM#
To answer the previous question, build a simple retrieval and reranking pipeline to find the information most relevant to the query.
Load the NVIDIA H200 Datasheet to use in the retrieval pipeline. LangChain provides a variety of document loaders for various types of documents, such as HTML, PDF, and code, from sources and locations such as private S3 buckets and public websites. The following example uses a LangChain PyPDFLoader
to load a datasheet about the NVIDIA H200 Tensor Core GPU.
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")
document = loader.load()
document[0]
Once documents have been loaded, they are often transformed. One method of transformation is known as chunking, which breaks down large pieces of text, such as a long document, into smaller segments. This technique is valuable because it helps optimize the relevance of the content returned from the vector database.
LangChain provides a variety of document transformers, such as text splitters. The following example uses a RecursiveCharacterTextSplitter
. The RecursiveCharacterTextSplitter
divides a large body of text into smaller chunks based on a specified chunk size. It employs recursion as its core mechanism for splitting text, using a predefined set of separator characters, such as "\n\n", "\n", " ", and "", to determine where splits should occur. The process begins by attempting to split the text on the first separator in the set. If the resulting chunks are still larger than the desired chunk size, it proceeds to the next separator and splits again. This process continues until all chunks adhere to the specified maximum chunk size.
There are some nuanced complexities to text splitting since, in theory, semantically related text should be kept together.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=100,
separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)
document_chunks = text_splitter.split_documents(document)
print("Number of chunks from the document:", len(document_chunks))
The following example uses LangChain to interact with Text Reranking NIM through the NVIDIARerank class from the same langchain-nvidia-ai-endpoints package as the previous example. Before you use this example, verify that NeMo Retriever Text Reranking NIM is running. This example uses nvidia/llama-3.2-nv-rerankqa-1b-v2. Update model if you use a different Text Reranking NIM.
from langchain_nvidia_ai_endpoints import NVIDIARerank
query = "What is the H in the NVIDIA H200?"
# Initialize and connect to a NeMo Retriever Text Reranking NIM (nvidia/llama-3.2-nv-rerankqa-1b-v2) running at localhost:8002
reranker = NVIDIARerank(model="nvidia/llama-3.2-nv-rerankqa-1b-v2",
base_url="http://localhost:8002/v1")
reranked_chunks = reranker.compress_documents(query=query,
documents=document_chunks)
The following code shows the results of using Text Reranking NIM to rerank the document chunks, printing each chunk's relevance score for the query followed by its content.
for chunk in reranked_chunks:
    # Access the metadata of the document chunk
    metadata = chunk.metadata
    # Get the page content
    page_content = chunk.page_content
    # Print the relevance score if it exists in the metadata, followed by the page content
    if 'relevance_score' in metadata:
        print(f"Relevance Score: {metadata['relevance_score']}, Page Content: {page_content}...")
        print(f"{'-' * 100}")
This command should produce output similar to the following:
Relevance Score: 11.390625, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 1
NVIDIA H200 Tensor Core GPU Supercharging AI and HPC workloads. Higher Performance With Larger, Faster Memory The NVIDIA H200 Tensor Core GPU supercharges generative AI and high- performance computing (HPC) workloads with game-changing performance and memory capabilities. Based on the NVIDIA Hopper™ architecture, the NVIDIA H200 is the first GPU to offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)— that’s nearly double the capacity of the NVIDIA H100 Tensor Core GPU with 1.4X more memory bandwidth. The H200’s larger and faster memory accelerates generative AI and large language models, while advancing scientific computing for HPC workloads with better energy efficiency and lower total cost of ownership. Unlock Insights With High-Performance LLM Inference In the ever-evolving landscape of AI, businesses rely on large language models to…
Relevance Score: 10.25, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 3
AI Acceleration for Mainstream Enterprise Servers With
H200 NVL NVIDIA H200 NVL is ideal for lower-power, air-cooled enterprise rack designs that require flexible configurations, delivering acceleration for every AI and HPC workload regardless of size. With up to four GPUs connected by NVIDIA NVLink™ and a 1.5X memory increase, large language model (LLM) inference can be accelerated up to 1.7X and HPC applications achieve up to 1.3X more performance over the H100 NVL. Enterprise-Ready: AI Software Streamlines Development and Deployment NVIDIA H200 NVL comes with a five-year NVIDIA AI Enterprise subscription and simplifies the way you build an enterprise AI-ready platform. H200 accelerates AI development and deployment for production-ready generative AI solutions, including computer vision, speech AI, retrieval augmented generation (RAG), and more. NVIDIA AI Enterprise includes NVIDIA NIM™, a set of easy-to-use…
Relevance Score: 9.109375, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 2
Supercharge High-Performance Computing Memory bandwidth is crucial for HPC applications, as it enables faster data transfer and reduces complex processing bottlenecks. For memory-intensive HPC applications like simulations, scientific research, and artificial intelligence, the H200’s higher memory bandwidth ensures that data can be accessed and manipulated efficiently, leading to 110X faster time to results. Preliminary specifications. May be subject to change. HPC MILC- dataset NERSC Apex Medium | HGX H200 4-GPU | dual Sapphire Rapids 8480 HPC Apps- CP2K: dataset H2O-32-RI-dRPA-96points | GROMACS: dataset STMV | ICON: dataset r2b5 | MILC: dataset NERSC Apex Medium | Chroma: dataset HMC Medium | Quantum Espresso: dataset AUSURF112 | 1x H100 SXM | 1x H200 SXM. Reduce Energy and TCO With the introduction of H200, energy efficiency and TCO reach new levels. This…
Relevance Score: 9.109375, Page Content: NVIDIA MGX™ H200 NVL partner and NVIDIA-Certified Systems with up to 8 GPUs
NVIDIA AI Enterprise Add-on Included 1. Preliminary specifications. May be subject to change. 2. With sparsity. Ready to Get Started? To learn more about the NVIDIA H200 Tensor Core GPU, visit nvidia.com/h200 © 2024 NVIDIA Corporation and affiliates. All rights reserved. NVIDIA, the NVIDIA logo, HGX, Hopper, MGX, NIM, NVIDIA-Certified Systems, and NVLink are trademarks and/or registered trademarks of NVIDIA Corporation and affiliates in the U.S. and other countries. Other company and product names may be trademarks of the respective owners with which they are associated. 3512650. NOV24…
Relevance Score: 5.125, Page Content: With the introduction of H200, energy efficiency and TCO reach new levels.
This cutting-edge technology offers unparalleled performance, all within the same power profile as the H100 Tensor Core GPU. AI factories and supercomputing systems that are not only faster but also more eco-friendly deliver an economic edge that propels the AI and scientific communities forward. Preliminary specifications. May be subject to change. Llama2 70B: ISL 2K, OSL 128 | Throughput | H100 SXM 1x GPU BS 8 | H200 SXM 1x GPU BS 32…
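If you only need the best few chunks, you can also cap how many documents the reranker returns by setting top_n when you construct NVIDIARerank. The following is a minimal sketch under the package versions pinned in the Setup section; the value 3 is only an example.
# Sketch: keep only the three highest-scoring chunks from the reranker
reranker_top3 = NVIDIARerank(model="nvidia/llama-3.2-nv-rerankqa-1b-v2",
                             base_url="http://localhost:8002/v1",
                             top_n=3)
top_chunks = reranker_top3.compress_documents(query=query, documents=document_chunks)
print("Chunks returned:", len(top_chunks))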
Use Text Reranking NIM with LCEL#
One challenge with retrieval is that usually you don’t know the specific queries your document storage system will face when you ingest data into the system. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is a technique to improve retrieval systems by:
- Addressing the challenge of handling unknown future queries when ingesting data.
- Reducing irrelevant text in retrieved documents to improve LLM response quality and efficiency.
- Compressing individual documents and filtering out irrelevant ones based on the query context.
The Contextual Compression Retriever requires:
- A base retriever
- A document compressor
It works by:
- Passing queries to the base retriever
- Sending retrieved documents through the Document Compressor
- Shortening the list of documents by reducing content or removing irrelevant documents entirely
The following example demonstrates how to use the Text Reranking NIM as a document compressor with LangChain.
First, initialize an embedding model to embed the query and document chunks. There are two options: one uses the NVIDIA API Catalog, and the other uses the Text Embedding NIM.
Option 1: NVIDIA API Catalog#
To use the NVIDIA API Catalog, set the NVIDIA_API_KEY
as an environment variable if you didn't already set it when initializing the LLM in the first part of this playbook. See NGC Authentication for information about generating and using an API Key.
import os
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
# Set the NVIDIA_API_KEY environment variable if it hasn't been set already
# os.environ["NVIDIA_API_KEY"] = "nvapi-***"
embedding_model = NVIDIAEmbeddings(model="nvidia/llama-3.2-nv-embedqa-1b-v2")
Option 2: Text Embedding NIM#
To use Text Embedding NIM, follow the instructions in Getting Started. After you have deployed the NIM on your infrastructure, you can access it using the NVIDIAEmbeddings
class, as shown in the following example.
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
# Initialize and connect to a NeMo Retriever Text Embedding NIM running at localhost:8001
embedding_model = NVIDIAEmbeddings(model="nvidia/llama-3.2-nv-embedqa-1b-v2",
base_url="http://localhost:8001/v1")
Next, initialize a simple vector store retriever and store the document chunks of the NVIDIA H200 datasheet. LangChain supports a wide selection of vector stores; this example uses FAISS.
from langchain_community.vectorstores import FAISS
retriever = FAISS.from_documents(document_chunks, embedding=embedding_model).as_retriever(search_kwargs={"k": 10})
Wrap the base retriever with a ContextualCompressionRetriever class, using NVIDIARerank as a document compressor, by running the following code. This example uses nvidia/llama-3.2-nv-rerankqa-1b-v2. Update model if you use a different Text Reranking NIM.
from langchain.retrievers import ContextualCompressionRetriever
from langchain_nvidia_ai_endpoints import NVIDIARerank
# Re-initialize and connect to a NeMo Retriever Text Reranking NIM running at localhost:8002
compressor = NVIDIARerank(model="nvidia/llama-3.2-nv-rerankqa-1b-v2",
base_url="http://localhost:8002/v1")
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=retriever
)
Next, ask the LLM the same question about the “H” in the NVIDIA H200, this time using the retrieval and reranking pipeline.
from langchain.chains import RetrievalQA
query = "What is the H in the NVIDIA H200?"
chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)
chain.invoke(query)
This command should produce output similar to the following:
{‘query’: ‘What is the H in the NVIDIA H200?’,
‘result’: ‘Based on the provided context, the “H” in NVIDIA H200 refers to the NVIDIA Hopper architecture. The text states: “Based on the NVIDIA Hopper architecture, the NVIDIA H200 is the first GPU to offer 141 gigabytes (GB) of HBM3e memory”.’}
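RetrievalQA answers the question, but you can also express the same flow directly in LCEL by piping the compression retriever into a prompt and the LLM. The following is a minimal sketch; the rag_prompt wording and the format_docs helper are illustrative additions, not part of the original playbook code.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Illustrative RAG prompt; adjust the wording for your use case
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "Answer the question using only the provided context. "
        "Say you don't know if the context does not contain the answer."
    )),
    ("user", "Context:\n{context}\n\nQuestion: {question}")
])

def format_docs(docs):
    # Join the reranked chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the H in the NVIDIA H200?"))
This produces the same grounded answer as the RetrievalQA chain while keeping the whole pipeline composable with other LCEL runnables.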