
LangChain Playbook

Reranking is crucial for achieving high accuracy and efficiency in retrieval pipelines. It plays a vital role, particularly when the pipeline incorporates citations from diverse datastores, where each datastore may employ its own unique similarity scoring algorithm. Reranking serves two primary purposes:

  1. Improving accuracy for individual citations within each datastore.
  2. Integrating results from multiple datastores to provide a cohesive and relevant set of citations.
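
As a concrete illustration of the second purpose, the following is a minimal sketch (not part of this playbook's pipeline) of merging candidates from two datastores and letting a single reranker produce one consistent ordering. The documents, the query, and the locally running Text Reranking NIM endpoint are assumptions for illustration only.

from langchain_core.documents import Document
from langchain_nvidia_ai_endpoints import NVIDIARerank

# Hypothetical candidates pulled from two different datastores, each of which
# scored them with its own (mutually incomparable) similarity metric.
candidates_from_store_a = [Document(page_content="...chunk retrieved from datastore A...")]
candidates_from_store_b = [Document(page_content="...chunk retrieved from datastore B...")]

query = "An example question about your data"

# A single reranker scores every candidate against the query, producing one
# consistent ordering across both datastores.
reranker = NVIDIARerank(
    model="nvidia/nv-rerankqa-mistral-4b-v3",
    base_url="http://localhost:8000/v1",  # assumes a running Text Reranking NIM
)
merged = reranker.compress_documents(
    query=query,
    documents=candidates_from_store_a + candidates_from_store_b,
)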

This playbook goes over how to use the NeMo Retriever Text Reranking NIM (Text Reranking NIM) with LangChain for document compression and retrieval via the NVIDIARerank class.

Use the following bash script to install the required packages and dependencies necessary for this playbook.


cat > requirements.txt << "EOF"
faiss_cpu==1.8.0
fastapi==0.104.1
langchain==0.1.11
langchain-community==0.0.25
langchain-core==0.1.29
langchain-nvidia-ai-endpoints==0.1.4
numpy==1.26.4
sentence-transformers==2.2.2
unstructured==0.11.8
EOF

pip install -r requirements.txt

Use one of the following examples to initialize the LLM for this playbook. The first example uses the NVIDIA API Catalog; the second uses NVIDIA NIM for LLMs. You can access the chat model for either example using the ChatNVIDIA class from the langchain-nvidia-ai-endpoints package, which contains LangChain integrations for building applications with models on NVIDIA NIM for LLMs. For more information, see the ChatNVIDIA documentation.

Option 1: NVIDIA API Catalog

To use the NVIDIA API Catalog, you’ll need to set the NVIDIA_API_KEY as an environmental variable. See NGC Authentication for information about generating and using an API Key.


import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA

os.environ["NVIDIA_API_KEY"] = "nvapi-***"

llm = ChatNVIDIA(model="meta/llama3-8b-instruct")

Option 2: NVIDIA NIM for LLMs

To use NVIDIA NIM for LLMs, follow the instructions in Getting Started. After you have deployed the NIM on your infrastructure, use the Python ChatNVIDIA class to access the NIM, as shown in the following example.


from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Connect to an LLM NIM running at localhost:8000, specifying the model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")

After the LLM is ready, use LangChain’s ChatPromptTemplate class to structure multi-turn conversations and format inputs for the language model, as shown in the following example.


from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

To interact with the LLM in the LangChain Expression Language (LCEL) chain, use the invoke method, as shown in the following example.


print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"}))

This query should produce output similar to the following:

A GPU, or Graphics Processing Unit, is a specialized type of processor designed to quickly render and manipulate graphics and images. A CPU, or Central Processing Unit, is the primary processing component of a computer that performs most of the processing inside the computer. While CPUs are still important for general processing tasks, GPUs are better suited for parallel processing and are often used for tasks involving graphics, gaming, and machine learning.

Next, ask the following question about the NVIDIA H200 GPU. Because the knowledge cutoff for many LLMs is late 2022 or early 2023, the model might not have access to information after that timeframe.


print(chain.invoke({"question": "What does the H in the NVIDIA H200 stand for?"}))

This query should produce output similar to the following:

I’m sorry, at the moment I don’t have information on what the ‘H’ in the NVIDIA H200 stands for. It could possibly be a model-specific identifier or code. You might want to check NVIDIA’s official documentation or contact them directly for clarification.

To answer the previous question, build a simple retrieval and reranking pipeline to find the information most relevant to the query.

Load the NVIDIA H200 Datasheet to use in the retrieval pipeline. LangChain provides a variety of document loaders for various types of documents, such as HTML, PDF, and code, from sources and locations such as private S3 buckets and public websites. The following example uses a LangChain PyPDFLoader to load a datasheet about the NVIDIA H200 Tensor Core GPU.


from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")
document = loader.load()
document[0]

Once documents have been loaded, they are often transformed. One method of transformation is known as chunking, which breaks down large pieces of text, such as a long document, into smaller segments. This technique is valuable because it helps optimize the relevance of the content returned from the vector database.

LangChain provides a variety of document transformers, such as text splitters. The following example uses a RecursiveCharacterTextSplitter, which divides a large body of text into smaller chunks based on a specified chunk size. It employs recursion as its core mechanism for splitting text, utilizing a predefined set of characters, such as "\n\n", "\n", " ", and "", to determine where splits should occur. The process begins by attempting to split the text using the first character in the set. If the resulting chunks are still larger than the desired chunk size, it proceeds to the next character in the set and attempts to split again. This process continues until all chunks adhere to the specified maximum chunk size.

There are some nuanced complexities to text splitting since, in theory, semantically related text should be kept together.


from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)
document_chunks = text_splitter.split_documents(document)

print("Number of chunks from the document:", len(document_chunks))
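
To see the recursive separator fallback described above in isolation, here is a small illustrative sketch; the toy text, chunk_size, and separator list are chosen purely for demonstration and are not part of the playbook's pipeline.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Toy example (illustrative values): a chunk_size far smaller than the playbook's 500
# forces the splitter to fall back from "\n\n" to "\n" and then to " ".
toy_splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],
)
toy_text = (
    "The NVIDIA H200 is based on the Hopper architecture.\n\n"
    "It offers 141 GB of HBM3e memory."
)
for piece in toy_splitter.split_text(toy_text):
    print(repr(piece))

Each printed piece stays within the 40-character limit, with splits landing on the largest separator that keeps a chunk under that limit.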

The following example shows how to use LangChain to interact with the Text Reranking NIM through the NVIDIARerank class, from the same langchain-nvidia-ai-endpoints package as the earlier examples. Be sure that the NeMo Retriever Text Reranking NIM is running before this step. nvidia/nv-rerankqa-mistral-4b-v3 is used in the following example; update the model name accordingly if you use a different Text Reranking NIM.


from langchain_nvidia_ai_endpoints import NVIDIARerank

query = "What does the H in the NVIDIA H200 stand for?"

# Initialize and connect to a NeMo Retriever Text Reranking NIM running at localhost:8000
reranker = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3", base_url="http://localhost:8000/v1")

reranked_chunks = reranker.compress_documents(query=query, documents=document_chunks)

The following example prints the results of using the Text Reranking NIM to rerank the document chunks, along with each chunk's relevance score for the query.


for chunk in reranked_chunks:
    # Access the metadata of the document
    metadata = chunk.metadata

    # Get the page content
    page_content = chunk.page_content

    # Print the relevance score if it exists in the metadata, followed by the page content
    if 'relevance_score' in metadata:
        print(f"Relevance Score: {metadata['relevance_score']}, Page Content: {page_content}...")
        print(f"{'-' * 100}")

This command should produce output similar to the following:

Relevance Score: 2.97265625, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 1NVIDIA H200 Tensor Core GPU Supercharging AI and HPC workloads. Higher Performance With Larger, Faster Memory The NVIDIA H200 Tensor Core GPU supercharges generative AI and high- performance computing (HPC) workloads with game-changing performance and memory capabilities. Based on the NVIDIA Hopper™ architecture, the NVIDIA H200 is the first GPU to offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)—…


Relevance Score: 0.77294921875, Page Content: Ready to get started? To learn more about the NVIDIA H200 Tensor Core GPU, visit nvidia.com/h200 © 2024 NVIDIA Corporation and affiliates. All rights reserved. NVIDIA, the NVIDIA logo, HGX, Hopper, MGX, NVIDIA- Certified Systems, and NVLink are trademarks and/or registered trademarks of NVIDIA Corporation and affiliates in the U.S. and other countries. Other company and product names may be trademarks of the respective owners with…


Relevance Score: 0.473388671875, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 2Supercharge High-Performance Computing Memory bandwidth is crucial for HPC applications, as it enables faster data transfer and reduces complex processing bottlenecks. For memory-intensive HPC applications like simulations, scientific research, and artificial intelligence, the H200’s higher memory bandwidth ensures that data can be accessed and manipulated efficiently, leading to 110X faster time to results….


Relevance Score: -1.4150390625, Page Content: Unleashing AI Acceleration for Mainstream Enterprise Servers With H200 NVL The NVIDIA H200 NVL is the ideal choice for customers with space constraints within the data center, delivering acceleration for every AI and HPC workload regardless of size. With a 1.5X memory increase and a 1.2X bandwidth increase over the previous generation, customers can fine-tune LLMs within a few hours and experience LLM inference 1.8X faster….


Relevance Score: -1.4423828125, Page Content: offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)— that’s nearly double the capacity of the NVIDIA H100 Tensor Core GPU with 1.4X more memory bandwidth. The H200’s larger and faster memory accelerates generative AI and large language models, while advancing scientific computing for HPC workloads with better energy efficiency and lower total cost of ownership. Unlock Insights With High-Performance LLM Inference…
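
If you only need the few most relevant chunks rather than the full reranked list, you can limit how many documents the reranker returns. The following is a minimal sketch that reuses the query and document_chunks defined above, assuming your version of langchain-nvidia-ai-endpoints exposes the top_n parameter on NVIDIARerank.

from langchain_nvidia_ai_endpoints import NVIDIARerank

# Assumption: top_n caps the number of reranked documents that are returned.
reranker_top3 = NVIDIARerank(
    model="nvidia/nv-rerankqa-mistral-4b-v3",
    base_url="http://localhost:8000/v1",
    top_n=3,
)
top_chunks = reranker_top3.compress_documents(query=query, documents=document_chunks)
print("Chunks returned:", len(top_chunks))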

One challenge with retrieval is that usually you don’t know the specific queries your document storage system will face when you ingest data into the system. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is a technique to improve retrieval systems by:

  1. Addressing the challenge of handling unknown future queries when ingesting data.
  2. Reducing irrelevant text in retrieved documents to improve LLM response quality and efficiency.
  3. Compressing individual documents and filtering out irrelevant ones based on the query context.

The Contextual Compression Retriever requires:

  • A base retriever

  • A document compressor

It works by:

  1. Passing queries to the base retriever
  2. Sending retrieved documents through the Document Compressor
  3. Shortening the list of documents by reducing content or removing irrelevant documents entirely

The following example demonstrates how to use the Text Reranking NIM as a document compressor with LangChain.

First, initialize an embedding model to embed the query and document chunks. There are two options: the first uses the NVIDIA API Catalog and the second uses the NeMo Retriever Text Embedding NIM.

Option 1: NVIDIA API Catalog

To use the NVIDIA API Catalog, set the NVIDIA_API_KEY environment variable if you didn’t already set it when initializing the LLM in the first part of this playbook. See NGC Authentication for information about generating and using an API Key.


import os
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Set the NVIDIA_API_KEY environment variable if it hasn't been set already
# os.environ["NVIDIA_API_KEY"] = "nvapi-***"

embedding_model = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5")

Option 2: Text Embedding NIM

To use Text Embedding NIM, follow the instructions in Getting Started. After you have deployed the NIM on your infrastructure, you can access it using the NVIDIAEmbeddings class, as shown in the following example.


from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Initialize and connect to a NeMo Retriever Text Embedding NIM running at localhost:8000
embedding_model = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5", base_url="http://localhost:8000/v1")

Next, initialize a simple vector store retriever and store the document chunks of the NVIDIA H200 datasheet. LangChain supports a wide selection of vector stores; this example uses FAISS.


from langchain_community.vectorstores import FAISS

retriever = FAISS.from_documents(document_chunks, embedding=embedding_model).as_retriever(search_kwargs={"k": 10})

Wrap the base retriever with a ContextualCompressionRetriever class, using NVIDIARerank as the document compressor, as shown in the following example. As previously mentioned, nvidia/nv-rerankqa-mistral-4b-v3 is used for this step; update the model name accordingly if you use a different Text Reranking NIM.


from langchain.retrievers import ContextualCompressionRetriever
from langchain_nvidia_ai_endpoints import NVIDIARerank

# Re-initialize and connect to a NeMo Retriever Text Reranking NIM running at localhost:8000
compressor = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3", base_url="http://localhost:8000/v1")

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
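
Before wiring the compression retriever into a chain, you can optionally inspect what it returns for the query; this is a quick sanity check using only the objects defined above.

query = "What does the H in the NVIDIA H200 stand for?"

# The compression retriever pulls the top 10 chunks from FAISS, then the reranker
# scores them and returns a shortened, reordered list.
compressed_docs = compression_retriever.invoke(query)
for doc in compressed_docs:
    print(doc.metadata.get("relevance_score"), doc.page_content[:80])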

Next, ask the LLM the same question about the “H” in the NVIDIA H200, this time using the retrieval and reranking pipeline.


from langchain.chains import RetrievalQA

query = "What does the H in the NVIDIA H200 stand for?"

chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)
chain.invoke(query)

This command should produce output similar to the following:

{'query': 'What does the H in the NVIDIA H200 stand for?',
 'result': 'Based on the provided context, the "H" in NVIDIA H200 likely stands for "Hopper," which is the name of the architecture on which the NVIDIA H200 Tensor Core GPU is based. This is inferred from the datasheet referring to the NVIDIA Hopper™ architecture.'}
