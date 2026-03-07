LangChain Playbook for NVIDIA NeMo Retriever Reranking NIM#

This playbook demonstrates how to use NVIDIA NeMo Retriever Reranking NIM with LangChain for document compression and retrieval by using the NVIDIARerank class.

Reranking is crucial for achieving high accuracy and efficiency in retrieval pipelines. It matters most when the pipeline incorporates citations from diverse datastores, each of which might use its own similarity scoring algorithm, so the raw scores are not directly comparable. Reranking serves the following primary purposes:

Improve accuracy for individual citations within each datastore.

Integrate results from multiple datastores to provide a cohesive and relevant set of citations.
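The second purpose can be illustrated with a minimal sketch. It assumes two hypothetical datastores whose native scores are on different scales, and it uses a toy token-overlap function, `toy_relevance`, as a stand-in for a real reranking model. None of these names come from the NeMo Retriever Reranking NIM API; the point is only that one shared scoring function makes results from different sources comparable.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    source: str
    native_score: float  # scale depends on the datastore, so not comparable across sources


def toy_relevance(query: str, text: str) -> float:
    """Hypothetical stand-in for a reranking model: fraction of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens)


def merge_and_rerank(query: str, *result_lists: list[Hit]) -> list[tuple[float, Hit]]:
    """Pool hits from every datastore and order them by one shared relevance score."""
    pooled = [hit for hits in result_lists for hit in hits]
    scored = [(toy_relevance(query, hit.text), hit) for hit in pooled]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical results: a similarity score in [0, 1] versus an unbounded keyword score.
vector_hits = [
    Hit("the h200 is based on the hopper architecture", "vector_store", 0.62),
    Hit("gpu memory bandwidth matters for hpc", "vector_store", 0.81),
]
keyword_hits = [
    Hit("hopper architecture overview", "keyword_index", 14.2),
]

ranked = merge_and_rerank("hopper architecture h200", vector_hits, keyword_hits)
for score, hit in ranked:
    print(f"{score:.2f}  [{hit.source}]  {hit.text}")
```

Sorting the pooled hits by `native_score` instead would put the keyword result first simply because its scale is larger, which is the problem a shared reranking score avoids.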

Notebook Requirements#

Access to a supported GPU

Docker version 26.1.4 or later

NVIDIA NIM for LLMs or NVIDIA API Catalog

NeMo Retriever Reranking NIM deployed on your infrastructure

Python version 3.10.12 or later

Jupyter Notebook (optional)

Setup#

Use the following bash script to install the packages and dependencies required for this playbook.

cat > requirements.txt << "EOF"
faiss_cpu==1.8.0
fastapi==0.115.6
langchain==0.3.13
langchain-community==0.3.12
langchain-core==0.3.27
langchain-nvidia-ai-endpoints==0.3.7
numpy==1.26.4
sentence-transformers==3.3.1
unstructured==0.16.11
EOF
pip install -r requirements.txt
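Before continuing, it can help to confirm that the installed packages match the pinned versions above. The following is a minimal sketch and not part of the original playbook: `parse_pins` and `check_pins` are hypothetical helper names, and the parser assumes the simple `name==version` form used in this requirements file.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Parse 'name==version' lines into a dict, skipping blank lines and comments."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return a human-readable status for each pinned package."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"installed {installed}, expected {wanted}"
    return report


# Two of the pins from requirements.txt, used here as sample input.
sample = """
langchain==0.3.13
langchain-nvidia-ai-endpoints==0.3.7
"""
pins = parse_pins(sample)
for package, status in check_pins(pins).items():
    print(f"{package}: {status}")
```

To check every pin, pass the contents of `requirements.txt` to `parse_pins` instead of the sample string.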

Use NVIDIA API Catalog or NVIDIA NIM for LLMs#

Use one of the following examples to initialize the LLM for this playbook. The first example uses the NVIDIA API Catalog; the second uses NVIDIA NIM for LLMs. In either case, you access the chat model through the ChatNVIDIA class from the langchain-nvidia-ai-endpoints package, which contains LangChain integrations for building applications with models on NVIDIA NIM for LLMs. For more information, see the ChatNVIDIA documentation.

Option 1: NVIDIA API Catalog#

To use the NVIDIA API Catalog, set NVIDIA_API_KEY as an environment variable. See NGC Authentication for information about generating and using an API key.

import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA
# Replace the placeholder with your own API key
os.environ["NVIDIA_API_KEY"] = "nvapi-***"
llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")

Option 2: NVIDIA NIM for LLMs#

To use NVIDIA NIM for LLMs, follow the instructions in Getting Started. After you have deployed the NIM on your infrastructure, use the ChatNVIDIA class to access the NIM, as shown in the following example.

from langchain_nvidia_ai_endpoints import ChatNVIDIA
# Connect to an LLM NIM running at localhost:8000, specifying a specific model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct")

After the LLM is ready, use LangChain’s ChatPromptTemplate class to structure multi-turn conversations and format inputs for the language model, as shown in the following example.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_messages([
    ("system", (
        # Trailing spaces keep the adjacent string literals from running together
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])
chain = prompt | llm | StrOutputParser()

To interact with the LLM in the LangChain Expression Language (LCEL) chain, use the invoke method, as shown in the following example.

print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"}))

This query should produce output similar to the following:

A GPU, or Graphics Processing Unit, is a specialized type of processor designed to quickly render and manipulate graphics and images. A CPU, or Central Processing Unit, is the primary processing component of a computer that performs most of the processing inside the computer. While CPUs are still important for general processing tasks, GPUs are better suited for parallel processing and are often used for tasks involving graphics, gaming, and machine learning.

Next, ask the following question about the NVIDIA H200 GPU. Because the knowledge cutoff for many LLMs is late 2022 or early 2023, the model might not have access to information after that timeframe.

print(chain.invoke({"question": "What is the H in the NVIDIA H200?"}))

The response should be similar to the following:

I’m sorry, at the moment I don’t have information on what the ‘H’ in the NVIDIA H200 stands for. It could possibly be a model-specific identifier or code. You might want to check NVIDIA’s official documentation or contact them directly for clarification.
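The `prompt | llm | StrOutputParser()` expression used earlier in this section chains three steps so that the output of each step feeds the next. The following is a minimal sketch of that pipe-composition idea in plain Python. It is not LangChain's implementation: `Step`, `fake_prompt`, `fake_llm`, and `fake_parser` are hypothetical stand-ins, and the fake model returns a canned reply instead of calling an LLM.

```python
from typing import Any, Callable


class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Compose left to right: run this step, then pass its result to the next one.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Stand-in for a prompt template: fill the user's question into a message list.
fake_prompt = Step(lambda inputs: [
    ("system", "You are a helpful and friendly AI!"),
    ("user", inputs["question"]),
])

# Stand-in for a chat model: returns a dict that mimics a structured reply.
fake_llm = Step(lambda messages: {"content": f"(canned reply to: {messages[-1][1]})"})

# Stand-in for an output parser: extract the plain string from the reply.
fake_parser = Step(lambda reply: reply["content"])

sketch_chain = fake_prompt | fake_llm | fake_parser
print(sketch_chain.invoke({"question": "What is a GPU?"}))
```

Because each `|` returns a new `Step`, the composed object exposes the same `invoke` method as its parts, which is why a whole chain can be called like a single component.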

Reranking with NeMo Retriever Reranking NIM#

To answer the previous question, build a simple retrieval and reranking pipeline to find the piece of information most relevant to the query.

Load the NVIDIA H200 Datasheet to use in the retrieval pipeline. LangChain provides a variety of document loaders for various types of documents, such as HTML, PDF, and code, from sources and locations such as private S3 buckets and public websites. The following example uses a LangChain PyPDFLoader to load a datasheet about the NVIDIA H200 Tensor Core GPU.

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")
document = loader.load()
document[0]

After documents are loaded, they are often transformed. One method of transformation is chunking, which breaks down large pieces of text, such as a long document, into smaller segments. This technique is valuable because it helps optimize the relevance of the content returned from the vector database.

LangChain provides a variety of document transformers, such as text splitters. The following example uses a RecursiveCharacterTextSplitter, which divides a large body of text into smaller chunks based on a specified chunk size. It splits recursively, using an ordered list of separators, such as “\n\n”, “\n”, “ ”, and “”, to determine where splits should occur. The process begins by attempting to split the text using the first separator in the list. If the resulting chunks are still larger than the desired chunk size, it proceeds to the next separator in the list and attempts to split again. This process continues until all chunks adhere to the specified maximum chunk size. Text splitting involves some nuance because, ideally, semantically related text should be kept together.

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)
document_chunks = text_splitter.split_documents(document)
print("Number of chunks from the document:", len(document_chunks))

The following example uses LangChain to interact with NeMo Retriever Reranking NIM by using the NVIDIARerank class from the same langchain-nvidia-ai-endpoints package as the previous example. Before you use this example, verify that NeMo Retriever Text Reranking NIM is running. This example uses nvidia/llama-nemotron-rerank-1b-v2. Update model if you use a different NeMo Retriever Reranking NIM.

from langchain_nvidia_ai_endpoints import NVIDIARerank
query = "What is the H in the NVIDIA H200?"
# Initialize and connect to a NeMo Retriever Text Reranking NIM (nvidia/llama-nemotron-rerank-1b-v2) running at localhost:8002
reranker = NVIDIARerank(model="nvidia/llama-nemotron-rerank-1b-v2", base_url="http://localhost:8002/v1")
reranked_chunks = reranker.compress_documents(query=query, documents=document_chunks)

The following example prints the reranked document chunks, which NeMo Retriever Reranking NIM orders by each chunk's relevance score for the query.

for chunk in reranked_chunks:
    # Access the metadata of the document
    metadata = chunk.metadata
    # Get the page content
    page_content = chunk.page_content
    # Print the relevance score if it exists in the metadata, followed by page content
    if 'relevance_score' in metadata:
        print(f"Relevance Score: {metadata['relevance_score']}, Page Content: {page_content}...")
    print(f"{'-' * 100}")

This command should produce output similar to the following:

Relevance Score: 11.390625, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 1 NVIDIA H200 Tensor Core GPU Supercharging AI and HPC workloads.
Higher Performance With Larger, Faster Memory The NVIDIA H200 Tensor Core GPU supercharges generative AI and high- performance computing (HPC) workloads with game-changing performance and memory capabilities. Based on the NVIDIA Hopper™ architecture, the NVIDIA H200 is the first GPU to offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)— that’s nearly double the capacity of the NVIDIA H100 Tensor Core GPU with 1.4X more memory bandwidth. The H200’s larger and faster memory accelerates generative AI and large language models, while advancing scientific computing for HPC workloads with better energy efficiency and lower total cost of ownership. Unlock Insights With High-Performance LLM Inference In the ever-evolving landscape of AI, businesses rely on large language models to…

Relevance Score: 10.25, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 3 AI Acceleration for Mainstream Enterprise Servers With H200 NVL NVIDIA H200 NVL is ideal for lower-power, air-cooled enterprise rack designs that require flexible configurations, delivering acceleration for every AI and HPC workload regardless of size. With up to four GPUs connected by NVIDIA NVLink™ and a 1.5X memory increase, large language model (LLM) inference can be accelerated up to 1.7X and HPC applications achieve up to 1.3X more performance over the H100 NVL. Enterprise-Ready: AI Software Streamlines Development and Deployment NVIDIA H200 NVL comes with a five-year NVIDIA AI Enterprise subscription and simplifies the way you build an enterprise AI-ready platform. H200 accelerates AI development and deployment for production-ready generative AI solutions, including computer vision, speech AI, retrieval augmented generation (RAG), and more. NVIDIA AI Enterprise includes NVIDIA NIM™, a set of easy-to-use…

Relevance Score: 9.109375, Page Content: NVIDIA H200 Tensor Core GPU | Datasheet | 2 Supercharge High-Performance Computing Memory bandwidth is crucial for HPC applications, as it enables faster data transfer and reduces complex processing bottlenecks. For memory-intensive HPC applications like simulations, scientific research, and artificial intelligence, the H200’s higher memory bandwidth ensures that data can be accessed and manipulated efficiently, leading to 110X faster time to results. Preliminary specifications. May be subject to change. HPC MILC- dataset NERSC Apex Medium | HGX H200 4-GPU | dual Sapphire Rapids 8480 HPC Apps- CP2K: dataset H2O-32-RI-dRPA-96points | GROMACS: dataset STMV | ICON: dataset r2b5 | MILC: dataset NERSC Apex Medium | Chroma: dataset HMC Medium | Quantum Espresso: dataset AUSURF112 | 1x H100 SXM | 1x H200 SXM. Reduce Energy and TCO With the introduction of H200, energy efficiency and TCO reach new levels. This…

Relevance Score: 9.109375, Page Content: NVIDIA MGX™ H200 NVL partner and NVIDIA-Certified Systems with up to 8 GPUs NVIDIA AI Enterprise Add-on Included 1. Preliminary specifications. May be subject to change. 2. With sparsity. Ready to Get Started? To learn more about the NVIDIA H200 Tensor Core GPU, visit nvidia.com/h200 © 2024 NVIDIA Corporation and affiliates. All rights reserved. NVIDIA, the NVIDIA logo, HGX, Hopper, MGX, NIM, NVIDIA-Certified Systems, and NVLink are trademarks and/or registered trademarks of NVIDIA Corporation and affiliates in the U.S. and other countries. Other company and product names may be trademarks of the respective owners with which they are associated. 3512650. NOV24…

Relevance Score: 5.125, Page Content: With the introduction of H200, energy efficiency and TCO reach new levels. This cutting-edge technology offers unparalleled performance, all within the same power profile as the H100 Tensor Core GPU. AI factories and supercomputing systems that are not only faster but also more eco-friendly deliver an economic edge that propels the AI and scientific communities forward. Preliminary specifications. May be subject to change. Llama2 70B: ISL 2K, OSL 128 | Throughput | H100 SXM 1x GPU BS 8 | H200 SXM 1x GPU BS 32…
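The recursive splitting strategy described earlier in this section can be sketched in plain Python as follows. This is a simplified illustration, not the RecursiveCharacterTextSplitter implementation: the hypothetical `recursive_split` and `merge_pieces` functions ignore chunk overlap and collapse repeated separators, both of which the LangChain splitter handles differently.

```python
def merge_pieces(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily rejoin adjacent pieces with the separator while the result fits in chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator; recurse with the remaining ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: no usable separator is left, so cut fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, small = [], []
    for piece in text.split(separator):
        if not piece:
            continue  # Simplification: drop empty pieces from repeated separators.
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the small pieces collected so far, then split the oversized piece further.
            chunks.extend(merge_pieces(small, separator, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    chunks.extend(merge_pieces(small, separator, chunk_size))
    return chunks


sample = "First paragraph is short.\n\nSecond paragraph is much longer and needs another split."
result = recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", " ", ""])
for chunk in result:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within the limit after the paragraph-level split, so it stays whole. The second paragraph is too long, so it falls through to the next separators and is rebuilt word by word into chunks that stay under the limit.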