Overview of NVIDIA NeMo Retriever Reranking NIM#
The NVIDIA NeMo Retriever Reranking API provides easy access to state-of-the-art models that are foundational building blocks for enterprise semantic search applications, delivering accurate answers quickly at scale. Developers can use the API to create robust copilots, chatbots, and AI assistants from start to finish. NeMo Retriever Reranking models are built on the NVIDIA software platform, incorporating CUDA to offer out-of-the-box GPU acceleration.
The following NeMo Retriever microservices provide superior natural language processing and understanding, boosting retrieval performance:
NeMo Retriever Embedding NIM - Boosts question-answering retrieval performance, providing high-quality embeddings for many downstream NLP tasks.
NeMo Retriever Reranking NIM - Enhances the retrieval performance further with a fine-tuned reranker, finding the most relevant passages to provide as context when querying an LLM.
The following diagram shows how the NeMo Retriever Reranking API can help a RAG-based application find relevant enterprise data in response to a user's question.
NeMo Retriever Reranking NIM#
NVIDIA NeMo Retriever Reranking NIM (Reranking NIM) reorders citations by how well they match a query. This is a key step in the retrieval process, especially when the retrieval pipeline involves citations from different datastores that each have their own algorithms for measuring similarity.
Architecture#
Each NeMo Retriever Reranking NIM packages a reranking model into a Docker container image and exposes ranking APIs that can be integrated into retrieval and RAG pipelines. Text reranking models score a text query against text passages. The VLM reranking model, nvidia/llama-nemotron-rerank-vl-1b-v2, scores a text query against text-only, image-only, or text-and-image passages, so applications can rerank document pages, slides, screenshots, and passages that include visual evidence. For a full list of supported models, see Support Matrix.
Note
For VLM reranking requests, the query is always text. Images are provided on passages.
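As a rough illustration of how a pipeline might call a ranking API and reorder its passages, the following Python sketch builds a request with one text query and a list of text passages, then sorts the passages by the returned scores. The endpoint URL, the request fields (`model`, `query`, `passages`), and the response shape (`rankings` entries with `index` and `logit`) are assumptions made for this sketch, not details stated on this page. Check the API reference for your deployed NIM before relying on them.

```python
import json
import urllib.request

# Assumed endpoint for a locally running NIM; adjust to your deployment.
RANKING_URL = "http://localhost:8000/v1/ranking"


def build_ranking_request(query, passages, model):
    """Build a request body: one text query scored against text passages.

    The field names here are assumptions for this sketch.
    """
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": passage} for passage in passages],
    }


def reorder_passages(passages, response):
    """Return passages sorted from most to least relevant.

    Assumes the response looks like:
    {"rankings": [{"index": <position in passages>, "logit": <score>}, ...]}
    """
    rankings = sorted(response["rankings"], key=lambda r: r["logit"], reverse=True)
    return [passages[r["index"]] for r in rankings]


def rerank(query, passages, model, url=RANKING_URL):
    """Send the request to the (assumed) ranking endpoint and reorder the passages."""
    body = json.dumps(build_ranking_request(query, passages, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return reorder_passages(passages, json.load(reply))
```

Only `rerank` needs a running service. `build_ranking_request` and `reorder_passages` are pure functions, so they can be exercised against a recorded or hand-written response.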
Custom Model Artifacts#
Model artifacts typically consist of a framework, architecture, and weights. As long as the framework and architecture match a supported model, you can use your own custom weights.
NeMo Retriever Reranking NIM supports custom model artifacts for the models listed in Support Matrix.
To use pre-downloaded model artifacts, stage a Hugging Face-style safetensors model directory on the host and set NIM_MODEL_PATH to the in-container path for that directory.
For details, refer to Custom Model Artifact Support in NVIDIA NeMo Retriever Reranking NIM.
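The relationship between the host directory and NIM_MODEL_PATH can be sketched as follows. This Python helper only assembles a `docker run` argument list that mounts a staged model directory and points NIM_MODEL_PATH at the mount point. The function name, the container path `/opt/custom-model`, and the image name in the usage note are hypothetical. The sketch also omits other options a real launch would likely need, such as GPU access, credentials, and port mappings.

```python
def docker_run_args(host_model_dir, image, container_model_dir="/opt/custom-model"):
    """Assemble a `docker run` argument list for a pre-staged model directory.

    host_model_dir:      directory on the host holding the safetensors model
    container_model_dir: where that directory appears inside the container
                         (hypothetical default chosen for this sketch)
    image:               container image to run (placeholder, supplied by caller)
    """
    return [
        "docker", "run", "--rm",
        # Bind-mount the staged model directory into the container.
        "-v", f"{host_model_dir}:{container_model_dir}",
        # NIM_MODEL_PATH must be the in-container path, not the host path.
        "-e", f"NIM_MODEL_PATH={container_model_dir}",
        image,
    ]
```

For example, `docker_run_args("/data/my-reranker", "example.com/reranking-nim:latest")` yields arguments in which the value of NIM_MODEL_PATH matches the right-hand side of the `-v` mapping, which is the property the instructions above describe.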
Enterprise-Ready Features#
The Reranking NIM comes with enterprise-ready features, such as a high-performance inference server, flexible integration, and enterprise-grade security.
High Performance – The Reranking NIM is optimized for high-performance deep learning inference on NVIDIA GPUs.
Scalable Deployment – The Reranking NIM seamlessly scales from a few users to millions.
Flexible Integration – The Reranking NIM can be easily incorporated into existing data pipelines and applications. Developers are provided with an OpenAI-compatible API in addition to custom NVIDIA extensions.
Enterprise-Grade Security – The Reranking NIM comes with security features such as the use of safetensors, continuous patching of CVEs, and constant monitoring with our internal penetration tests.
Trade-Offs#
Reranking is essential in hybrid retrieval, where results from different data sources must be combined and there is no easy way to merge them. For example, a dense nearest-neighbor search might score results with cosine similarity, whereas a sparse retriever such as Elasticsearch might use BM25 scores. Because these scores are on different scales, reordering the combined results by raw score alone is not enough to ensure optimal relevance.
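A minimal sketch of this idea in Python: candidates from a dense and a sparse retriever are merged without using their incompatible scores, and a single scoring function then orders the union. All function names are hypothetical, and `toy_overlap_score` is only a word-overlap stand-in so the example runs without a model. In a real pipeline, that role would be played by a call to the reranker.

```python
def merge_candidates(dense_hits, sparse_hits):
    """Union passages from both retrievers, ignoring retriever scores.

    Each hit is a (passage, score) pair. Cosine similarities and BM25 scores
    are on different scales, so they are discarded rather than compared.
    """
    merged = []
    seen = set()
    for passage, _score in list(dense_hits) + list(sparse_hits):
        if passage not in seen:
            seen.add(passage)
            merged.append(passage)
    return merged


def rerank_candidates(query, candidates, score_fn, top_n=5):
    """Score every candidate with one function and keep the top_n passages."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _score, passage in scored[:top_n]]


def toy_overlap_score(query, passage):
    """Toy stand-in for a reranker: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


# Illustrative inputs: cosine-like scores versus BM25-like scores.
dense_hits = [("GPU memory limits", 0.83), ("Reranking improves recall", 0.79)]
sparse_hits = [("Reranking improves recall", 12.4), ("Install the driver", 9.1)]

candidates = merge_candidates(dense_hits, sparse_hits)
top = rerank_candidates("reranking recall", candidates, toy_overlap_score, top_n=2)
```

Sorting the raw scores together would always place the BM25-scored passages first, simply because their numbers are larger. Scoring the merged set with one function avoids that bias.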
| Pipeline | Avg recall@5 |
|---|---|
| Dense | 0.5699 |
| Dense + Reranking | 0.7070 |
| Dense + Sparse | 0.5856 |
| Dense + Sparse + Reranking | 0.7137 |
The above table shows that reranking improves recall not only when combining data sources, but also when using only a single datastore. In these experiments, each datastore has a top_k of 100. Avg recall@5 is an average of scores across FiQA, HotpotQA, BEIR, NQ, and TechQA.
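For readers unfamiliar with the metric, the following Python sketch shows one common way to compute recall@k and to average it across datasets. The exact evaluation procedure behind the reported numbers is not described here, so both the recall definition (the fraction of relevant passages found in the top k) and the unweighted mean across datasets are assumptions. The function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant passages that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant passages")
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def average_recall(per_dataset_recall):
    """Unweighted mean of per-dataset recall scores (assumed averaging scheme)."""
    values = list(per_dataset_recall.values())
    return sum(values) / len(values)
```

With this definition, a query that has two relevant passages, only one of which is ranked in the top five, contributes a recall@5 of 0.5.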
In exchange for that improvement, the trade-off is increased cost and latency. On an H100, reranking 500 passages takes approximately 1,750 ms.