Date: 2026-03-07
Overview of NVIDIA NeMo Retriever Reranking NIM#
The NVIDIA NeMo Retriever Reranking APIs provide easy access to state-of-the-art models that are foundational building blocks for enterprise semantic search applications, delivering accurate answers quickly at scale. Developers can use these APIs to create robust copilots, chatbots, and AI assistants from start to finish. NeMo Retriever Reranking models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.
The following NeMo Retriever microservices provide superior natural language processing and understanding, boosting retrieval performance:
NeMo Retriever Embedding NIM - Boosts question-answering retrieval performance, providing high-quality embeddings for many downstream NLP tasks.
NeMo Retriever Reranking NIM - Enhances the retrieval performance further with a fine-tuned reranker, finding the most relevant passages to provide as context when querying an LLM.
The following diagram shows how the NeMo Retriever Reranking API can help a RAG-based question-answering application find relevant enterprise data.
NeMo Retriever Reranking NIM#
NVIDIA NeMo Retriever Reranking NIM (Reranking NIM) reorders citations by how well they match a query. This is a key step in the retrieval process, especially when the retrieval pipeline involves citations from different datastores that each have their own algorithms for measuring similarity.
Architecture#
Each NeMo Retriever Reranking NIM packages a reranking model, such as NV-RerankQA-Mistral4B-v3, into a Docker container image. All Reranking NIM Docker containers are accelerated with NVIDIA Triton™ Inference Server and expose an API compatible with OpenAI’s API standard. The Reranking NIM offers a fine-tuned, ready-to-use reranker based on Mistral 7B, reduced to 3.5B parameters by using only half of the weights. For a full list of supported models, see Support Matrix.
Note
We round up to the closest integer in our naming convention, so you might see mention of Mistral4B in other documentation and in URLs and filenames.
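As a rough illustration of how a client might call a running Reranking NIM, the following sketch builds a request and reorders the passages from the response. This overview does not specify the endpoint or schema, so the base URL, the `/v1/ranking` path, the model identifier, and the `query`, `passages`, `rankings`, `index`, and `logit` field names are all assumptions here; check the API reference for your deployed version before relying on them.

```python
import json
import urllib.request

# Assumptions (not stated in this overview): the base URL, the /v1/ranking
# path, the model identifier, and the request/response field names.
BASE_URL = "http://localhost:8000"          # hypothetical local deployment
MODEL = "nvidia/nv-rerankqa-mistral-4b-v3"  # hypothetical model identifier


def build_ranking_payload(query, passages, model=MODEL):
    """Build the JSON body for a reranking request (assumed schema)."""
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": p} for p in passages],
    }


def order_passages(passages, rankings):
    """Reorder passages using a response assumed to look like
    [{"index": 2, "logit": 7.1}, ...], where a higher logit means more relevant."""
    ranked = sorted(rankings, key=lambda r: r["logit"], reverse=True)
    return [passages[r["index"]] for r in ranked]


def rerank(query, passages, base_url=BASE_URL):
    """Send the request to a running NIM and return passages in ranked order."""
    body = json.dumps(build_ranking_payload(query, passages)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/ranking",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        rankings = json.load(response)["rankings"]
    return order_passages(passages, rankings)
```

Only `rerank` touches the network; `build_ranking_payload` and `order_passages` are pure functions, so they can be unit-tested without a running container.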
Enterprise-Ready Features#
The Reranking NIM comes with enterprise-ready features, such as a high-performance inference server, flexible integration, and enterprise-grade security.
High Performance – The Reranking NIM is optimized for high-performance deep learning inference with NVIDIA TensorRT™ and NVIDIA Triton™ Inference Server.
Scalable Deployment – The Reranking NIM seamlessly scales from a few users to millions.
Flexible Integration – The Reranking NIM can be easily incorporated into existing data pipelines and applications. Developers are provided with an OpenAI-compatible API in addition to custom NVIDIA extensions.
Enterprise-Grade Security – The Reranking NIM comes with security features such as the use of safetensors, continuous patching of CVEs, and constant monitoring with our internal penetration tests.
Trade-Offs#
Reranking is essential in hybrid retrieval situations, as it helps to combine results from different sources of data when there is no easy way to merge them. For example, in a hybrid pipeline with multiple sources of retrieved documents, a dense nearest-neighbor search produces a score such as cosine similarity, whereas a sparse retriever like Elasticsearch might produce BM25 scores. Because these scores are on different scales, sorting the combined results by raw score is not enough to ensure optimal relevance.
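The scale mismatch can be seen in a small sketch. The passages and scores below are made up for illustration, and `toy_relevance` is a hypothetical token-overlap stand-in for a reranking model, not the Reranking NIM itself. The point is only that one scorer applied to every candidate yields comparable values, while raw retriever scores do not.

```python
# Toy candidates as (passage, score). All values are invented for illustration.
dense_hits = [  # cosine similarity, bounded by 1
    ("Reranking reorders passages by relevance to the query.", 0.82),
    ("Triton serves models on GPUs.", 0.41),
]
sparse_hits = [  # BM25, unbounded
    ("BM25 is a sparse lexical scoring function.", 11.7),
    ("A reranker assigns a relevance score to each passage.", 9.3),
]


def naive_merge(*hit_lists):
    """Sort all candidates by raw retriever score, ignoring the scale mismatch."""
    merged = [hit for hits in hit_lists for hit in hits]
    return [p for p, _ in sorted(merged, key=lambda h: h[1], reverse=True)]


def toy_relevance(query, passage):
    """Hypothetical stand-in for a reranker: fraction of query tokens in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().replace(".", "").split())
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank_merge(query, *hit_lists):
    """Pool candidates, drop the retriever scores, and score every passage the same way."""
    pooled = list(dict.fromkeys(p for hits in hit_lists for p, _ in hits))
    return sorted(pooled, key=lambda p: toy_relevance(query, p), reverse=True)


query = "how does reranking score passages by relevance"
naive_order = naive_merge(dense_hits, sparse_hits)
reranked_order = rerank_merge(query, dense_hits, sparse_hits)
```

In `naive_order`, both BM25 results come first simply because BM25 values are numerically larger than any cosine similarity. In `reranked_order`, the ordering depends only on the shared scorer, so the source of each passage no longer matters.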
| Pipeline | Avg recall@5 |
|---|---|
| Dense | 0.5699 |
| Dense + Reranking | 0.7070 |
| Dense + Sparse | 0.5856 |
| Dense + Sparse + Reranking | 0.7137 |
The above table shows that reranking improves recall not only when combining data sources, but also when using only a single datastore. In these experiments, each datastore was queried with a top_k of 100. Avg recall@5 is the average of the scores across FiQA, HotpotQA, BEIR, NQ, and TechQA.
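For readers who want to reproduce this kind of measurement, here is a minimal sketch of recall@k under its usual definition (the fraction of relevant passages that appear in the top k results). The exact evaluation protocol behind the table is not described in this overview, so treat the function as an assumption about the metric; the document IDs are hypothetical. The `reported` values are copied from the table.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant passages that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


# Hypothetical query: two relevant passages, only one of them in the top 5.
ranked = ["d7", "d2", "d9", "d4", "d1", "d3"]
example_recall = recall_at_k(ranked, {"d2", "d3"}, k=5)

# Average recall@5 values copied from the table above.
reported = {
    "Dense": 0.5699,
    "Dense + Reranking": 0.7070,
    "Dense + Sparse": 0.5856,
    "Dense + Sparse + Reranking": 0.7137,
}


def gain(baseline, improved):
    """Return the (absolute, relative) improvement of one pipeline over another."""
    absolute = reported[improved] - reported[baseline]
    return absolute, absolute / reported[baseline]


dense_gain = gain("Dense", "Dense + Reranking")
hybrid_gain = gain("Dense + Sparse", "Dense + Sparse + Reranking")
```

With the table's numbers, reranking adds about 0.137 recall@5 to the dense pipeline and about 0.128 to the hybrid pipeline, a relative improvement of roughly 24% and 22% respectively.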
In exchange for that improvement, the trade-off is increased cost and latency. On an H100, reranking 500 passages takes approximately 1,750 ms.
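The reported figure works out to about 3.5 ms per passage. The sketch below turns that into a rough budgeting aid. It assumes latency scales linearly with the number of passages, which this overview does not state; batching, passage length, and hardware will change the real numbers, so measure on your own deployment before sizing a pipeline.

```python
# Figures reported above for an H100.
REPORTED_PASSAGES = 500
REPORTED_LATENCY_MS = 1750.0


def per_passage_latency_ms():
    """Average reranking cost per passage implied by the reported figure."""
    return REPORTED_LATENCY_MS / REPORTED_PASSAGES


def estimated_latency_ms(num_passages):
    """Linear extrapolation (an assumption, not a measured result)."""
    return per_passage_latency_ms() * num_passages


def max_passages_within(budget_ms):
    """Largest candidate count that fits a latency budget under the same assumption."""
    return int(budget_ms // per_passage_latency_ms())


top_100_estimate = estimated_latency_ms(100)  # e.g. one datastore with top_k of 100
one_second_budget = max_passages_within(1000)
```

Under the linear assumption, reranking a single datastore's 100 candidates would take on the order of 350 ms, and a one-second budget would cover roughly 285 passages.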