
Text Reranking NIM Overview

NeMo Text Retriever NIM (Text Retriever NIM) APIs provide easy access to state-of-the-art models that are foundational building blocks for enterprise semantic search applications, delivering accurate answers quickly at scale. Developers can use these APIs to create robust copilots, chatbots, and AI assistants from start to finish. Text Retriever NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.

  • NeMo Retriever Text Embedding NIM - Boosts text question-answering retrieval performance, providing high-quality embeddings for many downstream NLP tasks. For more information, see the Text Embedding NIM documentation.

  • NeMo Retriever Text Reranking NIM - Includes a fine-tuned reranker that boosts the retrieval process by finding the most relevant passages to provide as context when querying an LLM.

This diagram shows how Text Retriever NIM APIs can help a RAG-based application find relevant data for enterprise question answering.

[Diagram: enterprise RAG pipeline using Text Retriever NIM APIs]

NeMo Retriever Text Reranking NIM (Text Reranking NIM) reorders citations by how well they match a query. This is a key step in the retrieval process, especially when the retrieval pipeline involves citations from different datastores that each have their own algorithms for measuring similarity.
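The sketch below shows one way to call a deployed reranker over HTTP. It assumes a Text Reranking NIM running locally on port 8000 and the NV-RerankQA-Mistral4B-v3 model; the /v1/ranking endpoint and payload shape follow the NIM API reference, but verify them against your deployed version.

```python
import requests

# Assumption: a Text Reranking NIM is listening on localhost:8000.
URL = "http://localhost:8000/v1/ranking"

payload = {
    "model": "nvidia/nv-rerankqa-mistral-4b-v3",
    "query": {"text": "What is the memory bandwidth of the H100 SXM?"},
    "passages": [
        {"text": "The H100 SXM provides 3.35 TB/s of GPU memory bandwidth."},
        {"text": "NVLink enables fast GPU-to-GPU communication."},
        {"text": "The A100 80GB provides about 2 TB/s of memory bandwidth."},
    ],
}

response = requests.post(URL, json=payload)
response.raise_for_status()

# Each ranking entry pairs a passage index with a relevance logit,
# sorted from most to least relevant to the query.
for entry in response.json()["rankings"]:
    print(entry["index"], entry["logit"])
```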

Architecture

Each Text Reranking NIM packages a reranking model, such as NV-RerankQA-Mistral4B-v3, into a Docker container image. All Text Reranking NIM Docker containers are accelerated with NVIDIA Triton™ Inference Server and expose an API compatible with OpenAI's API standard. Text Reranking NIM offers a fine-tuned, ready-to-use reranker based on Mistral 7B, reduced to 3.5B parameters by using only half of the weights. For a full list of supported models, see the Support Matrix.

Note

We round up to the closest integer in our naming convention, so you might see mention of Mistral4B in other documentation and in URLs and filenames.

Enterprise-Ready Features

Text Reranking NIM comes with enterprise-ready features, such as a high-performance inference server, flexible integration, and enterprise-grade security.

  • High Performance: Text Reranking NIM is optimized for high-performance deep learning inference with NVIDIA TensorRT™ and NVIDIA Triton™ Inference Server.

  • Scalable Deployment: Text Reranking NIM seamlessly scales from a few users to millions.

  • Flexible Integration: Text Reranking NIM can be easily incorporated into existing data pipelines and applications. Developers are provided with an OpenAI-compatible API in addition to custom NVIDIA extensions; see the integration sketch after this list.

  • Enterprise-Grade Security: Text Reranking NIM comes with security features such as the use of safetensors, continuous patching of CVEs, and constant monitoring with our internal penetration tests.
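As one illustration of that flexibility, the sketch below wires the reranker into a LangChain retrieval pipeline. The NVIDIARerank class comes from the langchain-nvidia-ai-endpoints package; the local URL and model name are assumptions about your deployment, so adjust both to match it.

```python
from langchain_core.documents import Document
from langchain_nvidia_ai_endpoints import NVIDIARerank

# Assumption: a Text Reranking NIM is running locally on port 8000.
reranker = NVIDIARerank(
    base_url="http://localhost:8000/v1",
    model="nvidia/nv-rerankqa-mistral-4b-v3",
    top_n=2,
)

docs = [
    Document(page_content="NVLink enables fast GPU-to-GPU communication."),
    Document(page_content="The H100 SXM provides 3.35 TB/s of GPU memory bandwidth."),
    Document(page_content="The A100 80GB provides about 2 TB/s of memory bandwidth."),
]

# compress_documents returns the top_n documents reordered by relevance,
# with the reranker's score attached to each document's metadata.
ranked = reranker.compress_documents(documents=docs, query="H100 memory bandwidth")
for doc in ranked:
    print(doc.metadata.get("relevance_score"), doc.page_content)
```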

Trade-Offs

Reranking is essential in hybrid retrieval situations, because it combines results from different data sources when there is no easy way to merge them. The problem is evident in a hybrid pipeline with multiple sources of retrieved documents: a dense nearest-neighbor search produces scores such as cosine similarity, whereas a sparse retriever like Elasticsearch might produce BM25 scores. Given the disparity in scoring scales, reordering based on the raw scores alone is not enough to ensure optimal relevance.
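To make the scale mismatch concrete, the following sketch unions candidates from a dense index (cosine scores, roughly bounded by [-1, 1]) and a sparse index (BM25 scores, unbounded) and defers the final ordering to the reranker instead of trying to normalize the two scales. The retriever objects and the rerank callable are hypothetical stand-ins, not part of the NIM API.

```python
# Hypothetical hybrid-retrieval sketch. Cosine and BM25 scores live on
# incomparable scales, so the raw scores are discarded and the reranker
# produces a single, consistent ordering over the merged candidates.
def hybrid_retrieve(query, dense_index, sparse_index, rerank, top_k=100, top_n=5):
    dense_hits = dense_index.search(query, top_k)    # cosine, e.g. 0.57
    sparse_hits = sparse_index.search(query, top_k)  # BM25, e.g. 23.4

    # Deduplicate by passage text; keep only the text, not the scores.
    candidates = list({hit.text for hit in dense_hits + sparse_hits})

    # The reranker scores every (query, passage) pair on one scale;
    # assume it returns (passage, score) tuples.
    scored = rerank(query, candidates)
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```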

Pipeline                     Avg recall@5
Dense                        0.5699
Dense + Reranking            0.7070
Dense + Sparse               0.5856
Dense + Sparse + Reranking   0.7137

The table above shows that reranking improves recall not only when combining data sources, but also when using a single datastore. In those experiments, each datastore retrieves the top 100 candidates (top_k = 100). Avg recall@5 is the average score across FiQA, HotpotQA, BEIR, NQ, and TechQA.
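For reference, recall@5 for a single query is the fraction of that query's relevant passages that appear among the top 5 results; a minimal sketch of the metric:

```python
# recall@k for one query: the fraction of its relevant passages that
# appear in the top-k retrieved results. The table averages this value
# over all queries in each benchmark.
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Example: 1 of 2 relevant passages appears in the top 5 -> 0.5
print(recall_at_k(["p3", "p9", "p1", "p7", "p2"], ["p1", "p4"]))
```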

In exchange for that improvement, the trade-off is increased cost and latency. On an H100, reranking 500 passages takes approximately 1,750 ms.
