Overview of NVIDIA NeMo Retriever Embedding NIM#

The NVIDIA NeMo Retriever Embedding API provides easy access to state-of-the-art models that are foundational building blocks for enterprise semantic search applications, delivering accurate answers quickly at scale. Developers can use these APIs to create robust copilots, chatbots, and AI assistants from start to finish. NeMo Retriever Embedding models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.

The following NeMo Retriever microservices provide superior natural language processing and understanding, boosting retrieval performance:

  • NeMo Retriever Embedding NIM - Boosts question-answering retrieval performance, providing high-quality embeddings for many downstream NLP tasks.

  • NeMo Retriever Reranking NIM - Enhances the retrieval performance further with a fine-tuned reranker, finding the most relevant passages to provide as context when querying an LLM.

The following diagram shows how the NeMo Retriever Embedding API can help a question-answering RAG application find the most relevant data in an enterprise setting.

[Figure: Question-answering RAG workflow with the NeMo Retriever Embedding API (_images/image5.png)]

NeMo Retriever Embedding NIM#

NVIDIA NeMo Retriever Embedding NIM (Embedding NIM) brings the power of state-of-the-art text and image embedding models to your applications, offering unparalleled natural language processing and understanding capabilities. You can use the Embedding NIM for semantic search, Retrieval Augmented Generation (RAG), or any application that uses embeddings. The Embedding NIM is built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.

Architecture#

Each NeMo Retriever Embedding NIM packages an embedding model, such as NV-EmbedQA-Mistral7B-v2, into a Docker container image. All NeMo Retriever Embedding NIM Docker containers are accelerated with NVIDIA Triton™ Inference Server and expose an API compatible with OpenAI’s API standard.

For a full list of supported models, see Supported Models.
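As a concrete illustration, the following is a minimal sketch of calling the OpenAI-compatible embeddings endpoint of a locally running Embedding NIM with plain HTTP. The host and port, the model name, and the `input_type` parameter are assumptions for illustration only; check the Supported Models page and the API reference for the values that apply to your deployment.

```python
import requests

# Assumes an Embedding NIM container is running locally and listening on port 8000.
NIM_URL = "http://localhost:8000/v1/embeddings"

payload = {
    # Model name is an assumption for illustration; see the Supported Models page.
    "model": "nvidia/nv-embedqa-mistral-7b-v2",
    "input": ["What is the capital of France?"],
    # Assumed extension parameter distinguishing queries from passages
    # for asymmetric retrieval models.
    "input_type": "query",
}

response = requests.post(NIM_URL, json=payload, timeout=60)
response.raise_for_status()

# The response follows the OpenAI embeddings format: data -> list of embedding objects.
embedding = response.json()["data"][0]["embedding"]
print(f"Embedding dimension: {len(embedding)}")
```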

Enterprise-Ready Features#

The Embedding NIM comes with enterprise-ready features, such as a high-performance inference server, flexible integration, and enterprise-grade security.

  • High Performance – The Embedding NIM is optimized for high-performance deep learning inference with NVIDIA TensorRT™ and NVIDIA Triton™ Inference Server.

  • Scalable Deployment – The Embedding NIM seamlessly scales from a few users to millions.

  • Flexible Integration – The Embedding NIM can be easily incorporated into existing data pipelines and applications. Developers are provided with an OpenAI-compatible API in addition to custom NVIDIA extensions, as sketched after this list.

  • Enterprise-Grade Security – The Embedding NIM comes with security features such as the use of safetensors, continuous patching of CVEs, and ongoing monitoring backed by internal penetration testing.
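Because the API is OpenAI-compatible, existing tooling can often be reused directly. The sketch below points the official `openai` Python client at a locally deployed Embedding NIM; the base URL, model name, and the `input_type`/`truncate` extension parameter names are assumptions for illustration, so consult the API reference for your deployment.

```python
from openai import OpenAI

# Point the standard OpenAI client at the NIM's OpenAI-compatible endpoint.
# Local base URL and placeholder API key are assumptions for a local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.embeddings.create(
    model="nvidia/nv-embedqa-mistral-7b-v2",  # assumed model name; see Supported Models
    input=["NeMo Retriever provides GPU-accelerated embeddings."],
    # Custom NVIDIA extension parameters are passed through extra_body;
    # the parameter names and values shown are assumptions.
    extra_body={"input_type": "passage", "truncate": "END"},
)

print(len(response.data[0].embedding))
```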

Applications#

Retrieval Augmented Generation#

In a Retrieval Augmented Generation (RAG) application, we use the embedding model to encode the knowledge base (offline) and the user question (online) into contextual embeddings, so that the application can retrieve the most relevant context and the LLM can provide users with accurate answers. We need a high-quality embedding model to ensure high relevancy of the retrieved context. The numbered steps below describe this workflow, and a minimal end-to-end sketch follows them.

1. Encoding the knowledge base (offline): Given a knowledge base containing documents in text, image, PDF, HTML, or other formats, we first split the knowledge base into chunks, then encode each chunk into a dense vector representation, also called an embedding, using an embedding model. The resulting embeddings, along with their corresponding documents and other metadata, are saved in a vector database. The diagram below illustrates the knowledge base encoding process.

[Figure: Knowledge base encoding process (_images/image6.png)]

2. Deployment (online): Once deployed, the RAG application can access the vector database and answer questions in real time. To answer a user question, the RAG application first finds relevant chunks from the vector database, then it uses the retrieved chunks as context to generate a response.

  • Phase 1: Retrieval from the vector database based on the user’s query

    The user’s query is first embedded as a dense vector using the embedding model. The query embedding is then used to search the vector database for the document chunks most relevant to the query. The diagram below illustrates the retrieval process.

  • Phase 2: Use an LLM to generate a response leveraging the retrieved context

    The most relevant chunks are joined to form the context for the user query. The LLM combines the context and the user query to generate a response. The diagram below illustrates the response generation process.
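The sketch below ties the offline and online phases together using a tiny in-memory stand-in for a vector database. It assumes a locally running Embedding NIM; the endpoint URL, model name, and the `input_type` parameter are assumptions for illustration, and the final LLM call is left as a comment so you can plug in the LLM of your choice.

```python
import requests
import numpy as np

EMBED_URL = "http://localhost:8000/v1/embeddings"   # assumed local NIM endpoint
MODEL = "nvidia/nv-embedqa-mistral-7b-v2"           # assumed model name


def embed(texts, input_type):
    """Embed a list of strings as either 'passage' (offline) or 'query' (online)."""
    resp = requests.post(
        EMBED_URL,
        json={"model": MODEL, "input": texts, "input_type": input_type},
        timeout=60,
    )
    resp.raise_for_status()
    return np.array([item["embedding"] for item in resp.json()["data"]])


# --- Offline: encode knowledge-base chunks and keep them in memory ---
chunks = [
    "NIM containers are accelerated with Triton Inference Server.",
    "Embeddings map text into dense vector representations.",
    "RAG applications retrieve relevant context before calling an LLM.",
]
chunk_vectors = embed(chunks, input_type="passage")

# --- Online, phase 1: embed the query and retrieve the closest chunks ---
question = "How does a RAG application find relevant context?"
query_vector = embed([question], input_type="query")[0]

# Cosine similarity between the query and every stored chunk.
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# --- Online, phase 2: join the retrieved chunks into context for the LLM ---
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
print(prompt)  # send this prompt to your LLM of choice to generate the answer
```

In a production application, the in-memory arrays would be replaced by a vector database, and the final prompt would be sent to an LLM endpoint to generate the response.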

Text and Image Embeddings for Classification and Clustering#

Embeddings can be used for text and image classification tasks, such as sentiment analysis and topic categorization. They can also be used for clustering tasks, such as topic discovery and recommender systems. A high-quality embedding model improves the performance of these tasks by capturing the contextual information in the dense vector representation.
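As a small illustration of the clustering use case, the sketch below embeds a handful of texts and groups them with k-means. The endpoint URL, model name, and `input_type` parameter are the same assumptions as in the sketches above, and the choice of scikit-learn and a cluster count of two is arbitrary.

```python
import requests
import numpy as np
from sklearn.cluster import KMeans

EMBED_URL = "http://localhost:8000/v1/embeddings"   # assumed local NIM endpoint
MODEL = "nvidia/nv-embedqa-mistral-7b-v2"           # assumed model name

texts = [
    "The GPU driver update fixed the rendering glitch.",
    "Our quarterly revenue exceeded expectations.",
    "CUDA kernels can be profiled with Nsight tools.",
    "The finance team published the annual budget.",
]

resp = requests.post(
    EMBED_URL,
    json={"model": MODEL, "input": texts, "input_type": "passage"},
    timeout=60,
)
resp.raise_for_status()
vectors = np.array([item["embedding"] for item in resp.json()["data"]])

# Group semantically similar texts; two clusters (technical vs. financial) is arbitrary.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for text, label in zip(texts, labels):
    print(label, text)
```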

Custom Applications#

The NeMo Retriever Embedding NIM API is designed to be versatile. Developers can leverage the text and image embeddings for a variety of applications based on their specific use cases, and can experiment with the API and integrate it seamlessly into their projects.