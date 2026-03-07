Overview of NVIDIA NeMo Retriever Embedding NIM#

The NVIDIA NeMo Retriever Embedding API provides easy access to state-of-the-art models that are foundational building blocks for enterprise semantic search applications, delivering accurate answers quickly at scale. Developers can use these APIs to create robust copilots, chatbots, and AI assistants from start to finish. NeMo Retriever Embedding models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.

The following NeMo Retriever microservices provide superior natural language processing and understanding, boosting retrieval performance:

NeMo Retriever Embedding NIM - Boosts question-answering retrieval performance, providing high-quality embeddings for many downstream NLP tasks.

NeMo Retriever Reranking NIM - Enhances the retrieval performance further with a fine-tuned reranker, finding the most relevant passages to provide as context when querying an LLM.

The following diagram shows how the NeMo Retriever Embedding API can help a question-answering RAG application find the most relevant data in an enterprise setting.
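As a rough illustration of the retrieval step in such a RAG application, the sketch below ranks passages by cosine similarity between a query embedding and passage embeddings. The three-dimensional vectors are made-up stand-ins for real model output, and the helper names `cosine_similarity` and `top_k` are hypothetical, not part of any NVIDIA API.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, passage_vecs, k=2):
    """Return the indices of the k passages most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), index)
              for index, vec in enumerate(passage_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Toy vectors (assumption): real embeddings have hundreds or thousands of dimensions.
passages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]

ranked = top_k(query, passages, k=2)
print(ranked)
```

In a real deployment the vectors would come from the embedding service and the search would typically run inside a vector database rather than a Python loop.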

NeMo Retriever Embedding NIM#

NVIDIA NeMo Retriever Embedding NIM (Embedding NIM) brings the power of state-of-the-art text and image embedding models to your applications, offering unparalleled natural language processing and understanding capabilities. You can use the Embedding NIM for semantic search, Retrieval Augmented Generation (RAG), or any application that uses embeddings. The Embedding NIM is built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.

Architecture#

Each NeMo Retriever Embedding NIM packages an embedding model, such as NV-EmbedQA-Mistral7B-v2, into a Docker container image. All NeMo Retriever Embedding NIM Docker containers are accelerated with NVIDIA Triton™ Inference Server and expose an API compatible with OpenAI’s API standard. For a full list of supported models, see Supported Models.
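Because the container exposes an OpenAI-compatible API, a client can in principle talk to it with a plain HTTP request in the OpenAI embeddings shape. The following is a minimal sketch under stated assumptions: the base URL `http://localhost:8000`, the `/v1/embeddings` path (taken from the OpenAI convention), and the placeholder model name are illustrative and must be checked against your deployment and the Supported Models list. The sketch only builds the request and parses a response; it does not contact a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"       # assumption: where the container is reachable
MODEL_NAME = "example-embedding-model"   # placeholder: use an identifier from Supported Models


def build_embedding_request(texts, model=MODEL_NAME, base_url=BASE_URL):
    """Build a POST request in the OpenAI embeddings format (not sent here)."""
    payload = {"input": list(texts), "model": model}
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(body):
    """Extract vectors from an OpenAI-style response body, ordered by input index."""
    items = sorted(json.loads(body)["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


request = build_embedding_request(["What is GPU acceleration?"])
# Against a running service, the request could be sent with:
#   body = urllib.request.urlopen(request).read()
#   vectors = parse_embedding_response(body)
```

Individual models may require additional request fields through the NVIDIA extensions; those are not shown here.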

Enterprise-Ready Features#

The Embedding NIM comes with enterprise-ready features, such as a high-performance inference server, flexible integration, and enterprise-grade security.

High Performance – The Embedding NIM is optimized for high-performance deep learning inference with NVIDIA TensorRT™ and NVIDIA Triton™ Inference Server.

Scalable Deployment – The Embedding NIM seamlessly scales from a few users to millions.

Flexible Integration – The Embedding NIM can be easily incorporated into existing data pipelines and applications. Developers are provided with an OpenAI-compatible API in addition to custom NVIDIA extensions.

Enterprise-Grade Security – The Embedding NIM comes with security features such as the use of safetensors, continuous patching of CVEs, and constant monitoring with our internal penetration tests.