NVIDIA Enterprise RAG LLM Operator


About the Operator

The NVIDIA Enterprise Retrieval Augmented Generation (RAG) Large Language Model (LLM) Operator enables cluster administrators to operate the software components and services that are necessary to run RAG pipelines in Kubernetes.

The NVIDIA Enterprise RAG LLM Operator enables early access to an Operator that manages the life cycle of the following key components for RAG pipelines:

  • NVIDIA Inference Microservice

  • NVIDIA NeMo Retriever Embedding Microservice

NVIDIA provides a sample RAG pipeline to demonstrate deploying an LLM model, pgvector as a sample vector database, a chat bot web application, and a query server that communicates with the microservices and the vector database.

Early Access

The Operator is available for early access (EA) use only. EA releases are not supported in production environments and are not functionally complete. EA releases provide access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

To request early access, fill out the form at https://developer.nvidia.com/nemo-microservices-early-access.


Limitations

The Operator has the following limitations:

  • The Operator supports a single inference model and a single embedding model in a namespace. Using a single instance of the applications has two consequences:

    • To use more than one model, create another namespace and configure the pipeline in the new namespace.

    • The models are GPU-specific. For example, the Operator does not support a mix of some nodes with NVIDIA A100 GPUs and other nodes with NVIDIA H100 GPUs with a single instance of the microservices. To support nodes with different GPUs, create another namespace and specify the GPU model in the node selector when you deploy the pipeline.
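    To deploy a pipeline for a second GPU model, the usual Kubernetes pattern is a node selector that matches the GPU product label that NVIDIA GPU Feature Discovery applies to each node. The placement of the field below is an illustrative sketch; the exact Helm values path for the node selector depends on the pipeline chart.

    ```yaml
    # Illustrative sketch: pin the pipeline pods in this namespace to one GPU model.
    # The nvidia.com/gpu.product label is set by NVIDIA GPU Feature Discovery;
    # the label value shown is an example and must match the GPUs in your nodes.
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
    ```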

  • When you specify an inference or embedding model for the RAG pipeline, the Operator configures an init container to download the model from NVIDIA NGC before starting the microservice and using the model. The Operator does not support local models that are not downloaded by the init container.

  • If all the GPUs in your cluster are allocated and you change a Helm pipeline in a way that requires starting a new nemollm-inference, nemollm-embedding, or query pod, the new pods remain in the Pending state because Kubernetes cannot schedule them on a node with an allocatable GPU resource.

    The query pod is managed as a Kubernetes deployment. You can change the deployment strategy from the default RollingUpdate to Recreate. The Recreate strategy causes Kubernetes to delete the currently running pods before starting new ones. Refer to Common Customizations.
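    Assuming the query deployment exposes the standard Kubernetes Deployment strategy field, the change is a sketch like the following; the field path in the pipeline Helm values may differ.

    ```yaml
    # Illustrative sketch: use the Recreate strategy so the running query pod
    # is deleted, releasing its resources, before the replacement pod is scheduled.
    spec:
      strategy:
        type: Recreate
    ```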

    The NeMo microservices pods are managed as Kubernetes stateful sets. You must manually delete the currently running pods if no GPU resources are allocatable.
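    As a sketch, assuming the stateful sets are named nemollm-inference and nemollm-embedding (so their pods follow the <name>-0 convention), the deletion looks like the following; verify the actual pod names with kubectl get pods first.

    ```shell
    # Illustrative: delete the microservice pods so the stateful set controller
    # recreates them after GPU resources become allocatable.
    # Pod names are assumptions; confirm with: kubectl get pods -n <namespace>
    kubectl delete pod nemollm-inference-0 -n <namespace>
    kubectl delete pod nemollm-embedding-0 -n <namespace>
    ```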


The following table identifies the licenses for the software components related to the Operator.


Artifact Type                                              Artifact License                         Source Code License

NVIDIA Enterprise RAG LLM Operator Helm Chart              Apache 2                                 Apache 2
NVIDIA Enterprise RAG LLM Operator Image                   NVIDIA Deep Learning Container License   Apache 2
NVIDIA Inference Microservice Helm Chart                   NVIDIA Proprietary                       NVIDIA Proprietary
NVIDIA NeMo Retriever Embedding Microservice Helm Chart    NVIDIA Proprietary                       NVIDIA Proprietary

The NVIDIA Proprietary license is accessible to early-access participants only.

© Copyright 2024, NVIDIA. Last updated on Mar 21, 2024.