About the Operator
The NVIDIA Enterprise Retrieval Augmented Generation (RAG) Large Language Model (LLM) Operator enables cluster administrators to operate the software components and services that are necessary to run RAG pipelines in Kubernetes.
The NVIDIA Enterprise RAG LLM Operator provides early access to an Operator that manages the life cycle of the following key components for RAG pipelines:
NVIDIA Inference Microservice
NVIDIA NeMo Retriever Embedding Microservice
NVIDIA provides a sample RAG pipeline to demonstrate deploying an LLM model, pgvector as a sample vector database, a chat bot web application, and a query server that communicates with the microservices and the vector database.
Early Access
The Operator is available for early access (EA) use only. EA releases are not supported in production environments and are not functionally complete. EA releases provide access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
You request early access by filling out the form at https://developer.nvidia.com/nemo-microservices-early-access.
Limitations
The Operator has the following limitations:
The Operator supports a single inference model and a single embedding model in a namespace. Using a single instance of the applications has two consequences:
To use more than one model, create another namespace and configure the pipeline in the new namespace.
The models are GPU-specific. For example, the Operator does not support a mix of some nodes with NVIDIA A100 GPUs and other nodes with NVIDIA H100 GPUs with a single instance of the microservices. To support nodes with different GPUs, create another namespace and specify the GPU model in the node selector when you deploy the pipeline.
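As an illustration, a pipeline in a second namespace could pin its microservice pods to nodes with a specific GPU model through a node selector. The sketch below assumes the `nvidia.com/gpu.product` node label applied by NVIDIA GPU Feature Discovery; the exact field path and label value in your Helm pipeline values may differ.

```yaml
# Hypothetical excerpt from a Helm pipeline values file; field names are
# illustrative. Schedules the microservice only to nodes that GPU Feature
# Discovery labeled as NVIDIA H100 GPUs.
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
```

A second namespace for A100 nodes would use the same field with the corresponding A100 product label.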
When you specify an inference or embedding model for the RAG pipeline, the Operator configures an init container to download the model from NVIDIA NGC before starting the microservice. The Operator does not support using local models that are not downloaded by the init container.
If all the GPUs in your cluster are allocated and you change a Helm pipeline in a way that requires starting a new nemollm-inference, nemollm-embedding, or query pod, the new pods become stuck in a Pending state because Kubernetes cannot schedule them to a node with an allocatable GPU resource.
The query pod is managed as a Kubernetes deployment. You can change the deployment strategy to Recreate instead of the default RollingUpdate strategy. The Recreate strategy causes Kubernetes to delete the currently running pods before starting the new pods. Refer to Common Customizations.
The NeMo microservices pods are managed as Kubernetes stateful sets. You must manually delete the currently running pods if no GPU resources are allocatable.
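The Recreate strategy for the query deployment can be sketched as a standard Kubernetes Deployment spec. The deployment name, labels, and image below are hypothetical placeholders; refer to Common Customizations for the supported way to configure the strategy in the pipeline.

```yaml
# Sketch of a Deployment using the Recreate strategy. With Recreate,
# Kubernetes deletes the running pods before starting new ones, so the
# GPU resources they hold are freed before the replacements are scheduled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: query                  # hypothetical name for the query deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: query
  strategy:
    type: Recreate             # default is RollingUpdate
  template:
    metadata:
      labels:
        app: query
    spec:
      containers:
      - name: query
        image: query-server:latest   # placeholder image
```

The stateful sets for the NeMo microservices do not offer an equivalent strategy, which is why their pods must be deleted manually when no GPUs are allocatable.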
Licenses
The following table identifies the licenses for the software components related to the Operator.
Component | Artifact Type | Artifact Licenses | Source Code License |
---|---|---|---|
NVIDIA Enterprise RAG LLM Operator | Helm Chart | Apache 2 | Apache 2 |
NVIDIA Enterprise RAG LLM Operator | Image | NVIDIA Deep Learning Container License | Apache 2 |
NVIDIA Inference Microservice | Helm Chart | NVIDIA Proprietary | NVIDIA Proprietary |
NVIDIA NeMo Retriever Embedding Microservice | Helm Chart | NVIDIA Proprietary | NVIDIA Proprietary |
The NVIDIA Proprietary license is accessible to early-access participants only.
Third Party Software
The Chain Server that you can deploy with the sample Helm pipelines uses third-party software. You can download the Third Party Licenses.