Features
The Operator is enhanced to deploy the NVIDIA NIM for LLMs microservice rather than NVIDIA Inference Microservice. This update adds support for the vLLM backend, which provides the flexibility to use the latest models and features that are not yet available in the TensorRT-LLM backend.
The vLLM backend enables deploying models directly from a Hugging Face model checkpoint. Use the vLLM backend when NVIDIA NGC does not provide a prebuilt model for your GPU model or for the number of GPUs in your system.
Refer to Supported Inference Models and GPU Requirements for more information.
NVIDIA NGC is updated with models and model versions to work with NVIDIA NIM for LLMs release 24.02. Now, the model version, such as a100x2_fp16_24.02, encodes the following information:
Required GPU model, such as A100.
Required GPU count, such as 2.
Model release, such as 24.02.
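As an illustration, the encoded fields can be unpacked with a small shell sketch. The field names below reflect our reading of the naming pattern, not an official schema:

```shell
# Parse a model version string such as a100x2_fp16_24.02 into its
# GPU model, GPU count, precision, and release fields.
version="a100x2_fp16_24.02"

# Split on underscores: <gpu><count>_<precision>_<release>
IFS=_ read -r gpu_spec precision release <<< "$version"

gpu_model="${gpu_spec%x*}"   # a100 -> required GPU model
gpu_count="${gpu_spec##*x}"  # 2    -> required GPU count

echo "GPU model: $gpu_model"
echo "GPU count: $gpu_count"
echo "Precision: $precision"
echo "Release:   $release"
```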
Refer to Changing the Inference Model for the TensorRT-LLM Backend for more information.
In previous releases, the Helm pipeline manifest file supported adding credentials for secrets in the manifest. Now, you need to create the secrets imperatively rather than entering credentials in a file.
The Install the RAG LLM Operator procedure includes the command to create the NGC secret.
The procedures in Sample RAG Pipeline include the commands to create secrets for NGC and Hugging Face.
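The imperative approach can be sketched with standard kubectl secret commands. The secret names, namespace, and key names below are placeholders, not the exact values the Operator expects; follow the linked procedures for the authoritative commands:

```shell
# Illustrative only: create an image pull secret for NGC.
# The secret name and namespace are placeholders.
kubectl create secret docker-registry ngc-secret \
  --namespace rag-sample \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Illustrative only: create a generic secret holding a Hugging Face token.
kubectl create secret generic hf-secret \
  --namespace rag-sample \
  --from-literal=token="$HF_TOKEN"
```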
Known Issues
On VMware vSphere with Tanzu clusters that use vGPU software, an inference model that requires more than one GPU is supported only when the NVIDIA A100 or H100 GPUs are connected with NVLink or NVLink Switch. These clusters also do not support multi-GPU models with L40S GPUs and vGPU software.
Modifying a Helm pipeline specification and applying the change might not roll out the change. As a workaround, you can roll out the change manually with the following command:
kubectl rollout restart sts
The Operator is not verified in an air-gapped network environment.
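For example, the restart can be scoped to the pipeline's namespace. The namespace and StatefulSet name below are placeholders for your deployment's actual values:

```shell
# Placeholders: adjust the namespace and StatefulSet name to your deployment.
# Restart every StatefulSet in the pipeline namespace:
kubectl rollout restart sts -n rag-sample

# Or restart a single StatefulSet by name:
kubectl rollout restart sts my-pipeline-nim -n rag-sample
```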
Initial Release
The initial release of the NVIDIA Enterprise RAG LLM Operator enables NVIDIA AI Enterprise customers to deploy an Operator that manages the life cycle of the following key components for RAG pipelines:
NVIDIA Inference Microservice
NVIDIA NeMo Retriever Embedding Microservice
NVIDIA provides a sample RAG pipeline that demonstrates deploying an LLM, pgvector as a sample vector database, a chatbot web application, and a query server that communicates with the microservices and the vector database.
Known Issues
Autoscaling the microservices is not operational. As a workaround, you can scale the microservices manually with the following command:
kubectl scale sts --replicas=<n>
Modifying a Helm pipeline specification and applying the change might not roll out the change. As a workaround, you can roll out the change manually with the following command:
kubectl rollout restart sts
The Operator is not verified in an air-gapped network environment.