Release Notes

Enterprise RAG LLM Operator (Latest Version)

Features

  • The Operator is enhanced to deploy the NVIDIA NIM for LLMs microservice rather than NVIDIA Inference Microservice. This update adds support for the vLLM backend, which provides the flexibility to use the latest models and features that are not yet included in the TensorRT-LLM backend.

    The vLLM backend enables deploying models directly from a Hugging Face model checkpoint. Use the vLLM backend when NVIDIA NGC does not provide a prebuilt model for your GPU model or the number of GPUs in your system.

    Refer to Supported Inference Models and GPU Requirements for more information.
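
    As an illustration of what the vLLM backend provides, the open-source vLLM server can load a Hugging Face checkpoint directly. The following sketch runs vLLM standalone; the model name and tensor-parallel size are example values, and this is not the Operator's own deployment path:

      # Illustration only: serve a Hugging Face checkpoint with the
      # open-source vLLM OpenAI-compatible server.
      pip install vllm
      python -m vllm.entrypoints.openai.api_server \
          --model meta-llama/Llama-2-7b-chat-hf \
          --tensor-parallel-size 2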

  • NVIDIA NGC is updated with models and model versions that work with NVIDIA NIM for LLMs release 24.02. The model version, such as a100x2_fp16_24.02, now encodes the following information:

    • Required GPU model, such as A100.

    • Required GPU count, such as 2.

    • Model release, such as 24.02.

    Refer to Changing the Inference Model for the TensorRT-LLM Backend for more information.
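
    For illustration, the version string can be decoded with standard shell parameter expansion. This sketch assumes the naming pattern shown above; the fp16 token, presumably the model precision, is not parsed here:

      version="a100x2_fp16_24.02"
      gpu_model="${version%%x*}"    # a100  (required GPU model)
      rest="${version#*x}"          # 2_fp16_24.02
      gpu_count="${rest%%_*}"       # 2     (required GPU count)
      release="${version##*_}"      # 24.02 (model release)
      echo "GPU: ${gpu_model}, count: ${gpu_count}, release: ${release}"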

  • In previous releases, the Helm pipeline manifest file supported adding credentials for secrets in the manifest. Now, you need to create the secrets imperatively rather than entering credentials in a file, as shown in the sketch below.
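
    For example, a secret that holds an NGC API key can be created imperatively. The secret name, key name, and namespace below are placeholders; use the names that your pipeline manifest references:

      kubectl create secret generic ngc-api-secret \
          --from-literal=NGC_API_KEY=<your-ngc-api-key> \
          --namespace <pipeline-namespace>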

Known Issues

  • On VMware vSphere with Tanzu clusters that use vGPU software, an inference model that requires more than one GPU is supported only when the NVIDIA A100 or H100 GPUs are connected with NVLink or NVLink Switch. These clusters also do not support multi-GPU models with L40S GPUs and vGPU software.

  • Modifying a Helm pipeline specification and applying the change might not roll out the change. As a workaround, you can roll out the change manually with the kubectl rollout restart sts command, as shown below.
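
    A workaround sketch; the namespace and StatefulSet name are placeholders for the resources in your pipeline:

      kubectl get sts --namespace <pipeline-namespace>
      kubectl rollout restart sts/<statefulset-name> --namespace <pipeline-namespace>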

  • The Operator is not verified in an air-gapped network environment.

Initial Release

The initial release of the NVIDIA Enterprise RAG LLM Operator enables NVIDIA AI Enterprise customers to deploy an Operator that manages the life cycle of the following key components for RAG pipelines:

  • NVIDIA Inference Microservice

  • NVIDIA NeMo Retriever Embedding Microservice

NVIDIA provides a sample RAG pipeline that demonstrates deploying an LLM model, pgvector as a sample vector database, a chatbot web application, and a query server that communicates with the microservices and the vector database.

Known Issues

  • Autoscaling the microservices is not operational. As a workaround, you can scale the microservices manually with the kubectl scale sts <statefulset-name> --replicas=<n> command, as shown below.
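
    A workaround sketch; the StatefulSet name, replica count, and namespace are placeholders:

      kubectl scale sts <statefulset-name> --replicas=2 --namespace <pipeline-namespace>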

  • Modifying a Helm pipeline specification and applying the change might not roll out the change. As a workaround, you can roll out the change using the kubectl rollout restart sts command.

  • The Operator is not verified in an air-gapped network environment.

© Copyright 2024, NVIDIA Corporation. Last updated on May 21, 2024.