About Helm Pipelines

Enterprise RAG LLM Operator - (Latest Version)

A Helm pipeline is a Kubernetes custom resource definition, helmpipelines.package.nvidia.com, developed by NVIDIA to manage multiple Helm charts as a single resource. The Enterprise RAG LLM Operator watches for instances of the Helm pipeline custom resource and deploys and manages the lifecycle of the software that they describe.

The primary purpose of the Operator and the Helm pipeline CRs is to manage RAG applications. RAG applications deployed by the Operator include the following software components:

  • NVIDIA NeMo Inference Microservice

    The microservice provides GPU-accelerated access to state-of-the-art large language models (LLMs).

  • NVIDIA NeMo Retriever Embedding Microservice

    The microservice provides GPU-accelerated access to state-of-the-art text embedding.

The Operator and the Helm pipeline CR also support managing the following optional software components:

  • RAG Playground

    The application provides a user interface for entering queries that are answered by the inference microservice. The application also supports uploading documents that the embedding microservice processes and stores as embeddings in a vector database.

  • Chain Server

    NVIDIA developed a chain server that communicates with the inference server. The server also supports retrieving embeddings from the vector database before submitting a query to the inference server to perform retrieval augmented generation.

  • Vector Database

    The Chain Server supports connecting to either Milvus or pgvector. NVIDIA provides a sample RAG pipeline that deploys pgvector to simplify demonstrating the inference and embedding microservices.
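As an illustration of that choice, the Chain Server's vector store connection is controlled by environment variables in the chart values. The pgvector values below match the sample pipeline on this page; the commented-out Milvus endpoint is a hypothetical alternative shown only for illustration.

```yaml
# Chain Server vector store settings, as they appear under
# spec.pipeline[].chartValues.query.env in the sample pipeline.
env:
  APP_VECTORSTORE_NAME: "pgvector"
  APP_VECTORSTORE_URL: "pgvector:5432"
  # To use Milvus instead of pgvector (hypothetical host and port):
  # APP_VECTORSTORE_NAME: "milvus"
  # APP_VECTORSTORE_URL: "http://milvus:19530"
```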

The following figure shows a high-level overview of the software components and the communication between the components.

pipeline-components.png

The primary benefit of a Helm pipeline is that the Operator manages the lifecycle of the software components. For example, you can upgrade a microservice container image or model by editing the custom resource and applying the updated manifest.
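For instance, upgrading the embedding microservice could be as simple as editing the image tag in the custom resource and reapplying the manifest with kubectl apply -f. The excerpt below uses field names from the sample pipeline on this page; the "24.02" tag is a hypothetical newer version used only for illustration.

```yaml
# Excerpt of the HelmPipeline chartValues for the embedding microservice.
# Changing the tag and reapplying the manifest prompts the Operator to
# roll out the new image ("24.02" is a hypothetical newer tag).
image:
  repository: nvcr.io/ohlfw0olaadg/ea-participants/nemo-retriever-embedding-microservice
  pullPolicy: IfNotPresent
  tag: "24.02"
```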

The primary limitation of a Helm pipeline is reduced flexibility. The custom resource must deploy an instance of the inference microservice and an instance of the embedding microservice. Using remotely hosted inference or embedding services is not supported.

If you deploy the optional software components, RAG Playground and Chain Server, the custom resource must deploy both. The Operator does not support deploying only one of those two components. The vector database is optional and can be disabled individually.

The following YAML file shows a sample Helm pipeline that deploys the NeMo microservices, RAG Playground, Chain Server, and pgvector.

The spec.pipeline.repoEntry.chartValues field corresponds to a values.yaml file for a Helm chart.

apiVersion: package.nvidia.com/v1alpha1
kind: HelmPipeline
metadata:
  labels:
    app.kubernetes.io/name: helmpipeline
    app.kubernetes.io/instance: helmpipeline-sample
    app.kubernetes.io/part-of: k8s-rag-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: k8s-rag-operator
  name: my-sample-pipeline
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-inference"
    chartSpec:
      chart: "nemollm-inference"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-inference"
      model:
        name: llama-2-13b-chat
        numGpus: 1
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
      resources:
        limits:
          nvidia.com/gpu: 1  # Number of GPUs to present to the running service
      imagePullSecret:
        # Leave blank, if no imagePullSecret is needed.
        registry: "nvcr.io"
        name: "ngc-secret"
        # If set to false, the chart expects either a imagePullSecret
        # with the name configured above to be present on the cluster or that no
        # credentials are needed.
        create: true
        username: '$oauthtoken'
        password: ""
      # persist model to a PVC
      persistence:
        enabled: true
        existingClaim: "nemollm-inference-pvc"
        # Persistent Volume Storage Class
        # If defined, storageClassName: <storageClass>
        # If set to "-", storageClassName: "", which disables dynamic provisioning.
        # If undefined (the default) or set to null, no storageClassName spec is
        # set, choosing the default provisioner.
        storageClass: ""
        accessMode: ReadWriteOnce  # If using an NFS or similar setup, you can use ReadWriteMany
        size: 50Gi  # size of claim in bytes (e.g. 8Gi)
        annotations: {}
      # persist model to a host path
      hostPath:
        enabled: false
        path: /model-store-inference  # Only required if hostPath is enabled -- path to the model-store-inference
      # model init containers, select only one - if needed.
      initContainers:
        ngcInit:  # disabled by default
          imageName: nvcr.io/ohlfw0olaadg/ea-rag-examples/ngc-cli  # should either have ngc cli pre-installed or wget + unzip pre-installed -- must not be musl-based (alpine)
          imageTag: v3.37.1
          secret:  # name of kube secret for ngc keys named NGC_CLI_API_KEY (required) and NGC_DECRYPT_KEY (optional)
            name: ngc-api-secret
            create: true
            apiKey: ""  # NGC_CLI_API_KEY
            decryptKey: ""  # NGC_DECRYPT_KEY
          env:
            STORE_MOUNT_PATH: /model-store
            NGC_CLI_ORG: ohlfw0olaadg  # ngc org where model lives
            NGC_CLI_TEAM: ea-participants  # ngc team where model lives
            NGC_MODEL_NAME: llama-2-13b-chat  # model name in ngc
            NGC_MODEL_VERSION: LLAMA-2-13B-CHAT-4K-FP16-1-A100.24.01  # model version in ngc
            NGC_EXE: ngc  # path to ngc cli, if pre-installed in container
            DOWNLOAD_NGC_CLI: "false"  # set to string 'true' if container should download and install ngc cli
            NGC_CLI_VERSION: "3.37.1"  # version of ngc cli to download (only matters if downloading)
            TARFILE: "true"  # tells the script to untar the model. defaults to "true" as LLM models are archived in NGC.
            MODEL_NAME: LLAMA-2-13B-CHAT-4K-FP16-1-A100.24.01  # actual model name, once downloaded
        extraInit: []
        # Add any additional init containers your use case requires.
        # - # full init container definition here
  - repoEntry:
      name: nemollm-embedding
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-embedding"
    chartSpec:
      chart: "nemollm-embedding"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-embedding"
      image:
        repository: nvcr.io/ohlfw0olaadg/ea-participants/nemo-retriever-embedding-microservice
        pullPolicy: IfNotPresent
        # Tag overrides the image tag whose default is the chart appVersion.
        tag: "24.01"
      imagePullSecret:
        # Leave blank, if no imagePullSecret is needed.
        registry: "nvcr.io"
        name: "ngc-secret"
        # If set to false, the chart expects either a imagePullSecret
        # with the name configured above to be present on the cluster or that no
        # credentials are needed.
        create: false
        username: '$oauthtoken'
        password: ""
      resources:
        limits:
          nvidia.com/gpu: 1  # Number of GPUs to present to the running service
      # persist model to a PVC
      persistence:
        enabled: true
        existingClaim: "nemollm-embedding-pvc"
        # Persistent Volume Storage Class
        # If defined, storageClassName: <storageClass>
        # If set to "-", storageClassName: "", which disables dynamic provisioning.
        # If undefined (the default) or set to null, no storageClassName spec is
        # set, choosing the default provisioner.
        storageClass: ""
        accessMode: ReadWriteOnce  # If using an NFS or similar setup, you can use ReadWriteMany
        size: 50Gi  # size of claim in bytes (e.g. 8Gi)
        annotations: {}
      # persist model to a host path
      hostPath:
        enabled: false
        path: /model-store-embedding  # Only required if hostPath is enabled -- path to the model-store-embedding
      # model init containers, select only one - if needed.
      initContainers:
        ngcInit:  # disabled by default
          imageName: nvcr.io/ohlfw0olaadg/ea-rag-examples/ngc-cli  # should either have ngc cli pre-installed or wget + unzip pre-installed -- must not be musl-based (alpine)
          imageTag: v3.37.1
          secret:  # name of kube secret for ngc keys named NGC_CLI_API_KEY (required) and NGC_DECRYPT_KEY (optional)
            name: ngc-api-secret
            create: false
            apiKey: ""  # NGC_CLI_API_KEY
            decryptKey: ""  # NGC_DECRYPT_KEY
          env:
            STORE_MOUNT_PATH: /model-store
            NGC_CLI_ORG: ohlfw0olaadg  # ngc org where model lives
            NGC_CLI_TEAM: ea-participants  # ngc team where model lives
            NGC_MODEL_ID: NV-Embed-QA  # model ID for config file template used for TRT conversion
            NGC_MODEL_NAME: nv-embed-qa  # model name in ngc
            NGC_MODEL_VERSION: "003"  # model version in ngc
            NGC_EXE: ngc  # path to ngc cli, if pre-installed in container
            DOWNLOAD_NGC_CLI: "false"  # set to string 'true' if container should download and install ngc cli
            NGC_CLI_VERSION: "3.37.1"  # version of ngc cli to download (only matters if downloading)
            TARFILE: "false"  # tells the script to untar the model. defaults to "false" as embedding models are not archived in NGC.
            MODEL_NAME: NV-Embed-QA-003.nemo  # actual model name, once downloaded
        extraInit: []
        # Add any additional init containers your use case requires.
        # - # full init container definition here
  - repoEntry:
      name: rag-llm-app
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/rag-llm-app"
    chartSpec:
      chart: "rag-llm-app"
      wait: false
    chartValues:
      query:
        image: nvcr.io/ohlfw0olaadg/ea-rag-examples/rag-application-text-chatbot:0.4.0
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1  # Number of GPUs to present to the running service
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
          APP_VECTORSTORE_URL: "pgvector:5432"
          APP_VECTORSTORE_NAME: "pgvector"
          APP_LLM_SERVERURL: "nemollm-inference:8005"  # openai port of inference service
          APP_LLM_MODELNAME: llama-2-13b-chat
          APP_LLM_MODELENGINE: nemo-infer
          APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8080"
          APP_EMBEDDINGS_MODELNAME: NV-Embed-QA
          APP_EMBEDDINGS_MODELENGINE: nemo-embed
          APP_CONFIG_FILE: /dev/null
          NVAPI_KEY: ""
          POSTGRES_PASSWORD: password
          POSTGRES_USER: postgres
          POSTGRES_DB: api
          COLLECTION_NAME: canonical-rag
        service:
          type: ClusterIP
          targetPort: 8081
          ports:
          - port: 8081
            targetPort: http
            protocol: TCP
            name: http
      frontend:
        image: nvcr.io/ohlfw0olaadg/ea-rag-examples/rag-playground:0.4.0
        replicas: 1
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
        - name: APP_MODELNAME
          value: "llama-2-13b-chat"
        - name: APP_SERVERPORT
          value: "8081"
        - name: APP_SERVERURL
          value: http://query-router
        - name: RIVA_API_URI
          value: ""
        - name: RIVA_API_KEY
          value: ""
        - name: RIVA_FUNCTION_ID
          value: ""
        - name: TTS_SAMPLE_RATE
          value: 48000
        service:
          type: NodePort
          targetPort: 8090
          ports:
          - port: 8090
            targetPort: http
            protocol: TCP
            name: http
      pgvector:
        image: ankane/pgvector:v0.5.1
        replicas: 1
        resources: {}
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
        - name: POSTGRES_DB
          value: "api"
        - name: POSTGRES_PASSWORD
          value: "password"
        - name: POSTGRES_USER
          value: "postgres"
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        service:
          type: ClusterIP
          targetPort: 5432
          ports:
          - port: 5432
            targetPort: http
            protocol: TCP
            name: http
        # persist data to a persistent volume
        persistence:
          enabled: true
          existingClaim: "pgvector-pvc"
          # Persistent Volume Storage Class
          # If defined, storageClassName: <storageClass>
          # If set to "-", storageClassName: "", which disables dynamic provisioning.
          # If undefined (the default) or set to null, no storageClassName spec is
          # set, choosing the default provisioner.
          storageClass: ""
          accessMode: ReadWriteOnce  # If using an NFS or similar setup, you can use ReadWriteMany
          size: 50Gi  # size of claim in bytes (e.g. 8Gi)
          annotations: {}
        hostPath:
          enabled: false
          path: /pgvector-data-store  # Only required if hostPath is enabled -- path to the pgvector data-store

© Copyright 2024, NVIDIA. Last updated on Mar 21, 2024.