A Helm pipeline is a Kubernetes custom resource definition, `helmpipelines.package.nvidia.com`, developed by NVIDIA to manage multiple Helm charts as a single resource. The Enterprise RAG LLM Operator watches for instances of the Helm pipeline custom resource and deploys and manages the lifecycle of the software.
The primary purpose of the Operator and the Helm pipeline CRs is to manage RAG applications. RAG applications deployed by the Operator include the following software components:
NVIDIA NeMo Inference Microservice
The microservice provides GPU-accelerated access to state-of-the-art large language models (LLMs).
NVIDIA NeMo Retriever Embedding Microservice
The microservice provides GPU-accelerated access to state-of-the-art text embedding models.
The Operator and the Helm pipeline CR also support managing the following optional software components:
RAG Playground
The application provides a user interface for entering queries that are answered by the inference microservice. The application also supports uploading documents that the embedding microservice processes and stores as embeddings in a vector database.
Chain Server
NVIDIA developed a chain server that communicates with the inference microservice. The server can also retrieve embeddings from the vector database before submitting a query to the inference microservice to perform retrieval-augmented generation.
Vector Database
The Chain Server supports connecting to either Milvus or pgvector. NVIDIA provides a sample RAG pipeline that deploys pgvector to simplify demonstrating the inference and embedding microservices.
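The retrieve-then-generate flow that the Chain Server performs can be sketched in plain Python. This is a toy illustration only, not the Chain Server's actual code: the keyword-count `embed` function, the in-memory `STORE`, and the prompt-building `answer` function are hypothetical stand-ins for the embedding microservice, the vector database, and the request to the inference microservice.

```python
import math

# Toy "embedding": one dimension per vocabulary word (stand-in for a real
# embedding model such as the one served by the embedding microservice).
VOCAB = ["gpu", "helm", "operator", "embedding"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# In-memory document store (stand-in for Milvus or pgvector).
DOCS = [
    "The operator watches helm pipeline resources.",
    "Each gpu runs one model replica.",
]
STORE = [(doc, embed(doc)) for doc in DOCS]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank stored documents by similarity to the query embedding."""
    qv = embed(query)
    ranked = sorted(STORE, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def answer(query: str) -> str:
    """Build an augmented prompt; a real chain server would POST this
    prompt to the inference microservice and return the completion."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(answer("What does the helm operator watch?"))
```

The essential point is the ordering: the query is embedded and matched against stored document embeddings first, and only the retrieved context plus the question is sent to the LLM.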
The following figure shows a high-level overview of the software components and the communication between the components.
![pipeline-components.png](https://docscontent.nvidia.com/dims4/default/c716c76/2147483647/strip/true/crop/661x314+0+0/resize/661x314!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F00000190-5b41-d6a7-ad9c-7fe95cea0000%2Fai-enterprise%2Frag-llm-operator%2F24.3.0%2F_images%2Fpipeline-components.png)
The primary benefit of a Helm pipeline is that the Operator manages the lifecycle of the software components: for example, you can upgrade a microservice container image or model by editing the custom resource and applying the updated manifest.
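As an illustration, an image upgrade can be as small as changing the image tag in the `chartValues` of the relevant pipeline entry and re-applying the manifest. The fragment below is trimmed for clarity; the tag shown is the one from the sample pipeline, and a newer tag would replace it:

```yaml
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      chartValues:
        image:
          repository: nvcr.io/ohlfw0olaadg/ea-participants/nim_llm
          tag: 24.02-day0  # change this tag, then re-apply the manifest to upgrade
```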
The primary limitation of a Helm pipeline is its limited flexibility. The custom resource must deploy an instance of the inference microservice and an instance of the embedding microservice. Using remotely hosted inference or embedding services is not supported.
If you deploy the optional software components, RAG Playground and Chain Server, the custom resource must deploy both. The Operator does not support deploying only one of those two components. The vector database is optional and can be disabled individually.
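For example, the sample vector database can be turned off by setting `enabled: false` on the pgvector values of the `rag-llm-app` pipeline entry; you would then point the Chain Server at a Milvus or pgvector instance that you manage yourself. A trimmed, illustrative fragment:

```yaml
- repoEntry:
    name: rag-llm-app
    chartValues:
      pgvector:
        enabled: false
```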
The following YAML file shows a sample Helm pipeline that deploys NVIDIA NIM for LLMs with the vLLM backend, the NVIDIA NeMo Retriever Embedding microservice, RAG Playground, Chain Server, and pgvector. The `spec.pipeline.repoEntry.chartValues` field corresponds to a `values.yaml` file for a Helm chart.
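The correspondence between `chartValues` and a chart's `values.yaml` can be made concrete with a few lines of Python: extracting `chartValues` from a pipeline entry yields exactly the mapping you would otherwise place in that chart's values file. The dict below is a hypothetical, trimmed-down excerpt of the custom resource, and `values_for` is an illustrative helper, not part of the Operator.

```python
# Trimmed-down HelmPipeline manifest as a plain dict (hypothetical excerpt).
pipeline_cr = {
    "apiVersion": "package.nvidia.com/v1alpha1",
    "kind": "HelmPipeline",
    "spec": {
        "pipeline": [
            {
                "repoEntry": {
                    "name": "nemollm-inference",
                    "url": "file:///helm-charts/pipeline",
                    "chartSpec": {"chart": "nemollm-inference", "wait": False},
                    "chartValues": {
                        "fullnameOverride": "nemollm-inference",
                        "backend": "vllm",
                    },
                },
            }
        ]
    },
}

def values_for(cr: dict, chart: str) -> dict:
    """Return the chartValues of one pipeline entry -- the equivalent
    of that chart's values.yaml contents."""
    for entry in cr["spec"]["pipeline"]:
        repo = entry["repoEntry"]
        if repo["chartSpec"]["chart"] == chart:
            return repo["chartValues"]
    raise KeyError(chart)

print(values_for(pipeline_cr, "nemollm-inference"))
```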
```yaml
apiVersion: package.nvidia.com/v1alpha1
kind: HelmPipeline
metadata:
  labels:
    app.kubernetes.io/name: helmpipeline
    app.kubernetes.io/instance: helmpipeline-sample
    app.kubernetes.io/part-of: k8s-rag-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: k8s-rag-operator
  name: my-sample-pipeline
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-inference"
      chartSpec:
        chart: "nemollm-inference"
        wait: false
      chartValues:
        fullnameOverride: "nemollm-inference"
        backend: "vllm"
        model:
          name: Llama-2-13b-chat-hf # LLM model name
          config: /model-store/model_config.yaml
          numGpus: 1
          # num_workers: 1
        vllm_config:
          engine:
            model: /model-store
            enforce_eager: false
            max_context_len_to_capture: 8192
            max_num_seqs: 256
            dtype: float16
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
        resources:
          limits:
            nvidia.com/gpu: 1 # Number of GPUs to present to the running service
        image:
          repository: nvcr.io/ohlfw0olaadg/ea-participants/nim_llm
          pullPolicy: IfNotPresent
          tag: 24.02-day0
        imagePullSecret:
          # Leave blank, if no imagePullSecret is needed.
          name: "ngc-secret"
        # persist model to a PVC
        persistence:
          enabled: true
          existingClaim: "nemollm-inference-pvc"
          # Persistent Volume Storage Class
          # If defined, storageClassName: <storageClass>
          # If set to "-", storageClassName: "", which disables dynamic provisioning.
          # If undefined (the default) or set to null, no storageClassName spec is
          # set, choosing the default provisioner.
          storageClass: ""
          accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
          size: 50Gi # size of claim in bytes (e.g. 8Gi)
          annotations: {}
        # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
        updateStrategy:
          type: RollingUpdate
        # persist model to a host path
        hostPath:
          enabled: false
          path: /model-store-inference # Only required if hostPath is enabled -- path to the model-store-inference
        # model init containers, select only one - if needed.
        initContainers:
          ngcInit: [] # disabled by default
          hfInit:
            imageName: bitnami/git
            imageTag: latest
            secret: # name of kube secret for hf with keys named HF_USER and HF_PAT
              name: hf-secret
            env:
              STORE_MOUNT_PATH: /model-store
              HF_MODEL_NAME: Llama-2-13b-chat-hf # HF model name
              HF_MODEL_ORG: meta-llama # HF org where model lives
              USE_SHALLOW_LFS_CLONE: 0 # Disable shallow LFS clone by default
  - repoEntry:
      name: nemollm-embedding
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-embedding"
      chartSpec:
        chart: "nemollm-embedding"
        wait: false
      chartValues:
        fullnameOverride: "nemollm-embedding"
        image:
          repository: nvcr.io/ohlfw0olaadg/ea-participants/nemo-retriever-embedding-microservice
          pullPolicy: IfNotPresent
          # Tag overrides the image tag whose default is the chart appVersion.
          tag: 24.02
        imagePullSecret:
          # Leave blank, if no imagePullSecret is needed.
          name: "ngc-secret"
        nodeSelector: {}
        resources:
          limits:
            nvidia.com/gpu: 1 # Number of GPUs to present to the running service
        # persist model to a PVC
        persistence:
          enabled: true
          existingClaim: "nemollm-embedding-pvc"
          # Persistent Volume Storage Class
          # If defined, storageClassName: <storageClass>
          # If set to "-", storageClassName: "", which disables dynamic provisioning.
          # If undefined (the default) or set to null, no storageClassName spec is
          # set, choosing the default provisioner.
          storageClass: ""
          accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
          size: 50Gi # size of claim in bytes (e.g. 8Gi)
          annotations: {}
        # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
        updateStrategy:
          type: RollingUpdate
        # persist model to a host path
        hostPath:
          enabled: false
          path: /model-store-embedding # Only required if hostPath is enabled -- path to the model-store-embedding
        # model init containers, select only one - if needed.
        initContainers:
          ngcInit: # disabled by default
            imageName: nvcr.io/ohlfw0olaadg/ea-participants/ngc-cli # should either have ngc cli pre-installed or wget + unzip pre-installed -- must not be musl-based (alpine)
            imageTag: v3.41.2
            secret: # name of kube secret for ngc keys named NGC_CLI_API_KEY (required) and NGC_DECRYPT_KEY (optional)
              name: ngc-api-secret
            env:
              STORE_MOUNT_PATH: /model-store
              NGC_CLI_ORG: ohlfw0olaadg # ngc org where model lives
              NGC_CLI_TEAM: ea-participants # ngc team where model lives
              NGC_MODEL_ID: NV-Embed-QA # model ID for config file template used for TRT conversion
              NGC_MODEL_NAME: nv-embed-qa # model name in ngc
              NGC_MODEL_VERSION: "4" # model version in ngc
              NGC_EXE: ngc # path to ngc cli, if pre-installed in container
              DOWNLOAD_NGC_CLI: "false" # set to string 'true' if container should download and install ngc cli
              NGC_CLI_VERSION: "3.41.1" # version of ngc cli to download (only matters if downloading)
              MODEL_NAME: NV-Embed-QA-4.nemo # actual model name, once downloaded
          extraInit: [] # Add any additional init containers your use case requires.
          # - # full init container definition here
  - repoEntry:
      name: rag-llm-app
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/rag-llm-app"
      chartSpec:
        chart: "rag-llm-app"
        wait: false
      chartValues:
        query:
          # Deployment update strategy. Accepted values: RollingUpdate, Recreate
          deployStrategy:
            type: RollingUpdate
          image: nvcr.io/ohlfw0olaadg/ea-participants/rag-application-text-chatbot:24.03
          replicas: 1
          nodeSelector: {}
          tolerations: {}
          affinity: {}
          env:
            APP_VECTORSTORE_URL: "pgvector:5432"
            APP_VECTORSTORE_NAME: "pgvector"
            APP_LLM_SERVERURL: "nemollm-inference:8005" # openai port of inference service
            APP_LLM_MODELNAME: Llama-2-13b-chat-hf # HF model name
            APP_LLM_MODELENGINE: nvidia-ai-endpoints-nim
            APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8080"
            APP_EMBEDDINGS_MODELNAME: NV-Embed-QA
            APP_EMBEDDINGS_MODELENGINE: nemo-embed
            POSTGRES_PASSWORD: password
            POSTGRES_USER: postgres
            POSTGRES_DB: api
            COLLECTION_NAME: canonical_rag
            NVIDIA_API_KEY: ""
          service:
            type: ClusterIP
            targetPort: 8081
            ports:
            - port: 8081
              targetPort: http
              protocol: TCP
              name: http
        frontend:
          # Deployment update strategy. Accepted values: RollingUpdate, Recreate
          deployStrategy:
            type: RollingUpdate
          image: nvcr.io/ohlfw0olaadg/ea-participants/rag-playground:24.03
          replicas: 1
          nodeSelector: {}
          tolerations: {}
          affinity: {}
          env:
          - name: CHAIN_SERVER_PORT
            value: "8081"
          - name: CHAIN_SERVER
            value: http://chain-server
          service:
            type: NodePort
            targetPort: 3001
            ports:
            - port: 3001
              targetPort: http
              protocol: TCP
              name: http
        pgvector:
          enabled: true
          image: pgvector/pgvector:pg16
          replicas: 1
          resources: {}
          nodeSelector: {}
          tolerations: {}
          affinity: {}
          env:
          - name: POSTGRES_DB
            value: "api"
          - name: POSTGRES_PASSWORD
            value: "password"
          - name: POSTGRES_USER
            value: "postgres"
          - name: PGDATA
            value: /var/lib/postgresql/data/pgdata
          service:
            type: ClusterIP
            targetPort: 5432
            ports:
            - port: 5432
              targetPort: http
              protocol: TCP
              name: http
          # persist data to a persistent volume
          persistence:
            enabled: true
            existingClaim: "pgvector-pvc"
            # Persistent Volume Storage Class
            # If defined, storageClassName: <storageClass>
            # If set to "-", storageClassName: "", which disables dynamic provisioning.
            # If undefined (the default) or set to null, no storageClassName spec is
            # set, choosing the default provisioner.
            storageClass: ""
            accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
            size: 50Gi # size of claim in bytes (e.g. 8Gi)
            annotations: {}
          # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
          updateStrategy:
            type: RollingUpdate
          hostPath:
            enabled: false
            path: /pgvector-data-store # Only required if hostPath is enabled -- path to the pgvector data-store
```