About Helm Pipelines

A Helm pipeline is a Kubernetes custom resource definition, helmpipelines.package.nvidia.com, developed by NVIDIA to manage multiple Helm charts as a single resource. The Enterprise RAG LLM Operator watches for instances of the Helm pipeline custom resource and deploys and manages the lifecycle of the software that each instance describes.
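
For orientation, the following abridged sketch shows the overall shape of a Helm pipeline custom resource. It is condensed from the full sample later on this page; the angle-bracket values are placeholders, and each entry in spec.pipeline references one Helm chart.

apiVersion: package.nvidia.com/v1alpha1
kind: HelmPipeline
metadata:
  name: my-sample-pipeline
spec:
  pipeline:
  - repoEntry:
      name: <chart-name>                  # Helm repository entry for the chart
      url: "file:///helm-charts/pipeline"
    chartSpec:
      chart: "<chart-name>"               # chart to install from that repository entry
      wait: false
    chartValues:                          # values passed to the chart, as in values.yaml
      fullnameOverride: "<release-name>"
  # ... additional entries, one per Helm chart that the pipeline manages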

The primary purpose of the Operator and the Helm pipeline CRs is to manage RAG applications. RAG applications deployed by the Operator include the following software components:

  • NVIDIA NeMo Inference Microservice

    The microservice provides GPU-accelerated access to state-of-the-art large language models (LLMs).

  • NVIDIA NeMo Retriever Embedding Microservice

    The microservice provides GPU-accelerated access to state-of-the-art text embedding.

The Operator and the Helm pipeline CR also support managing the following optional software components:

  • RAG Playground

    The application provides a user interface for entering queries that are answered by the inference microservice. The application also supports uploading documents that the embedding microservice processes and stores as embeddings in a vector database.

  • Chain Server

NVIDIA developed a chain server that communicates with the inference microservice. The server also supports retrieving embeddings from the vector database before submitting a query to the inference microservice to perform retrieval-augmented generation.

  • Vector Database

    The Chain Server supports connecting to either Milvus or pgvector. NVIDIA provides a sample RAG pipeline that deploys pgvector to simplify demonstrating the inference and embedding microservices.

The following figure shows a high-level overview of the software components and the communication between the components.

[Figure: pipeline-components.png — software components and the communication paths between them]
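
The communication paths in the figure correspond to environment variables in the sample pipeline shown later on this page. The following excerpt, condensed from the rag-llm-app chartValues in that sample, shows how the Chain Server (query) is pointed at the inference, embedding, and vector database services, and how the RAG Playground (frontend) is pointed at the Chain Server.

query:                                                 # Chain Server
  env:
    APP_LLM_SERVERURL: "nemollm-inference:8005"        # inference microservice (OpenAI-compatible port)
    APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8080" # embedding microservice
    APP_VECTORSTORE_URL: "pgvector:5432"               # vector database
frontend:                                              # RAG Playground
  env:
  - name: CHAIN_SERVER
    value: http://chain-server
  - name: CHAIN_SERVER_PORT
    value: "8081"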

The primary benefit of a Helm pipeline is that the Operator manages the lifecycle of the software components, such as upgrading a microservice container image and model by editing the custom resource and applying the manifest.
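
For example, assuming the sample pipeline shown later on this page, upgrading the inference microservice amounts to editing the image fields in the custom resource and re-applying the manifest; the Operator reconciles the change. The tag value below is the one from the sample and is illustrative only.

# In the nemollm-inference entry of the Helm pipeline custom resource:
chartValues:
  image:
    repository: nvcr.io/ohlfw0olaadg/ea-participants/nim_llm
    pullPolicy: IfNotPresent
    tag: 24.02-day0   # change this tag, then re-apply the manifest to upgrade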

The primary limitation of a Helm pipeline is its limited flexibility. The custom resource must deploy an instance of the inference microservice and an instance of the embedding microservice. Using remotely hosted inference or embedding services is not supported.

If you deploy the optional software components, RAG Playground and Chain Server, the custom resource must deploy both. The Operator does not support deploying only one of those two components. The vector database is optional and can be disabled individually.
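
As an illustration, the following excerpt shows one way to disable the sample pgvector deployment in the rag-llm-app chartValues and point the Chain Server at a vector database that already runs in your environment. The enabled flag and the environment variables appear in the full sample below; the external host name is a placeholder.

pgvector:
  enabled: false                                   # do not deploy the sample pgvector database
query:
  env:
    APP_VECTORSTORE_NAME: "pgvector"               # or "milvus"
    APP_VECTORSTORE_URL: "db.example.com:5432"     # placeholder: address of an existing database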

The following YAML file shows a sample Helm pipeline that deploys NVIDIA NIM for LLMs with the vLLM backend, the NVIDIA NeMo Retriever Embedding microservice, RAG Playground, Chain Server, and pgvector.

The chartValues field in each spec.pipeline entry corresponds to the values.yaml file for that entry's Helm chart.

apiVersion: package.nvidia.com/v1alpha1
kind: HelmPipeline
metadata:
  labels:
    app.kubernetes.io/name: helmpipeline
    app.kubernetes.io/instance: helmpipeline-sample
    app.kubernetes.io/part-of: k8s-rag-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: k8s-rag-operator
  name: my-sample-pipeline
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-inference"
    chartSpec:
      chart: "nemollm-inference"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-inference"
      backend: "vllm"
      model:
        name: Llama-2-13b-chat-hf # LLM model name
        config: /model-store/model_config.yaml
        numGpus: 1
        # num_workers: 1
      vllm_config:
        engine:
          model: /model-store
          enforce_eager: false
          max_context_len_to_capture: 8192
          max_num_seqs: 256
          dtype: float16
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
      resources:
        limits:
          nvidia.com/gpu: 1 # Number of GPUs to present to the running service
      image:
        repository: nvcr.io/ohlfw0olaadg/ea-participants/nim_llm
        pullPolicy: IfNotPresent
        tag: 24.02-day0
      imagePullSecret:
        # Leave blank, if no imagePullSecret is needed.
        name: "ngc-secret"
      # persist model to a PVC
      persistence:
        enabled: true
        existingClaim: "nemollm-inference-pvc"
        # Persistent Volume Storage Class
        # If defined, storageClassName: <storageClass>
        # If set to "-", storageClassName: "", which disables dynamic provisioning.
        # If undefined (the default) or set to null, no storageClassName spec is
        # set, choosing the default provisioner.
        storageClass: ""
        accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
        size: 50Gi # size of claim in bytes (e.g. 8Gi)
        annotations: {}
        # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
        updateStrategy:
          type: RollingUpdate
      # persist model to a host path
      hostPath:
        enabled: false
        path: /model-store-inference # Only required if hostPath is enabled -- path to the model-store-inference
      # model init containers, select only one - if needed.
      initContainers:
        ngcInit: [] # disabled by default
        hfInit:
          imageName: bitnami/git
          imageTag: latest
          secret:
            # name of kube secret for hf with keys named HF_USER and HF_PAT
            name: hf-secret
          env:
            STORE_MOUNT_PATH: /model-store
            HF_MODEL_NAME: Llama-2-13b-chat-hf # HF model name
            HF_MODEL_ORG: meta-llama # HF org where model lives
            USE_SHALLOW_LFS_CLONE: 0 # Disable shallow LFS clone by default
  - repoEntry:
      name: nemollm-embedding
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-embedding"
    chartSpec:
      chart: "nemollm-embedding"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-embedding"
      image:
        repository: nvcr.io/ohlfw0olaadg/ea-participants/nemo-retriever-embedding-microservice
        pullPolicy: IfNotPresent
        # Tag overrides the image tag whose default is the chart appVersion.
        tag: 24.02
      imagePullSecret:
        # Leave blank, if no imagePullSecret is needed.
        name: "ngc-secret"
      nodeSelector: {}
      resources:
        limits:
          nvidia.com/gpu: 1 # Number of GPUs to present to the running service
      # persist model to a PVC
      persistence:
        enabled: true
        existingClaim: "nemollm-embedding-pvc"
        # Persistent Volume Storage Class
        # If defined, storageClassName: <storageClass>
        # If set to "-", storageClassName: "", which disables dynamic provisioning.
        # If undefined (the default) or set to null, no storageClassName spec is
        # set, choosing the default provisioner.
        storageClass: ""
        accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
        size: 50Gi # size of claim in bytes (e.g. 8Gi)
        annotations: {}
        # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
        updateStrategy:
          type: RollingUpdate
      # persist model to a host path
      hostPath:
        enabled: false
        path: /model-store-embedding # Only required if hostPath is enabled -- path to the model-store-embedding
      # model init containers, select only one - if needed.
      initContainers:
        ngcInit: # disabled by default
          imageName: nvcr.io/ohlfw0olaadg/ea-participants/ngc-cli # should either have ngc cli pre-installed or wget + unzip pre-installed -- must not be musl-based (alpine)
          imageTag: v3.41.2
          secret:
            # name of kube secret for ngc keys named NGC_CLI_API_KEY (required) and NGC_DECRYPT_KEY (optional)
            name: ngc-api-secret
          env:
            STORE_MOUNT_PATH: /model-store
            NGC_CLI_ORG: ohlfw0olaadg # ngc org where model lives
            NGC_CLI_TEAM: ea-participants # ngc team where model lives
            NGC_MODEL_ID: NV-Embed-QA # model ID for config file template used for TRT conversion
            NGC_MODEL_NAME: nv-embed-qa # model name in ngc
            NGC_MODEL_VERSION: "4" # model version in ngc
            NGC_EXE: ngc # path to ngc cli, if pre-installed in container
            DOWNLOAD_NGC_CLI: "false" # set to string 'true' if container should download and install ngc cli
            NGC_CLI_VERSION: "3.41.1" # version of ngc cli to download (only matters if downloading)
            MODEL_NAME: NV-Embed-QA-4.nemo # actual model name, once downloaded
        extraInit: [] # Add any additional init containers your use case requires.
        # - # full init container definition here
  - repoEntry:
      name: rag-llm-app
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/rag-llm-app"
    chartSpec:
      chart: "rag-llm-app"
      wait: false
    chartValues:
      query:
        # Deployment update strategy. Accepted values: RollingUpdate, Recreate
        deployStrategy:
          type: RollingUpdate
        image: nvcr.io/ohlfw0olaadg/ea-participants/rag-application-text-chatbot:24.03
        replicas: 1
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
          APP_VECTORSTORE_URL: "pgvector:5432"
          APP_VECTORSTORE_NAME: "pgvector"
          APP_LLM_SERVERURL: "nemollm-inference:8005" # openai port of inference service
          APP_LLM_MODELNAME: Llama-2-13b-chat-hf # HF model name
          APP_LLM_MODELENGINE: nvidia-ai-endpoints-nim
          APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8080"
          APP_EMBEDDINGS_MODELNAME: NV-Embed-QA
          APP_EMBEDDINGS_MODELENGINE: nemo-embed
          POSTGRES_PASSWORD: password
          POSTGRES_USER: postgres
          POSTGRES_DB: api
          COLLECTION_NAME: canonical_rag
          NVIDIA_API_KEY: ""
        service:
          type: ClusterIP
          targetPort: 8081
          ports:
          - port: 8081
            targetPort: http
            protocol: TCP
            name: http
      frontend:
        # Deployment update strategy. Accepted values: RollingUpdate, Recreate
        deployStrategy:
          type: RollingUpdate
        image: nvcr.io/ohlfw0olaadg/ea-participants/rag-playground:24.03
        replicas: 1
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
        - name: CHAIN_SERVER_PORT
          value: "8081"
        - name: CHAIN_SERVER
          value: http://chain-server
        service:
          type: NodePort
          targetPort: 3001
          ports:
          - port: 3001
            targetPort: http
            protocol: TCP
            name: http
      pgvector:
        enabled: true
        image: pgvector/pgvector:pg16
        replicas: 1
        resources: {}
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
        - name: POSTGRES_DB
          value: "api"
        - name: POSTGRES_PASSWORD
          value: "password"
        - name: POSTGRES_USER
          value: "postgres"
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        service:
          type: ClusterIP
          targetPort: 5432
          ports:
          - port: 5432
            targetPort: http
            protocol: TCP
            name: http
        # persist data to a persistent volume
        persistence:
          enabled: true
          existingClaim: "pgvector-pvc"
          # Persistent Volume Storage Class
          # If defined, storageClassName: <storageClass>
          # If set to "-", storageClassName: "", which disables dynamic provisioning.
          # If undefined (the default) or set to null, no storageClassName spec is
          # set, choosing the default provisioner.
          storageClass: ""
          accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
          size: 50Gi # size of claim in bytes (e.g. 8Gi)
          annotations: {}
          # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
          updateStrategy:
            type: RollingUpdate
        hostPath:
          enabled: false
          path: /pgvector-data-store # Only required if hostPath is enabled -- path to the pgvector data-store
