About Helm Pipelines
A Helm pipeline is a Kubernetes custom resource definition, `helmpipelines.package.nvidia.com`, developed by NVIDIA to manage multiple Helm charts as a single resource.
The Enterprise RAG LLM Operator watches for Helm pipeline instances to deploy and manage the lifecycle of the software.
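If the Operator is installed, the custom resource definition is registered with the cluster and can be confirmed with `kubectl`. The following commands are a minimal check; they assume only the CRD name given above and a working `kubectl` context:

```bash
# Confirm that the Operator installed the custom resource definition.
kubectl get crd helmpipelines.package.nvidia.com

# List any Helm pipeline instances in all namespaces.
kubectl get helmpipelines --all-namespaces
```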
The primary purpose of the Operator and the Helm pipeline CRs is to manage RAG applications. RAG applications deployed by the Operator include the following software components:
NVIDIA NeMo Inference Microservice
The microservice provides GPU-accelerated access to state-of-the-art large language models (LLMs).
NVIDIA NeMo Retriever Embedding Microservice
The microservice provides GPU-accelerated access to state-of-the-art text embedding models.
The Operator and the Helm pipeline CR also support managing the following optional software components:
RAG Playground
The application provides a user interface for entering queries that are answered by the inference microservice. The application also supports uploading documents that the embedding microservice processes and stores as embeddings in a vector database.
Chain Server
NVIDIA developed a chain server that communicates with the inference microservice. The server also supports retrieving embeddings from the vector database before submitting a query to the inference microservice to perform retrieval-augmented generation.
Vector Database
The Chain Server supports connecting to either Milvus or pgvector. NVIDIA provides a sample RAG pipeline that deploys pgvector to simplify demonstrating the inference and embedding microservices.
The following figure shows a high-level overview of the software components and the communication between the components.
![pipeline-components.png](https://docscontent.nvidia.com/dims4/default/04722c6/2147483647/strip/true/crop/830x426+0+0/resize/830x426!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018e-6130-d04c-a7fe-7f7a4b260000%2Fai-enterprise%2Frag-llm-operator%2F0.4.1%2F_images%2Fpipeline-components.png)
The primary benefit of a Helm pipeline is that the Operator manages the lifecycle of the software components, such as upgrading a microservice container image and model by editing the custom resource and applying the manifest.
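For example, assuming the sample pipeline shown later on this page is deployed as `my-sample-pipeline` in a namespace named `rag-sample` (the namespace and file name below are illustrative, not fixed by the Operator), an upgrade can be rolled out by editing the custom resource or by reapplying an updated manifest:

```bash
# Edit the custom resource in place, for example to change an image tag or
# model version under chartValues; the Operator reconciles the change.
kubectl edit helmpipeline my-sample-pipeline -n rag-sample

# Or update a local copy of the manifest and reapply it.
kubectl apply -f rag-pipeline.yaml -n rag-sample
```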
The primary limitation of a Helm pipeline is its limited flexibility. The custom resource must deploy an instance of the inference microservice and an instance of the embedding microservice. Using remotely hosted inference or embedding services is not supported.
If you deploy the optional software components, RAG Playground and Chain Server, the custom resource must deploy both. The Operator does not support deploying only one of those two components. The vector database is optional and can be disabled individually.
The following YAML file shows a sample Helm pipeline that deploys the NeMo microservices, RAG Playground, Chain Server, and pgvector.
The `spec.pipeline.repoEntry.chartValues` field corresponds to a `values.yaml` file for a Helm chart.
apiVersion: package.nvidia.com/v1alpha1
kind: HelmPipeline
metadata:
  labels:
    app.kubernetes.io/name: helmpipeline
    app.kubernetes.io/instance: helmpipeline-sample
    app.kubernetes.io/part-of: k8s-rag-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: k8s-rag-operator
  name: my-sample-pipeline
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-inference"
    chartSpec:
      chart: "nemollm-inference"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-inference"
      model:
        name: llama-2-13b-chat
        numGpus: 1
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
      resources:
        limits:
          nvidia.com/gpu: 1 # Number of GPUs to present to the running service
      imagePullSecret:
        # Leave blank, if no imagePullSecret is needed.
        registry: "nvcr.io"
        name: "ngc-secret"
        # If set to false, the chart expects either an imagePullSecret
        # with the name configured above to be present on the cluster or that no
        # credentials are needed.
        create: true
        username: '$oauthtoken'
        password: ""
      # persist model to a PVC
      persistence:
        enabled: true
        existingClaim: "nemollm-inference-pvc"
        # Persistent Volume Storage Class
        # If defined, storageClassName: <storageClass>
        # If set to "-", storageClassName: "", which disables dynamic provisioning.
        # If undefined (the default) or set to null, no storageClassName spec is
        # set, choosing the default provisioner.
        storageClass: ""
        accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
        size: 50Gi # size of claim in bytes (e.g. 8Gi)
        annotations: {}
      # persist model to a host path
      hostPath:
        enabled: false
        path: /model-store-inference # Only required if hostPath is enabled -- path to the model-store-inference
      # model init containers, select only one - if needed.
      initContainers:
        ngcInit: # disabled by default
          imageName: nvcr.io/ohlfw0olaadg/ea-rag-examples/ngc-cli # should either have ngc cli pre-installed or wget + unzip pre-installed -- must not be musl-based (alpine)
          imageTag: v3.37.1
          secret: # name of kube secret for ngc keys named NGC_CLI_API_KEY (required) and NGC_DECRYPT_KEY (optional)
            name: ngc-api-secret
            create: true
            apiKey: "" # NGC_CLI_API_KEY
            decryptKey: "" # NGC_DECRYPT_KEY
          env:
            STORE_MOUNT_PATH: /model-store
            NGC_CLI_ORG: ohlfw0olaadg # ngc org where model lives
            NGC_CLI_TEAM: ea-participants # ngc team where model lives
            NGC_MODEL_NAME: llama-2-13b-chat # model name in ngc
            NGC_MODEL_VERSION: LLAMA-2-13B-CHAT-4K-FP16-1-A100.24.01 # model version in ngc
            NGC_EXE: ngc # path to ngc cli, if pre-installed in container
            DOWNLOAD_NGC_CLI: "false" # set to string 'true' if container should download and install ngc cli
            NGC_CLI_VERSION: "3.37.1" # version of ngc cli to download (only matters if downloading)
            TARFILE: "true" # tells the script to untar the model. defaults to "true" as LLM models are archived in NGC.
            MODEL_NAME: LLAMA-2-13B-CHAT-4K-FP16-1-A100.24.01 # actual model name, once downloaded
        extraInit: [] # Add any additional init containers your use case requires.
          # - # full init container definition here
  - repoEntry:
      name: nemollm-embedding
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-embedding"
    chartSpec:
      chart: "nemollm-embedding"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-embedding"
      image:
        repository: nvcr.io/ohlfw0olaadg/ea-participants/nemo-retriever-embedding-microservice
        pullPolicy: IfNotPresent
        # Tag overrides the image tag whose default is the chart appVersion.
        tag: "24.01"
      imagePullSecret:
        # Leave blank, if no imagePullSecret is needed.
        registry: "nvcr.io"
        name: "ngc-secret"
        # If set to false, the chart expects either an imagePullSecret
        # with the name configured above to be present on the cluster or that no
        # credentials are needed.
        create: false
        username: '$oauthtoken'
        password: ""
      resources:
        limits:
          nvidia.com/gpu: 1 # Number of GPUs to present to the running service
      # persist model to a PVC
      persistence:
        enabled: true
        existingClaim: "nemollm-embedding-pvc"
        # Persistent Volume Storage Class
        # If defined, storageClassName: <storageClass>
        # If set to "-", storageClassName: "", which disables dynamic provisioning.
        # If undefined (the default) or set to null, no storageClassName spec is
        # set, choosing the default provisioner.
        storageClass: ""
        accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
        size: 50Gi # size of claim in bytes (e.g. 8Gi)
        annotations: {}
      # persist model to a host path
      hostPath:
        enabled: false
        path: /model-store-embedding # Only required if hostPath is enabled -- path to the model-store-embedding
      # model init containers, select only one - if needed.
      initContainers:
        ngcInit: # disabled by default
          imageName: nvcr.io/ohlfw0olaadg/ea-rag-examples/ngc-cli # should either have ngc cli pre-installed or wget + unzip pre-installed -- must not be musl-based (alpine)
          imageTag: v3.37.1
          secret: # name of kube secret for ngc keys named NGC_CLI_API_KEY (required) and NGC_DECRYPT_KEY (optional)
            name: ngc-api-secret
            create: false
            apiKey: "" # NGC_CLI_API_KEY
            decryptKey: "" # NGC_DECRYPT_KEY
          env:
            STORE_MOUNT_PATH: /model-store
            NGC_CLI_ORG: ohlfw0olaadg # ngc org where model lives
            NGC_CLI_TEAM: ea-participants # ngc team where model lives
            NGC_MODEL_ID: NV-Embed-QA # model ID for config file template used for TRT conversion
            NGC_MODEL_NAME: nv-embed-qa # model name in ngc
            NGC_MODEL_VERSION: "003" # model version in ngc
            NGC_EXE: ngc # path to ngc cli, if pre-installed in container
            DOWNLOAD_NGC_CLI: "false" # set to string 'true' if container should download and install ngc cli
            NGC_CLI_VERSION: "3.37.1" # version of ngc cli to download (only matters if downloading)
            TARFILE: "false" # tells the script to untar the model. defaults to "false" as embedding models are not archived in NGC.
            MODEL_NAME: NV-Embed-QA-003.nemo # actual model name, once downloaded
        extraInit: [] # Add any additional init containers your use case requires.
          # - # full init container definition here
  - repoEntry:
      name: rag-llm-app
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/rag-llm-app"
    chartSpec:
      chart: "rag-llm-app"
      wait: false
    chartValues:
      query:
        image: nvcr.io/ohlfw0olaadg/ea-rag-examples/rag-application-text-chatbot:0.4.0
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 1 # Number of GPUs to present to the running service
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
          APP_VECTORSTORE_URL: "pgvector:5432"
          APP_VECTORSTORE_NAME: "pgvector"
          APP_LLM_SERVERURL: "nemollm-inference:8005" # openai port of inference service
          APP_LLM_MODELNAME: llama-2-13b-chat
          APP_LLM_MODELENGINE: nemo-infer
          APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8080"
          APP_EMBEDDINGS_MODELNAME: NV-Embed-QA
          APP_EMBEDDINGS_MODELENGINE: nemo-embed
          APP_CONFIG_FILE: /dev/null
          NVAPI_KEY: ""
          POSTGRES_PASSWORD: password
          POSTGRES_USER: postgres
          POSTGRES_DB: api
          COLLECTION_NAME: canonical-rag
        service:
          type: ClusterIP
          targetPort: 8081
          ports:
          - port: 8081
            targetPort: http
            protocol: TCP
            name: http
      frontend:
        image: nvcr.io/ohlfw0olaadg/ea-rag-examples/rag-playground:0.4.0
        replicas: 1
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
        - name: APP_MODELNAME
          value: "llama-2-13b-chat"
        - name: APP_SERVERPORT
          value: "8081"
        - name: APP_SERVERURL
          value: http://query-router
        - name: RIVA_API_URI
          value: ""
        - name: RIVA_API_KEY
          value: ""
        - name: RIVA_FUNCTION_ID
          value: ""
        - name: TTS_SAMPLE_RATE
          value: 48000
        service:
          type: NodePort
          targetPort: 8090
          ports:
          - port: 8090
            targetPort: http
            protocol: TCP
            name: http
      pgvector:
        image: ankane/pgvector:v0.5.1
        replicas: 1
        resources: {}
        nodeSelector: {}
        tolerations: {}
        affinity: {}
        env:
        - name: POSTGRES_DB
          value: "api"
        - name: POSTGRES_PASSWORD
          value: "password"
        - name: POSTGRES_USER
          value: "postgres"
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        service:
          type: ClusterIP
          targetPort: 5432
          ports:
          - port: 5432
            targetPort: http
            protocol: TCP
            name: http
        # persist data to a persistent volume
        persistence:
          enabled: true
          existingClaim: "pgvector-pvc"
          # Persistent Volume Storage Class
          # If defined, storageClassName: <storageClass>
          # If set to "-", storageClassName: "", which disables dynamic provisioning.
          # If undefined (the default) or set to null, no storageClassName spec is
          # set, choosing the default provisioner.
          storageClass: ""
          accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
          size: 50Gi # size of claim in bytes (e.g. 8Gi)
          annotations: {}
        hostPath:
          enabled: false
          path: /pgvector-data-store # Only required if hostPath is enabled -- path to the pgvector data-store
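As a usage sketch, assuming the preceding YAML is saved as `pipeline.yaml` and deployed to a namespace named `rag-sample` (both names are examples; the namespace must also contain any image pull secrets, NGC API secrets, and persistent volume claims that the values reference), the pipeline can be created and monitored as follows:

```bash
# Create or update the Helm pipeline; the Operator reconciles it into Helm releases.
kubectl apply -f pipeline.yaml -n rag-sample

# Watch the pipeline resource and the pods that it deploys.
kubectl get helmpipeline my-sample-pipeline -n rag-sample
kubectl get pods -n rag-sample
```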