A Helm pipeline is a Kubernetes custom resource definition, `helmpipelines.package.nvidia.com`, developed by NVIDIA to manage multiple Helm charts as a single resource. The Enterprise RAG LLM Operator watches for instances of the Helm pipeline custom resource and deploys and manages the lifecycle of the software.
The primary purpose of the Operator and the Helm pipeline CRs is to manage RAG applications. RAG applications deployed by the Operator include the following software components:
NVIDIA NeMo Inference Microservice
The microservice provides GPU-accelerated access to state-of-the-art large language models (LLMs).
NVIDIA NeMo Retriever Embedding Microservice
The microservice provides GPU-accelerated access to state-of-the-art text embedding models.
The Operator and the Helm pipeline CR also support managing the following optional software components:
RAG Playground
The application provides a user interface for entering queries that are answered by the inference microservice. The application also supports uploading documents that the embedding microservice processes and stores as embeddings in a vector database.
Chain Server
NVIDIA developed a chain server that communicates with the inference microservice. The server can also retrieve embeddings from the vector database before submitting a query to the inference microservice to perform retrieval-augmented generation.
Vector Database
The Chain Server supports connecting to either Milvus or pgvector. NVIDIA provides a sample RAG pipeline that deploys pgvector to simplify demonstrating the inference and embedding microservices.
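The retrieve-then-generate flow that the Chain Server performs can be sketched in plain Python. This is a toy illustration only, not the Chain Server's actual code: the keyword-count `embed` function, the in-memory `STORE`, and the prompt-building `answer` function are hypothetical stand-ins for the embedding microservice, the vector database, and the request to the inference microservice.

```python
import math

# Toy "embedding": one dimension per vocabulary word (stand-in for a real
# embedding model such as the one served by the embedding microservice).
VOCAB = ["gpu", "helm", "operator", "embedding"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# In-memory document store (stand-in for Milvus or pgvector).
DOCS = [
    "The operator watches helm pipeline resources.",
    "Each gpu runs one model replica.",
]
STORE = [(doc, embed(doc)) for doc in DOCS]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank stored documents by similarity to the query embedding."""
    qv = embed(query)
    ranked = sorted(STORE, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def answer(query: str) -> str:
    """Build an augmented prompt; a real chain server would POST this
    prompt to the inference microservice and return the completion."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(answer("What does the helm operator watch?"))
```

The essential point is the ordering: the query is embedded and matched against stored document embeddings first, and only the retrieved context plus the question is sent to the LLM.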
The following figure shows a high-level overview of the software components and the communication between the components.
![pipeline-components.png](https://docscontent.nvidia.com/dims4/default/c716c76/2147483647/strip/true/crop/661x314+0+0/resize/661x314!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F00000190-5b41-d6a7-ad9c-7fe95cea0000%2Fai-enterprise%2Frag-llm-operator%2F24.3.0%2F_images%2Fpipeline-components.png)
The primary benefit of a Helm pipeline is that the Operator manages the lifecycle of the software components: for example, you can upgrade a microservice container image or model by editing the custom resource and applying the updated manifest.
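As an illustration, an image upgrade can be as small as changing the image tag in the `chartValues` of the relevant pipeline entry and re-applying the manifest. The fragment below is trimmed for clarity; the tag shown is the one from the sample pipeline, and a newer tag would replace it:

```yaml
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      chartValues:
        image:
          repository: nvcr.io/ohlfw0olaadg/ea-participants/nim_llm
          tag: 24.02-day0  # change this tag, then re-apply the manifest to upgrade
```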
The primary limitation of a Helm pipeline is its limited flexibility. The custom resource must deploy an instance of the inference microservice and an instance of the embedding microservice. Using remotely hosted inference or embedding services is not supported.
If you deploy the optional software components, RAG Playground and Chain Server, the custom resource must deploy both. The Operator does not support deploying only one of those two components. The vector database is optional and can be disabled individually.
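For example, the sample vector database can be turned off by setting `enabled: false` on the pgvector values of the `rag-llm-app` pipeline entry; you would then point the Chain Server at a Milvus or pgvector instance that you manage yourself. A trimmed, illustrative fragment:

```yaml
- repoEntry:
    name: rag-llm-app
    chartValues:
      pgvector:
        enabled: false
```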
The following YAML file shows a sample Helm pipeline that deploys NVIDIA NIM for LLMs with the vLLM backend, the NVIDIA NeMo Retriever Embedding microservice, RAG Playground, Chain Server, and pgvector. The `spec.pipeline.repoEntry.chartValues` field corresponds to a `values.yaml` file for a Helm chart.
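The correspondence between `chartValues` and a chart's `values.yaml` can be made concrete with a few lines of Python: extracting `chartValues` from a pipeline entry yields exactly the mapping you would otherwise place in that chart's values file. The dict below is a hypothetical, trimmed-down excerpt of the custom resource, and `values_for` is an illustrative helper, not part of the Operator.

```python
# Trimmed-down HelmPipeline manifest as a plain dict (hypothetical excerpt).
pipeline_cr = {
    "apiVersion": "package.nvidia.com/v1alpha1",
    "kind": "HelmPipeline",
    "spec": {
        "pipeline": [
            {
                "repoEntry": {
                    "name": "nemollm-inference",
                    "url": "file:///helm-charts/pipeline",
                    "chartSpec": {"chart": "nemollm-inference", "wait": False},
                    "chartValues": {
                        "fullnameOverride": "nemollm-inference",
                        "backend": "vllm",
                    },
                },
            }
        ]
    },
}

def values_for(cr: dict, chart: str) -> dict:
    """Return the chartValues of one pipeline entry -- the equivalent
    of that chart's values.yaml contents."""
    for entry in cr["spec"]["pipeline"]:
        repo = entry["repoEntry"]
        if repo["chartSpec"]["chart"] == chart:
            return repo["chartValues"]
    raise KeyError(chart)

print(values_for(pipeline_cr, "nemollm-inference"))
```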
```yaml
apiVersion: package.nvidia.com/v1alpha1
kind: HelmPipeline
metadata:
  labels:
    app.kubernetes.io/name: helmpipeline
    app.kubernetes.io/instance: helmpipeline-sample
    app.kubernetes.io/part-of: k8s-rag-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: k8s-rag-operator
  name: my-sample-pipeline
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-inference"
      chartSpec:
        chart: "nemollm-inference"
        wait: false
      chartValues:
        fullnameOverride: "nemollm-inference"
        backend: "vllm"
        model:
          name: Llama-2-13b-chat-hf # LLM model name
          config: /model-store/model_config.yaml
          numGpus: 1
          # num_workers: 1
        vllm_config:
          engine:
            model: /model-store
            enforce_eager: false
            max_context_len_to_capture: 8192
            max_num_seqs: 256
            dtype: float16
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
        resources:
          limits:
            nvidia.com/gpu: 1 # Number of GPUs to present to the running service
        image:
          repository: nvcr.io/ohlfw0olaadg/ea-participants/nim_llm
          pullPolicy: IfNotPresent
          tag: 24.02-day0
        imagePullSecret:
          # Leave blank, if no imagePullSecret is needed.
          name: "ngc-secret"
        # persist model to a PVC
        persistence:
          enabled: true
          existingClaim: "nemollm-inference-pvc"
          # Persistent Volume Storage Class
          # If defined, storageClassName: <storageClass>
          # If set to "-", storageClassName: "", which disables dynamic provisioning.
          # If undefined (the default) or set to null, no storageClassName spec is
          # set, choosing the default provisioner.
          storageClass: ""
          accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
          size: 50Gi # size of claim in bytes (e.g. 8Gi)
          annotations: {}
        # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
        updateStrategy:
          type: RollingUpdate
        # persist model to a host path
        hostPath:
          enabled: false
          path: /model-store-inference # Only required if hostPath is enabled -- path to the model-store-inference
        # model init containers, select only one - if needed.
        initContainers:
          ngcInit: [] # disabled by default
          hfInit:
            imageName: bitnami/git
            imageTag: latest
            secret: # name of kube secret for hf with keys named HF_USER and HF_PAT
              name: hf-secret
            env:
              STORE_MOUNT_PATH: /model-store
              HF_MODEL_NAME: Llama-2-13b-chat-hf # HF model name
              HF_MODEL_ORG: meta-llama # HF org where model lives
              USE_SHALLOW_LFS_CLONE: 0 # Disable shallow LFS clone by default
  - repoEntry:
      name: nemollm-embedding
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-embedding"
      chartSpec:
        chart: "nemollm-embedding"
        wait: false
      chartValues:
        fullnameOverride: "nemollm-embedding"
        image:
          repository: nvcr.io/ohlfw0olaadg/ea-participants/nemo-retriever-embedding-microservice
          pullPolicy: IfNotPresent
          # Tag overrides the image tag whose default is the chart appVersion.
          tag: 24.02
        imagePullSecret:
          # Leave blank, if no imagePullSecret is needed.
          name: "ngc-secret"
        nodeSelector: {}
        resources:
          limits:
            nvidia.com/gpu: 1 # Number of GPUs to present to the running service
        # persist model to a PVC
        persistence:
          enabled: true
          existingClaim: "nemollm-embedding-pvc"
          # Persistent Volume Storage Class
          # If defined, storageClassName: <storageClass>
          # If set to "-", storageClassName: "", which disables dynamic provisioning.
          # If undefined (the default) or set to null, no storageClassName spec is
          # set, choosing the default provisioner.
          storageClass: ""
          accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
          size: 50Gi # size of claim in bytes (e.g. 8Gi)
          annotations: {}
        # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
        updateStrategy:
          type: RollingUpdate
        # persist model to a host path
        hostPath:
          enabled: false
          path: /model-store-embedding # Only required if hostPath is enabled -- path to the model-store-embedding
        # model init containers, select only one - if needed.
        initContainers:
          ngcInit: # disabled by default
            imageName: nvcr.io/ohlfw0olaadg/ea-participants/ngc-cli # should either have ngc cli pre-installed or wget + unzip pre-installed -- must not be musl-based (alpine)
            imageTag: v3.41.2
            secret: # name of kube secret for ngc keys named NGC_CLI_API_KEY (required) and NGC_DECRYPT_KEY (optional)
              name: ngc-api-secret
            env:
              STORE_MOUNT_PATH: /model-store
              NGC_CLI_ORG: ohlfw0olaadg # ngc org where model lives
              NGC_CLI_TEAM: ea-participants # ngc team where model lives
              NGC_MODEL_ID: NV-Embed-QA # model ID for config file template used for TRT conversion
              NGC_MODEL_NAME: nv-embed-qa # model name in ngc
              NGC_MODEL_VERSION: "4" # model version in ngc
              NGC_EXE: ngc # path to ngc cli, if pre-installed in container
              DOWNLOAD_NGC_CLI: "false" # set to string 'true' if container should download and install ngc cli
              NGC_CLI_VERSION: "3.41.1" # version of ngc cli to download (only matters if downloading)
              MODEL_NAME: NV-Embed-QA-4.nemo # actual model name, once downloaded
          extraInit: [] # Add any additional init containers your use case requires.
          # - # full init container definition here
  - repoEntry:
      name: rag-llm-app
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/rag-llm-app"
      chartSpec:
        chart: "rag-llm-app"
        wait: false
      chartValues:
        query:
          # Deployment update strategy. Accepted values: RollingUpdate, Recreate
          deployStrategy:
            type: RollingUpdate
          image: nvcr.io/ohlfw0olaadg/ea-participants/rag-application-text-chatbot:24.03
          replicas: 1
          nodeSelector: {}
          tolerations: {}
          affinity: {}
          env:
            APP_VECTORSTORE_URL: "pgvector:5432"
            APP_VECTORSTORE_NAME: "pgvector"
            APP_LLM_SERVERURL: "nemollm-inference:8005" # openai port of inference service
            APP_LLM_MODELNAME: Llama-2-13b-chat-hf # HF model name
            APP_LLM_MODELENGINE: nvidia-ai-endpoints-nim
            APP_EMBEDDINGS_SERVERURL: "nemollm-embedding:8080"
            APP_EMBEDDINGS_MODELNAME: NV-Embed-QA
            APP_EMBEDDINGS_MODELENGINE: nemo-embed
            POSTGRES_PASSWORD: password
            POSTGRES_USER: postgres
            POSTGRES_DB: api
            COLLECTION_NAME: canonical_rag
            NVIDIA_API_KEY: ""
          service:
            type: ClusterIP
            targetPort: 8081
            ports:
            - port: 8081
              targetPort: http
              protocol: TCP
              name: http
        frontend:
          # Deployment update strategy. Accepted values: RollingUpdate, Recreate
          deployStrategy:
            type: RollingUpdate
          image: nvcr.io/ohlfw0olaadg/ea-participants/rag-playground:24.03
          replicas: 1
          nodeSelector: {}
          tolerations: {}
          affinity: {}
          env:
          - name: CHAIN_SERVER_PORT
            value: "8081"
          - name: CHAIN_SERVER
            value: http://chain-server
          service:
            type: NodePort
            targetPort: 3001
            ports:
            - port: 3001
              targetPort: http
              protocol: TCP
              name: http
        pgvector:
          enabled: true
          image: pgvector/pgvector:pg16
          replicas: 1
          resources: {}
          nodeSelector: {}
          tolerations: {}
          affinity: {}
          env:
          - name: POSTGRES_DB
            value: "api"
          - name: POSTGRES_PASSWORD
            value: "password"
          - name: POSTGRES_USER
            value: "postgres"
          - name: PGDATA
            value: /var/lib/postgresql/data/pgdata
          service:
            type: ClusterIP
            targetPort: 5432
            ports:
            - port: 5432
              targetPort: http
              protocol: TCP
              name: http
          # persist data to a persistent volume
          persistence:
            enabled: true
            existingClaim: "pgvector-pvc"
            # Persistent Volume Storage Class
            # If defined, storageClassName: <storageClass>
            # If set to "-", storageClassName: "", which disables dynamic provisioning.
            # If undefined (the default) or set to null, no storageClassName spec is
            # set, choosing the default provisioner.
            storageClass: ""
            accessMode: ReadWriteOnce # If using an NFS or similar setup, you can use ReadWriteMany
            size: 50Gi # size of claim in bytes (e.g. 8Gi)
            annotations: {}
          # StatefulSet Update Strategy. Accepted Values: RollingUpdate, OnDelete
          updateStrategy:
            type: RollingUpdate
          hostPath:
            enabled: false
            path: /pgvector-data-store # Only required if hostPath is enabled -- path to the pgvector data-store
```