Deploy the Blueprint#
Create Required Secrets#
To deploy the secrets required by the VSS Blueprint:
Note
If using microk8s, prepend the kubectl commands with sudo microk8s. For example, sudo microk8s kubectl ....
For information about joining the group for admin access (to avoid using sudo) and other microk8s setup/usage details, see https://microk8s.io/docs/getting-started.
If not using microk8s, you can use kubectl directly. For example, kubectl get pod.
# Create the NGC image pull secret
sudo microk8s kubectl create secret docker-registry ngc-docker-reg-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
# Create the neo4j db credentials secret
sudo microk8s kubectl create secret generic graph-db-creds-secret --from-literal=username=neo4j --from-literal=password=password
# Create NGC Secret
sudo microk8s kubectl create secret generic ngc-api-key-secret --from-literal=NGC_API_KEY=$NGC_API_KEY
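As a quick optional sanity check, you can confirm that the three secrets exist before continuing (standard kubectl; the secret names are the ones created above):
# Optional: confirm the secrets were created
sudo microk8s kubectl get secret ngc-docker-reg-secret graph-db-creds-secret ngc-api-key-secret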
Deploy the Helm Chart#
To deploy the VSS Blueprint Helm Chart:
# Fetch the VSS Blueprint Helm Chart
sudo microk8s helm fetch https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-vss-2.2.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY
# Install the Helm Chart
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.2.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret
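Optionally, confirm that the Helm release was created before waiting for the pods; the following is a standard helm command, and the release name matches the install command above:
# Optional: confirm the Helm release is installed
sudo microk8s helm list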
Note
When running on an L40 / L40S system, the default startup probe timeout might not be enough for the VILA model to be downloaded and its TRT-LLM engine to be built. To prevent a startup timeout, increase the startup timeout by passing --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360 to the helm install command, or by adding the following to the overrides file.
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          startupProbe:
            failureThreshold: 360
Note
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
By default, Kubernetes schedules the pods based on resource availability. To explicitly schedule particular pods on particular nodes, first list the node names and then pass nodeSelector overrides to the helm install command. The following example schedules the embedding and reranking services on the second node.
sudo microk8s kubectl get node
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.2.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set nemo-embedding.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>" \
--set nemo-rerank.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>"
Wait for all services to be up. This may take some time (5-15 minutes or more) depending on the setup. Make sure all pods are in Running or Completed STATUS and show 1/1 as READY. You can monitor the services using the following command:
sudo watch microk8s kubectl get pod
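As an alternative to watching manually, a kubectl wait sketch such as the following can block until the pods report Ready (an optional convenience; it assumes all pods in the default namespace belong to this deployment, and the field selector that skips Completed job pods requires a reasonably recent kubectl):
# Optional: block until all (non-Completed) pods are Ready, with a 30 minute timeout
sudo microk8s kubectl wait --for=condition=Ready pod --all --field-selector=status.phase!=Succeeded --timeout=1800s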
To make sure the VSS UI is ready and accessible, check the logs of the VSS deployment pod using the following command:
sudo microk8s kubectl logs vss-vss-deployment-POD-NAME
Make sure the following log lines are present and that no errors are reported:
Application startup complete.
Uvicorn running on http://0.0.0.0:9000
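Because the pod name includes a generated suffix, a small sketch like the following can look up the full name before tailing the logs (it assumes the pod name starts with vss-vss-deployment, as in the default topology):
# Look up the full VSS pod name and follow its logs
VSS_POD=$(sudo microk8s kubectl get pod -o name | grep vss-vss-deployment)
sudo microk8s kubectl logs -f "$VSS_POD"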
Default Deployment Topology and Models in Use#
The default deployment topology is as follows. This is the topology that you see when checking deployment status using sudo microk8s kubectl get pod.
Microservice/Pod | Description | Default #GPU Allocation
---|---|---
vss-blueprint-0 | The NIM LLM (llama-3.1). | 4
vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is VILA1.5) + Retrieval Pipeline (CA RAG/LLM: llama3.1 + nemo embedding + nemo reranking) | 2
nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1
nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1
etcd-etcd-deployment, milvus-milvus-deployment, neo4j-neo4j-deployment | Milvus and Neo4j databases used for vector and graph storage. | NA
Launch VSS UI#
Find the ports where the VSS REST API and UI server are running:
sudo microk8s kubectl get svc vss-service
vss-service NodePort <CLUSTER_IP> <none> 8000:32114/TCP,9000:32206/TCP 12m
The NodePorts corresponding to ports 8000 and 9000 are for the REST API (VSS_API_ENDPOINT) and the UI respectively.
The VSS UI is available at the NodePort corresponding to 9000. In this example, the VSS UI can be accessed by opening http://<NODE_IP>:32206 in a browser. The VSS REST API address is http://<NODE_IP>:32114.
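If you prefer to read the NodePorts programmatically instead of from the service listing, a jsonpath query such as the following should work (standard kubectl; ports 8000 and 9000 are the service ports shown above):
# Print the NodePort for the REST API (port 8000) and the UI (port 9000)
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==8000)].nodePort}'; echo
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==9000)].nodePort}'; echo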
Note
<NODE_IP> is the IP address of the machine where the Helm Chart is deployed, that is, the node running the vss-vss-deployment pod.
Next, test the deployment by summarizing a sample video. Continue to Configuration Options and VSS Customization when you are ready to customize VSS.
If you run into any errors, refer to the FAQ and Known Issues sections.
Configuration Options#
Some options in the Helm Chart can be configured using the Helm overrides file.
An example of the overrides.yaml file:
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          image:
            repository: nvcr.io/nvidia/blueprint/vss-engine
            tag: 2.2.0 # Update to override with custom VSS image
          env:
            - name: VLM_MODEL_TO_USE
              value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
            # Specify path in case of VILA-1.5 and custom model. Can be either
            # a NGC resource path or a local path. For custom models this
            # must be a path to the directory containing "inference.py" and
            # "manifest.yaml" files.
            - name: MODEL_PATH
              value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
            - name: DISABLE_GUARDRAILS
              value: "false" # "true" to disable guardrails.
            - name: TRT_LLM_MODE
              value: "" # int4_awq (default), int8 or fp16. (for VILA only)
            - name: VLM_BATCH_SIZE
              value: "" # Default is determined based on GPU memory. (for VILA only)
            - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
              value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
            - name: VIA_VLM_ENDPOINT
              value: "" # Default OpenAI API. Override to use a custom API
            - name: VIA_VLM_API_KEY
              value: "" # API key to set when calling VIA_VLM_ENDPOINT
            - name: OPENAI_API_VERSION
              value: ""
            - name: AZURE_OPENAI_API_VERSION
              value: ""
  resources:
    limits:
      nvidia.com/gpu: 2 # Set to 8 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-1>
nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.2.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
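If the chart is already installed and only the overrides changed, a helm upgrade along these lines can be used instead of reinstalling (a sketch; whether an in-place upgrade is sufficient depends on which values changed, and a clean reinstall remains the safer path):
# Optional: apply changed overrides to an existing release
sudo microk8s helm upgrade vss-blueprint nvidia-blueprint-vss-2.2.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml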
Note
If you have issues with multi-node deployment, try the following:
- Set nodeSelector on each service as shown above when deploying.
- Delete existing PVCs before redeploying:
sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc
Optional Deployment Topology with GPU sharing#
You can achieve the deployment topology below by applying the Helm overrides file.
Note
This is only for H100 / A100 (80+ GB device memory).
Microservice/Pod | Description | Default #GPU Allocation
---|---|---
vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 (index: 0,1,2,3)
vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is VILA1.5) + Retrieval Pipeline (CA RAG/LLM: llama3.1 + nemo embedding + nemo reranking) | 4 (index: 4,5,6,7)
nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 (index: 4)
nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 (index: 4)
etcd-etcd-deployment, milvus-milvus-deployment, neo4j-neo4j-deployment | Milvus and Neo4j databases used for vector and graph storage. | NA
The following Helm overrides.yaml file changes the deployment topology to use GPU sharing.
# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
  env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0,1,2,3"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
            - name: VLM_MODEL_TO_USE
              value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
            - name: MODEL_PATH
              value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
            - name: DISABLE_GUARDRAILS
              value: "false" # "true" to disable guardrails.
            - name: TRT_LLM_MODE
              value: "" # int4_awq (default), int8 or fp16. (for VILA only)
            - name: VLM_BATCH_SIZE
              value: "" # Default is determined based on GPU memory. (for VILA only)
            - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
              value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
            - name: VIA_VLM_ENDPOINT
              value: "" # Default OpenAI API. Override to use a custom API
            - name: VIA_VLM_API_KEY
              value: "" # API key to set when calling VIA_VLM_ENDPOINT
            - name: OPENAI_API_VERSION
              value: ""
            - name: AZURE_OPENAI_API_VERSION
              value: ""
            - name: NVIDIA_VISIBLE_DEVICES
              value: "4,5,6,7"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: NGC_API_KEY
                  name: ngc-api-key-secret
            - name: NVIDIA_VISIBLE_DEVICES
              value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: NGC_API_KEY
                  name: ngc-api-key-secret
            - name: NVIDIA_VISIBLE_DEVICES
              value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.2.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
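To confirm that the GPU sharing topology took effect, you can check which GPUs each container sees, for example by running nvidia-smi inside the VSS pod (a sketch; it assumes nvidia-smi is available in the container image, which is typical for NVIDIA GPU containers, and that the pod name starts with vss-vss-deployment):
# Optional: list the GPUs visible inside the VSS container
VSS_POD=$(sudo microk8s kubectl get pod -o name | grep vss-vss-deployment)
sudo microk8s kubectl exec "$VSS_POD" -- nvidia-smi -L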
Note
Limitation
This approach bypasses the Kubernetes device plugin resource allocation.
The Helm Chart might not work in managed Kubernetes services (e.g. AKS) where this environment variable is not allowed.
On clusters where CDI is enabled, this variable is ignored and GPUs are allocated randomly, so GPU sharing won't work as expected.