Deploy Using Helm#
Before proceeding, make sure all of the prerequisites have been met.
Create Required Secrets#
These secrets allow your Kubernetes applications to securely access NVIDIA resources and your database without hardcoding credentials in your application code or container images.
To deploy the secrets required by the VSS Blueprint:
Note
If using microk8s, prepend the kubectl commands with sudo microk8s. For example, sudo microk8s kubectl ....
To join the microk8s group for admin access (so that sudo is not needed), and for other information about microk8s setup and usage, see: https://microk8s.io/docs/getting-started.
If not using microk8s, you may use kubectl directly. For example, kubectl get pod.
# Export NGC_API_KEY
export NGC_API_KEY=<YOUR_LEGACY_NGC_API_KEY>
# Create credentials for pulling images from NGC (nvcr.io)
sudo microk8s kubectl create secret docker-registry ngc-docker-reg-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=$NGC_API_KEY
# Configure login information for Neo4j graph database
sudo microk8s kubectl create secret generic graph-db-creds-secret \
--from-literal=username=neo4j --from-literal=password=password
# Configure the legacy NGC API key for downloading models from NGC
sudo microk8s kubectl create secret generic ngc-api-key-secret \
--from-literal=NGC_API_KEY=$NGC_API_KEY
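Optionally, you can confirm that the three secrets were created by listing them by name (the names below match the commands above; prepend sudo microk8s if using microk8s):
# Verify that the secrets exist
kubectl get secret ngc-docker-reg-secret graph-db-creds-secret ngc-api-key-secret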
Deploy the Helm Chart#
To deploy the VSS Blueprint Helm Chart:
# Fetch the VSS Blueprint Helm Chart
sudo microk8s helm fetch \
https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-vss-2.3.0.tgz \
--username='$oauthtoken' --password=$NGC_API_KEY
# Install the Helm Chart
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret
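Optionally, you can confirm that the release was registered with Helm by checking its status:
# Check the status of the installed release
sudo microk8s helm status vss-blueprint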
Note
When running on an L40 / L40S system, the default startup probe timeout might not be enough for
the VILA model to be downloaded and its TRT-LLM engine to be built. To prevent a startup timeout,
increase the startup probe threshold by passing --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360
to the helm install command, or by adding the following to the overrides file:
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
startupProbe:
failureThreshold: 360
Note
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
Note
Audio and CV metadata are not enabled by default. To enable them, see Enabling Audio and Enabling CV Pipeline: Set-Of-Marks (SOM) & Metadata.
Note
Please check the FAQ page for deployment failure scenarios and the corresponding troubleshooting instructions and commands.
Wait for all services to be up. This may take some time (a few minutes to up to an hour) depending on the setup and configuration. Subsequent deployments should be faster since the models are cached. Make sure all pods are in Running or Completed STATUS and show 1/1 as READY. You can monitor the services using the following command:
sudo watch -n1 microk8s kubectl get pod
watch will refresh the output every second. Wait for all pods to be in Running or Completed STATUS and to show 1/1 as READY.
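Alternatively, instead of watching, you can block until the main VSS deployment rollout completes (assuming the VSS pod comes from the vss-vss-deployment deployment, as listed in the topology table below; adjust the timeout for your setup):
# Wait for the VSS deployment rollout to finish
sudo microk8s kubectl rollout status deployment/vss-vss-deployment --timeout=3600s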
To make sure the VSS UI is ready and accessible, check the logs for the deployment using the following command:
sudo microk8s kubectl logs -l app.kubernetes.io/name=vss
Make sure the logs below are present and that no errors are reported:
Application startup complete.
Uvicorn running on http://0.0.0.0:9000
If a lot of time has passed since VSS started, kubectl logs might have cleared older logs. In this case, look for:
INFO: 10.78.15.132:48016 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.78.15.132:50386 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.78.15.132:50388 - "GET /health/live HTTP/1.1" 200 OK
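To follow the logs live and limit how much history is shown, a command along these lines can be used:
# Follow the most recent VSS logs; adjust --tail as needed
sudo microk8s kubectl logs -l app.kubernetes.io/name=vss -f --tail=100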
Uninstalling the deployment#
To uninstall the deployment, run the following command:
sudo microk8s helm uninstall vss-blueprint
Default Deployment Topology and Models in Use#
The default deployment topology is as follows.
This is the topology that you see when checking deployment status using sudo microk8s kubectl get pod.
| Microservice/Pod | Description | Default #GPU Allocation |
|---|---|---|
| vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 |
| vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is VILA1.5) + Retrieval Pipeline (CA-RAG) | 2 |
| nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 |
| nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 |
| etcd-etcd-deployment, milvus-milvus-deployment, neo4j-neo4j-deployment | Milvus and Neo4j databases used for vector and graph storage. | NA |
Launch VSS UI#
Find the ports where the VSS REST API and UI server are running:
sudo microk8s kubectl get svc vss-service
NAME          TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)                         AGE
vss-service   NodePort   <CLUSTER_IP>   <none>        8000:32114/TCP,9000:32206/TCP   12m
The NodePorts corresponding to ports 8000 and 9000 are for the REST API (VSS_API_ENDPOINT) and the UI respectively.
The VSS UI is available at the NodePort corresponding to 9000.
In this example, the VSS UI can be accessed by opening http://<NODE_IP>:32206 in a browser. The VSS REST API address is http://<NODE_IP>:32114.
Note
The <NODE_IP> is the IP address of the machine where the Helm Chart, specifically the pod named vss-vss-deployment, is deployed.
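If needed, the node IP and the NodePort mapped to a given port can also be looked up with kubectl, for example (assuming the service is named vss-service as above):
# IP addresses of the cluster nodes
sudo microk8s kubectl get node -o wide
# NodePort mapped to port 9000 (the UI) on vss-service
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==9000)].nodePort}'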
Next, test the deployment by summarizing a sample video. Continue to Configuration Options and VSS Customization when you are ready to customize VSS.
If you run into any errors, refer to the FAQ and Known Issues sections.
Configuration Options#
Some options in the Helm Chart can be configured using the Helm overrides file.
These options include:
VSS deployment time configurations. More info: VSS Deployment-Time Configuration Glossary.
Changing the default models in the VSS helm chart. More info: Plug-and-Play Overview.
An example of the overrides.yaml file:
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
# Update to override with custom VSS image
# Set imagePullSecrets if the custom image is hosted on a private registry
image:
repository: nvcr.io/nvidia/blueprint/vss-engine
tag: 2.3.0
env:
- name: VLM_MODEL_TO_USE
value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
# Specify path in case of VILA-1.5 and custom model. Can be either
# a NGC resource path or a local path. For custom models this
# must be a path to the directory containing "inference.py" and
# "manifest.yaml" files.
- name: MODEL_PATH
value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
#- name: VILA_ENGINE_NGC_RESOURCE # Enable to use prebuilt engines from NGC
# value: "nvidia/blueprint/vss-vlm-prebuilt-engine:2.3.0-vila-1.5-40b-h100-sxm"
# - name: DISABLE_GUARDRAILS
# value: "false" # "true" to disable guardrails.
# - name: TRT_LLM_MODE
# value: "" # int4_awq (default), int8 or fp16. (for VILA only)
# - name: VLM_BATCH_SIZE
# value: "" # Default is determined based on GPU memory. (for VILA only)
# - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
# value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
# - name: VIA_VLM_ENDPOINT
# value: "" # Default OpenAI API. Override to use a custom API
# - name: VIA_VLM_API_KEY
# value: "" # API key to set when calling VIA_VLM_ENDPOINT. Can be set from a secret.
# - name: OPENAI_API_VERSION
# value: ""
# - name: AZURE_OPENAI_ENDPOINT
# value: ""
# - name: AZURE_OPENAI_API_VERSION
# value: ""
# - name: VSS_LOG_LEVEL
# value: "info"
# - name: VSS_EXTRA_ARGS
# value: ""
# - name: INSTALL_PROPRIETARY_CODECS
# value: "true" # Requires root permissions in the container.
# - name: DISABLE_FRONTEND
# value: "false"
# - name: DISABLE_CA_RAG
# value: "false"
# - name: VLM_BATCH_SIZE # Applicable only to VILA-1.5 and NVILA models
# value: ""
# - name: ENABLE_VIA_HEALTH_EVAL
# value: "true"
# - name: ENABLE_DENSE_CAPTION
# value: "true"
# - name: VSS_DISABLE_LIVESTREAM_PREVIEW
# value: "1"
# - name: VSS_SKIP_INPUT_MEDIA_VERIFICATION
# value: "1"
# - name: VSS_RTSP_LATENCY
# value: "2000"
# - name: VSS_RTSP_TIMEOUT
# value: "2000"
resources:
limits:
nvidia.com/gpu: 2 # Set to 8 for 2 x 8H100 node deployment
# nodeSelector:
# kubernetes.io/hostname: <node-1>
# imagePullSecrets:
# - name: <imagePullSecretName>
nim-llm:
resources:
limits:
nvidia.com/gpu: 4
# nodeSelector:
# kubernetes.io/hostname: <node-2>
nemo-embedding:
resources:
limits:
nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
# nodeSelector:
# kubernetes.io/hostname: <node-2>
nemo-rerank:
resources:
limits:
nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
# nodeSelector:
# kubernetes.io/hostname: <node-2>
Note
The overrides.yaml file must be created. The helm chart package does not include it.
Note
The overrides.yaml file provided above has nvidia.com/gpu limits set for each service to match the default helm chart deployment topology.
Change these limits as needed by editing the nvidia.com/gpu value in the resources / limits section for each of the services: vss, nim-llm, nemo-embedding, and nemo-rerank.
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
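If the chart is already installed and you want to apply changed overrides without uninstalling first, a helm upgrade along these lines may work (this workflow is not covered by the steps above; when in doubt, uninstall and reinstall instead):
# Apply updated overrides to an existing release
sudo microk8s helm upgrade vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml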
Note
For an exhaustive list of all configuration options, see VSS Deployment-Time Configuration Glossary.
Optional Deployment Topology with GPU sharing#
You can achieve the below deployment topology by applying the Helm overrides file.
Note
This is only for H100 / H200 / A100 (80+ GB device memory).
| Microservice/Pod | Description | Default #GPU Allocation |
|---|---|---|
| vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 (index: 0,1,2,3) |
| vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is VILA1.5) + Retrieval Pipeline (CA-RAG) | 4 (index: 4,5,6,7) |
| nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 (index: 4) |
| nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 (index: 4) |
| etcd-etcd-deployment, milvus-milvus-deployment, neo4j-neo4j-deployment | Milvus and Neo4j databases used for vector and graph storage. | NA |
The following Helm overrides.yaml file changes the deployment topology to use GPU sharing:
# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "0,1,2,3"
- name: NIM_MAX_MODEL_LEN
value: "128000"
resources:
limits:
nvidia.com/gpu: 0 # no limit
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
env:
- name: VLM_MODEL_TO_USE
value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
- name: MODEL_PATH
value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
- name: DISABLE_GUARDRAILS
value: "false" # "true" to disable guardrails.
- name: TRT_LLM_MODE
value: "" # int4_awq (default), int8 or fp16. (for VILA only)
- name: VLM_BATCH_SIZE
value: "" # Default is determined based on GPU memory. (for VILA only)
- name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
- name: VIA_VLM_ENDPOINT
value: "" # Default OpenAI API. Override to use a custom API
- name: VIA_VLM_API_KEY
value: "" # API key to set when calling VIA_VLM_ENDPOINT
- name: OPENAI_API_VERSION
value: ""
- name: AZURE_OPENAI_API_VERSION
value: ""
- name: NVIDIA_VISIBLE_DEVICES
value: "4,5,6,7"
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-embedding:
applicationSpecs:
embedding-deployment:
containers:
embedding-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-rerank:
applicationSpecs:
ranking-deployment:
containers:
ranking-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0 # no limit
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Next, Wait for VSS to be ready and then Launch VSS UI.
Note
Limitation
Setting NVIDIA_VISIBLE_DEVICES bypasses the Kubernetes device plugin resource allocation.
The Helm chart might not work in managed Kubernetes services (e.g., AKS) where this environment variable is not allowed.
On Kubernetes clusters where CDI is enabled, this variable is ignored and GPUs are allocated randomly, so the deployment will not work as expected.
Configuring GPU Allocation#
As mentioned earlier, the default helm chart deployment topology is configured for 8xGPUs with each GPU being used by a single service.
To customize the default helm deployment for various GPU configurations, modify the NVIDIA_VISIBLE_DEVICES environment variable for each of the services in the overrides.yaml file shown below.
Additionally, nvidia.com/gpu: 0 must be set to disable GPU allocation by the GPU operator.
# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "0,1,2,3"
- name: NIM_MAX_MODEL_LEN
value: "128000"
resources:
limits:
nvidia.com/gpu: 0 # no limit
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
env:
- name: VLM_MODEL_TO_USE
value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
- name: MODEL_PATH
value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
- name: DISABLE_GUARDRAILS
value: "false" # "true" to disable guardrails.
- name: TRT_LLM_MODE
value: "" # int4_awq (default), int8 or fp16. (for VILA only)
- name: VLM_BATCH_SIZE
value: "" # Default is determined based on GPU memory. (for VILA only)
- name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
- name: VIA_VLM_ENDPOINT
value: "" # Default OpenAI API. Override to use a custom API
- name: VIA_VLM_API_KEY
value: "" # API key to set when calling VIA_VLM_ENDPOINT
- name: OPENAI_API_VERSION
value: ""
- name: AZURE_OPENAI_API_VERSION
value: ""
- name: NVIDIA_VISIBLE_DEVICES
value: "4,5,6,7"
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-embedding:
applicationSpecs:
embedding-deployment:
containers:
embedding-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-rerank:
applicationSpecs:
ranking-deployment:
containers:
ranking-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0 # no limit
NVIDIA_VISIBLE_DEVICES must be set based on:
The number of GPUs available on the system.
GPU requirements for each service.
When using the VILA-1.5 VLM, VSS requires at least 1 GPU on 80+ GB GPUs (A100, H100, H200) and at least 2 GPUs on 48 GB GPUs (L40S).
When using NVILA VLM or a remote VLM endpoint, VSS requires at least 1 GPU (A100, H100, H200, L40s).
Embedding and Reranking require 1 GPU each but can share a GPU with VSS on an 80+ GB GPU.
Check NVIDIA NIM for Large Language Models (LLMs) documentation for LLM GPU requirements.
GPUs can be shared even further by using the low memory modes and smaller VLM and LLM models as shown in Fully Local Single GPU Deployment.
If using external endpoints for any of the services (see Using External Endpoints), GPUs will not be used by these services and GPU requirements will be further reduced.
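To see how many GPUs (and how much memory) a node has before choosing NVIDIA_VISIBLE_DEVICES values, you can run nvidia-smi on that node, for example:
# List GPU index, name, and total memory on the node
nvidia-smi --query-gpu=index,name,memory.total --format=csv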
Fully Local Single GPU Deployment#
A Single GPU Deployment recipe using non-default low memory modes and smaller LLMs, verified on a 1 x H100, 1 x H200, or 1 x A100 (80 GB+, HBM) machine, is available below.
This deployment downloads and runs the VLM, LLM, Embedding, and Reranker models locally on a single GPU.
The configuration:
Sets all services (VSS, LLM, embedding, reranking) to share GPU 0
Enables low memory mode and relaxed memory constraints for the LLM
Uses a smaller LLM model (llama-3.1-8b-instruct) suitable for single GPU deployment
Configures the VSS engine to use NVILA model for vision tasks
Sets appropriate init containers to ensure services start in the correct order
Note
CV and audio related features are currently not supported in Single GPU deployment.
The following overrides file can be used to deploy VSS Helm Chart on a single GPU.
nim-llm:
image:
repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
tag: 1.3.3
llmModel: meta/llama-3.1-8b-instruct
model:
name: meta/llama-3.1-8b-instruct
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "0"
- name: NIM_LOW_MEMORY_MODE
value: "1"
- name: NIM_RELAX_MEM_CONSTRAINTS
value: "1"
resources:
limits:
nvidia.com/gpu: 0 # no limit
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
image:
pullPolicy: IfNotPresent
repository: nvcr.io/nvidia/blueprint/vss-engine
tag: 2.3.0
env:
- name: VLM_MODEL_TO_USE
value: nvila # Or "openai-compat" or "custom" or "nvila"
- name: MODEL_PATH
value: "git:https://huggingface.co/Efficient-Large-Model/NVILA-15B"
- name: NVIDIA_VISIBLE_DEVICES
value: "0"
- name: CA_RAG_EMBEDDINGS_DIMENSION
value: "500"
initContainers:
- command:
- sh
- -c
- until nc -z -w 2 milvus-milvus-deployment-milvus-service 19530; do echo
waiting for milvus; sleep 2; done
image: busybox:1.28
imagePullPolicy: IfNotPresent
name: check-milvus-up
- command:
- sh
- -c
- until nc -z -w 2 neo-4-j-service 7687; do echo waiting for neo4j; sleep
2; done
image: busybox:1.28
imagePullPolicy: IfNotPresent
name: check-neo4j-up
- args:
- "while ! curl -s -f -o /dev/null http://nemo-embedding-embedding-deployment-embedding-service:8000/v1/health/live;\
\ do\n echo \"Waiting for nemo-embedding...\"\n sleep 2\ndone\n"
command:
- sh
- -c
image: curlimages/curl:latest
imagePullPolicy: IfNotPresent
name: check-nemo-embed-up
- args:
- "while ! curl -s -f -o /dev/null http://nemo-rerank-ranking-deployment-ranking-service:8000/v1/health/live;\
\ do\n echo \"Waiting for nemo-rerank...\"\n sleep 2\ndone\n"
command:
- sh
- -c
image: curlimages/curl:latest
imagePullPolicy: IfNotPresent
name: check-nemo-rerank-up
- args:
- "while ! curl -s -f -o /dev/null http://llm-nim-svc:8000/v1/health/live;\
\ do\n echo \"Waiting for LLM...\"\n sleep 2\ndone\n"
command:
- sh
- -c
image: curlimages/curl:latest
name: check-llm-up
llmModel: meta/llama-3.1-8b-instruct
llmModelChat: meta/llama-3.1-8b-instruct
configs:
ca_rag_config.yaml:
chat:
llm:
model: meta/llama-3.1-8b-instruct
notification:
llm:
model: meta/llama-3.1-8b-instruct
summarization:
llm:
model: meta/llama-3.1-8b-instruct
guardrails_config.yaml:
models:
- engine: nim
model: meta/llama-3.1-8b-instruct
parameters:
base_url: http://llm-nim-svc:8000/v1
type: main
- engine: nim
model: nvidia/llama-3.2-nv-embedqa-1b-v2
parameters:
base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
type: embeddings
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-embedding:
applicationSpecs:
embedding-deployment:
containers:
embedding-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '0'
- name: NIM_MODEL_PROFILE
value: "f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" # model profile: fp16-onnx-onnx
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-rerank:
applicationSpecs:
ranking-deployment:
containers:
ranking-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '0'
- name: NIM_MODEL_PROFILE
value: "f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" # model profile: fp16-onnx-onnx
resources:
limits:
nvidia.com/gpu: 0 # no limit
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Next, Wait for VSS to be ready and then Launch VSS UI.
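Since all services share GPU 0 in this recipe, once everything is running you can sanity-check the processes and memory usage on the GPU from the node, for example:
# Show compute processes and their GPU memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv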
Enabling Audio#
The following overrides are required to enable audio in Summarization and Q&A:
# 4 + 3 + 1 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 5,6,7
# embedding on GPU 4
# reranking on GPU 4
# riva ASR on GPU 4
nim-llm:
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "0,1,2,3"
- name: NIM_MAX_MODEL_LEN
value: "128000"
resources:
limits:
nvidia.com/gpu: 0 # no limit
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
env:
- name: VLM_MODEL_TO_USE
value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
- name: MODEL_PATH
value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
- name: INSTALL_PROPRIETARY_CODECS
value: "true"
- name: NVIDIA_VISIBLE_DEVICES
value: "5,6,7"
- name: ENABLE_AUDIO
value: "true"
- name: ENABLE_RIVA_SERVER_READINESS_CHECK
value: "true"
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-embedding:
applicationSpecs:
embedding-deployment:
containers:
embedding-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-rerank:
applicationSpecs:
ranking-deployment:
containers:
ranking-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0 # no limit
riva:
enabled: true
applicationSpecs:
riva-deployment:
containers:
riva-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NIM_HTTP_API_PORT
value: '9000'
- name: NIM_GRPC_API_PORT
value: '50051'
- name: NIM_TAGS_SELECTOR
value: name=parakeet-0-6b-ctc-riva-en-us,mode=all
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0
Things to note in the above overrides file:
The overrides file assumes a specific GPU topology (documented at the beginning of the overrides file).
The nvidia.com/gpu limit for all microservices is set to 0 (meaning no limit on GPU allocation). GPU allocation to each microservice is instead handled using the NVIDIA_VISIBLE_DEVICES environment variable.
The riva microservice is enabled and configured to use GPU 4, which is shared with the VSS, embedding, and reranking pods.
The riva microservice is configured to use the parakeet-0-6b-ctc-riva-en-us model.
VSS has audio enabled by setting ENABLE_AUDIO to true.
ENABLE_RIVA_SERVER_READINESS_CHECK is set to true. This enables a readiness check for the Riva ASR server at VSS startup.
VSS has proprietary codecs enabled by setting INSTALL_PROPRIETARY_CODECS to true. This will install additional open source and proprietary codecs; review their license terms. This is required for additional audio codec support.
Installing the additional codecs and open source packages requires root permissions. This can be done by setting the following in the overrides file to run the container as root. Alternatively, a Custom Container Image with Codecs Installed can be used.
...
vss:
applicationSpecs:
vss-deployment:
securityContext:
fsGroup: 0
runAsGroup: 0
runAsUser: 0
...
Note
This has been tested with 8 x H100 GPUs.
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Next, Wait for VSS to be ready and then Launch VSS UI.
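Once deployed with these overrides, you can check that the Riva ASR pod has started (based on the riva-deployment spec above, the pod name should contain "riva"):
# Look for the Riva ASR pod in the pod list
sudo microk8s kubectl get pod | grep -i riva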
Enabling CV Pipeline: Set-Of-Marks (SOM) & Metadata#
The following overrides are required to enable the CV pipeline:
# 4 + 3 + 1 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "0,1,2,3"
- name: NIM_MAX_MODEL_LEN
value: "128000"
resources:
limits:
nvidia.com/gpu: 0 # no limit
vss:
applicationSpecs:
vss-deployment:
securityContext:
fsGroup: 0
runAsGroup: 0
runAsUser: 0
containers:
vss:
env:
- name: VLM_MODEL_TO_USE
value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
- name: MODEL_PATH
value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
- name: NVIDIA_VISIBLE_DEVICES
value: "5,6,7"
- name: INSTALL_PROPRIETARY_CODECS
value: "true"
- name: DISABLE_CV_PIPELINE
value: "false"
- name: GDINO_INFERENCE_INTERVAL
value: "1"
- name: NUM_CV_CHUNKS_PER_GPU
value: "2"
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-embedding:
applicationSpecs:
embedding-deployment:
containers:
embedding-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0 # no limit
nemo-rerank:
applicationSpecs:
ranking-deployment:
containers:
ranking-container:
env:
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
key: NGC_API_KEY
name: ngc-api-key-secret
- name: NVIDIA_VISIBLE_DEVICES
value: '4'
resources:
limits:
nvidia.com/gpu: 0 # no limit
Things to note in the above overrides file:
The overrides file assumes a specific GPU topology (documented at the beginning of the overrides file).
The nvidia.com/gpu limit for all microservices is set to 0 (meaning no limit on GPU allocation). GPU allocation to each microservice is instead handled using the NVIDIA_VISIBLE_DEVICES environment variable.
VSS has the CV pipeline enabled by setting DISABLE_CV_PIPELINE to false. This will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
VSS has proprietary codecs enabled by setting INSTALL_PROPRIETARY_CODECS to true. This will install additional open source and proprietary codecs; review their license terms. This is required for the Set-Of-Marks overlay preview.
Installing the additional codecs and open source packages requires root permissions. This can be done by setting the following in the overrides file to run the container as root. Alternatively, a Custom Container Image with Codecs Installed can be used.
...
vss:
applicationSpecs:
vss-deployment:
securityContext:
fsGroup: 0
runAsGroup: 0
runAsUser: 0
...
Note
This has been tested with 8 x H100 GPUs.
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Next, Wait for VSS to be ready and then Launch VSS UI.
Multi-Node Deployment#
Multi-node deployments can be used in cases where more resources (e.g., GPUs) are required than are available on a single node. For example, an 8 x LLM GPU + 6 x VLM GPU + 1 x Embedding + 1 x Reranking topology can be deployed on two 8xH100 nodes.
While services can be distributed across multiple nodes, each individual service and its associated containers must run entirely on a single node that has sufficient resources. Services cannot be split across multiple nodes to utilize their combined resources.
By default, Kubernetes schedules the pods based on resource availability automatically.
To explicitly schedule a particular pod on a particular node, run the following commands. This example schedules the VSS, embedding, and reranking services on the second node.
sudo microk8s kubectl get node
helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set vss.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>" \
--set nemo-embedding.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>" \
--set nemo-rerank.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>"
This can be done for any pod by adding the nodeSelector to the overrides file.
...
<service-name>:
nodeSelector:
"kubernetes.io/hostname": "<Name of Node #2>"
...
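After deployment, you can verify which node each pod was scheduled on:
# Show pods together with the node they are running on
sudo microk8s kubectl get pod -o wide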
Note
If you have issues with multi-node deployment, try the following:
Try setting nodeSelector on each service as shown above when deploying.
Try deleting existing PVCs before redeploying:
sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc
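If needed, you can first list the existing PVCs to confirm their names before deleting them:
# List existing persistent volume claims
sudo microk8s kubectl get pvc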