Deploy Using Helm#
Before proceeding, ensure all prerequisites have been met.
Create Required Secrets#
These secrets allow your Kubernetes applications to securely access NVIDIA resources and your database without hardcoding credentials in your application code or container images.
To deploy the secrets required by the VSS Blueprint:
Note
If using microk8s, prepend the kubectl and helm commands with sudo microk8s. For example, sudo microk8s kubectl ...
For information about joining the microk8s group for admin access (so that sudo is not required), and for other microk8s setup and usage details, review https://microk8s.io/docs/getting-started.
If not using microk8s, you can use kubectl directly. For example, kubectl get pod.
# Export NGC_API_KEY
export NGC_API_KEY=<YOUR_LEGACY_NGC_API_KEY>
# Create credentials for pulling images from NGC (nvcr.io)
sudo microk8s kubectl create secret docker-registry ngc-docker-reg-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=$NGC_API_KEY
# Configure login information for Neo4j graph database
sudo microk8s kubectl create secret generic graph-db-creds-secret \
--from-literal=username=neo4j --from-literal=password=password
# Configure login information for ArangoDB graph database
# Note: Need to keep username as root for ArangoDB to work.
sudo microk8s kubectl create secret generic arango-db-creds-secret \
--from-literal=username=root --from-literal=password=password
# Configure login information for MinIO object storage
sudo microk8s kubectl create secret generic minio-creds-secret \
--from-literal=access-key=minio --from-literal=secret-key=minio123
# Configure the legacy NGC API key for downloading models from NGC
sudo microk8s kubectl create secret generic ngc-api-key-secret \
--from-literal=NGC_API_KEY=$NGC_API_KEY
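To verify that the secrets were created, you can list them (a quick optional check, assuming the default namespace is used):
# Confirm the secrets exist
sudo microk8s kubectl get secret ngc-docker-reg-secret graph-db-creds-secret arango-db-creds-secret minio-creds-secret ngc-api-key-secret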
Deploy the Helm Chart#
To deploy the VSS Blueprint Helm Chart:
# Fetch the VSS Blueprint Helm Chart
sudo microk8s helm fetch \
https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-vss-2.4.0.tgz \
--username='$oauthtoken' --password=$NGC_API_KEY
# Install the Helm Chart
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret
# For B200
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set nim-llm.profile=f17543bf1ee65e4a5c485385016927efe49cbc068a6021573d83eacb32537f76
# For H200
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set nim-llm.profile=99142c13a095af184ae20945a208a81fae8d650ac0fd91747b03148383f882cf
# For RTX Pro 6000 Blackwell
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set nim-llm.image.tag=1.13.1 \
--set nim-llm.profile=f51a862830b10eb7d0d2ba51184d176a0a37674fef85300e4922b924be304e2b
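Optionally, confirm that the release was registered (a quick check; the exact output depends on your Helm version):
# List installed Helm releases and their status
sudo microk8s helm list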
Note
Cosmos-Reason1 7b FP8 (default) is not supported on L40s. Use Cosmos-Reason1 7b FP16 instead by setting MODEL_PATH to git:https://huggingface.co/nvidia/Cosmos-Reason1-7B in the Helm overrides file as shown in Configuration Options.
Note
For more information on the LLM NIM version and profiles, refer to NIM Model Profile Optimization.
Note
When running on an L40 or L40S system, the default startup probe timeout might not be enough for the VILA model to be downloaded and its TRT-LLM engine to be built. To prevent a startup timeout, increase the startup probe failure threshold by passing --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360 to the Helm install command, or by adding the following to the overrides file:
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          startupProbe:
            failureThreshold: 360
Note
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
Audio and CV metadata are not enabled by default. To enable them, refer to Enabling Audio and Enabling CV Pipeline: Set-of-Marks (SOM) and Metadata.
Check the FAQ page for deployment failure scenarios and the corresponding troubleshooting instructions and commands.
Wait for all services to be up. This can take some time (from a few minutes up to an hour) depending on the setup and configuration. Deploying a second time onwards is typically faster because the models are cached. Ensure all pods are in Running or Completed STATUS and show 1/1 as READY. You can monitor the services using the following command:
sudo watch -n1 microk8s kubectl get pod
The watch command refreshes the output every second.
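As an alternative to watching manually, you can block until the pods report Ready (a sketch; adjust the timeout to your setup, and note that pods of one-shot jobs show Completed rather than Ready and may need to be excluded):
# Wait up to one hour for pods in the default namespace to become Ready
sudo microk8s kubectl wait pod --all --for=condition=Ready --timeout=3600s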
To ensure the VSS UI is ready and accessible, check the deployment logs using the following command:
sudo microk8s kubectl logs -l app.kubernetes.io/name=vss
Verify that the following logs are present and that you do not observe errors:
Application startup complete.
Uvicorn running on http://0.0.0.0:9000
If a lot of time has passed since VSS started, kubectl logs might no longer show the older startup logs. In this case, look for recent health check lines like:
INFO: 10.78.15.132:48016 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.78.15.132:50386 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.78.15.132:50388 - "GET /health/live HTTP/1.1" 200 OK
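To filter for these lines directly, you can pipe the logs through grep (a convenience sketch using standard kubectl and grep options):
# Show only startup and health-check lines from recent VSS logs
sudo microk8s kubectl logs -l app.kubernetes.io/name=vss --tail=200 | grep -E 'Uvicorn running|health/(ready|live)'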
Uninstalling the Deployment#
To uninstall the deployment, run the following command:
sudo microk8s helm uninstall vss-blueprint
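After uninstalling, you can optionally confirm that the pods have terminated and, if you also want to clear cached models, delete leftover persistent volume claims (the PVC names below are the ones referenced later on this page; verify the names on your cluster first):
# Confirm the VSS pods are gone
sudo microk8s kubectl get pod
# Optional: list and delete leftover PVCs to remove cached models
sudo microk8s kubectl get pvc
sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc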
Default Deployment Topology and Models in Use#
The default deployment topology is as follows. This is the topology you observe when checking deployment status using sudo microk8s kubectl get pod:
| Microservice/Pod | Description | Default #GPU Allocation |
|---|---|---|
| vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 |
| vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is Cosmos-Reason1) + Retrieval Pipeline (CA-RAG) | 2 |
| nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 |
| nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 |
| etcd, milvus, neo4j, minio, arango-db, elastic-search | Various databases, data stores and supporting services | NA |
Launch VSS UI#
Follow these steps to launch the VSS UI:
Find the service ports:
Run the following command to get the service ports. The output of the command can vary for your setup but it should look similar to this:
sudo microk8s kubectl get svc vss-service
# Example output:
# vss-service   NodePort   <CLUSTER_IP>   <none>   8000:32114/TCP,9000:32206/TCP   12m
Identify the NodePorts:
Using the output, identify the NodePorts:
Port 8000 corresponds to the REST API (VSS_API_ENDPOINT); in this example, it is mapped to node port 32114.
Port 9000 corresponds to the UI; in this example, it is mapped to node port 32206.
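You can also extract the NodePorts directly with a JSONPath query (a sketch; service ports 8000 and 9000 are the ones shown in the example output above):
# Print the NodePort mapped to the REST API (service port 8000)
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==8000)].nodePort}'
# Print the NodePort mapped to the UI (service port 9000)
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==9000)].nodePort}'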
Access the VSS UI:
Open your browser and navigate to http://<NODE_IP>:32206. Optionally, the VSS REST API is available at http://<NODE_IP>:32114.
Note
<NODE_IP> is the IP address of the machine where the vss-vss-deployment pod of the Helm Chart is deployed.
Test the deployment by summarizing a sample video.
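To confirm from the command line that the REST API is responding, you can query the health endpoint seen in the logs above (a quick sketch; substitute your node IP and REST API NodePort):
# Expect an HTTP 200 response when VSS is ready
curl -s -o /dev/null -w "%{http_code}\n" http://<NODE_IP>:32114/health/ready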
Continue to Configuration Options and VSS Customization when you are ready to customize VSS.
If you run into any errors, refer to the sections FAQ and Known Issues.
Configuration Options#
Some options in the Helm Chart can be configured using the Helm overrides file.
These options include:
VSS deployment time configurations. More info: VSS Deployment-Time Configuration Glossary.
Changing the default models in the VSS Helm Chart. More info: Plug-and-Play Overview.
An example of the overrides.yaml file:
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          # Update to override with custom VSS image
          # Set imagePullSecrets if the custom image is hosted on a private registry
          image:
            repository: nvcr.io/nvidia/blueprint/vss-engine
            tag: 2.4.0
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          # Specify path in case of VILA-1.5 / NVILA / Cosmos-Reason1 and custom model. Can be either
          # a NGC resource path or a local path. For custom models this
          # must be a path to the directory containing "inference.py" and
          # "manifest.yaml" files.
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          #- name: VILA_ENGINE_NGC_RESOURCE # Enable to use prebuilt engines from NGC
          #  value: "nvidia/blueprint/vss-vlm-prebuilt-engine:2.3.0-vila-1.5-40b-h100-sxm"
          # - name: DISABLE_GUARDRAILS
          #   value: "false" # "true" to disable guardrails.
          # - name: TRT_LLM_MODE
          #   value: "" # int4_awq (default), int8 or fp16. (for VILA only)
          # - name: VLM_BATCH_SIZE
          #   value: "" # Default is determined based on GPU memory. (for VILA only)
          # - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
          #   value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
          # - name: VIA_VLM_ENDPOINT
          #   value: "" # Default OpenAI API. Override to use a custom API
          # - name: VIA_VLM_API_KEY
          #   value: "" # API key to set when calling VIA_VLM_ENDPOINT. Can be set from a secret.
          # - name: OPENAI_API_VERSION
          #   value: ""
          # - name: AZURE_OPENAI_ENDPOINT
          #   value: ""
          # - name: AZURE_OPENAI_API_VERSION
          #   value: ""
          # - name: VSS_LOG_LEVEL
          #   value: "info"
          # - name: VSS_EXTRA_ARGS
          #   value: ""
          # - name: INSTALL_PROPRIETARY_CODECS
          #   value: "true" # Requires root permissions in the container.
          # - name: DISABLE_FRONTEND
          #   value: "false"
          # - name: DISABLE_CA_RAG
          #   value: "false"
          # - name: VLM_BATCH_SIZE # Applicable only to VILA-1.5 and NVILA models
          #   value: ""
          # - name: ENABLE_VIA_HEALTH_EVAL
          #   value: "true"
          # - name: ENABLE_DENSE_CAPTION
          #   value: "true"
          # - name: VSS_DISABLE_LIVESTREAM_PREVIEW
          #   value: "1"
          # - name: VSS_SKIP_INPUT_MEDIA_VERIFICATION
          #   value: "1"
          # - name: VSS_RTSP_LATENCY
          #   value: "2000"
          # - name: VSS_RTSP_TIMEOUT
          #   value: "2000"
          # - name: VLM_DEFAULT_NUM_FRAMES_PER_CHUNK
          #   value: "8"
          # - name: VLLM_GPU_MEMORY_UTILIZATION
          #   value: "0.4"
          # - name: VLM_SYSTEM_PROMPT
          #   value: "You are a helpful assistant. Answer the users question."
          # - name: ALERT_REVIEW_DEFAULT_VLM_SYSTEM_PROMPT
          #   value: "You are a helpful assistant. Answer the users question with 'yes' or 'no'."
          # - name: CONTEXT_MANAGER_CALL_TIMEOUT
          #   value: "3600"
  resources:
    limits:
      nvidia.com/gpu: 2 # Set to 8 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-1>
  # imagePullSecrets:
  # - name: <imagePullSecretName>
nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
Note
The overrides.yaml file must be created by the user; the Helm chart package does not include it.
The example overrides.yaml file above has nvidia.com/gpu limits set for each service to match the default Helm chart deployment topology.
Change these limits as needed by editing the value of nvidia.com/gpu in the resources/limits section for each of the services (for example, vss, nim-llm, nemo-embedding, and nemo-rerank).
To apply the overrides file while deploying the VSS Helm chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Note
For a list of all configuration options, refer to the VSS Deployment-Time Configuration Glossary.
Optional Deployment Topology with GPU Sharing#
You can achieve the following deployment topology by applying the Helm overrides file:
Note
This topology is supported only on B200, H100, H200, RTX PRO 6000 Blackwell SE, and A100 (80+ GB device memory) GPUs.
| Microservice/Pod | Description | Default #GPU Allocation |
|---|---|---|
| vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 (index: 0,1,2,3) |
| vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is Cosmos-Reason1) + Retrieval Pipeline (CA-RAG) | 4 (index: 4,5,6,7) |
| nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 (index: 4) |
| nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 (index: 4) |
| etcd, milvus, neo4j, minio, arango-db, elastic-search | Various databases, data stores and supporting services | NA |
Use the following Helm overrides.yaml file to change the deployment topology to use GPU sharing:
# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          - name: DISABLE_GUARDRAILS
            value: "false" # "true" to disable guardrails.
          - name: TRT_LLM_MODE
            value: "" # int4_awq (default), int8 or fp16. (for VILA only)
          - name: VLM_BATCH_SIZE
            value: "" # Default is determined based on GPU memory. (for VILA only)
          - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
            value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
          - name: VIA_VLM_ENDPOINT
            value: "" # Default OpenAI API. Override to use a custom API
          - name: VIA_VLM_API_KEY
            value: "" # API key to set when calling VIA_VLM_ENDPOINT
          - name: OPENAI_API_VERSION
            value: ""
          - name: AZURE_OPENAI_API_VERSION
            value: ""
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4,5,6,7"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Wait for VSS to be ready and then Launch VSS UI.
Note
Limitations of assigning GPUs through NVIDIA_VISIBLE_DEVICES:
It bypasses the Kubernetes device plugin resource allocation.
The Helm chart might not work in managed Kubernetes services (for example, AKS) where this environment variable is not allowed.
On Kubernetes clusters where CDI is enabled, this variable is ignored and GPUs are allocated randomly, so GPU sharing does not work as expected.
Configuring GPU Allocation#
The default Helm chart deployment topology is configured for 8xGPUs with each GPU being used by a single service.
To customize the default Helm deployment for various GPU configurations, modify the NVIDIA_VISIBLE_DEVICES environment variable for each of the services in the overrides.yaml file shown below. Additionally, nvidia.com/gpu: 0 must be set to disable GPU allocation by the GPU operator.
# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          - name: DISABLE_GUARDRAILS
            value: "false" # "true" to disable guardrails.
          - name: TRT_LLM_MODE
            value: "" # int4_awq (default), int8 or fp16. (for VILA only)
          - name: VLM_BATCH_SIZE
            value: "" # Default is determined based on GPU memory. (for VILA only)
          - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
            value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
          - name: VIA_VLM_ENDPOINT
            value: "" # Default OpenAI API. Override to use a custom API
          - name: VIA_VLM_API_KEY
            value: "" # API key to set when calling VIA_VLM_ENDPOINT
          - name: OPENAI_API_VERSION
            value: ""
          - name: AZURE_OPENAI_API_VERSION
            value: ""
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4,5,6,7"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
NVIDIA_VISIBLE_DEVICES must be set based on:
The number of GPUs available on the system.
GPU requirements for each service.
When using the VILA-1.5 VLM, VSS requires at least one 80+ GB GPU (A100, H100, H200, B200, RTX PRO 6000 Blackwell SE) or at least two 48 GB GPUs (L40s).
When using Cosmos-Reason1 or NVILA VLM or a remote VLM endpoint, VSS requires at least 1 GPU (A100, H100, H200, B200, RTX PRO 6000 Blackwell SE, L40s).
Embedding and Reranking require 1 GPU each but can share a GPU with VSS on an 80+ GB GPU.
RIVA ASR requires 1 GPU but can share a GPU with Embedding and Reranking on an 80+ GB GPU.
Check NVIDIA NIM for Large Language Models (LLMs) documentation for LLM GPU requirements.
GPUs can be shared even further by using the low memory modes and smaller VLM and LLM models as shown in Fully Local Single GPU Deployment.
If Using External Endpoints for any of the services, those services do not use GPUs and the overall GPU requirements are further reduced.
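When deciding which GPU indices to assign in NVIDIA_VISIBLE_DEVICES, it can help to list the GPUs and their memory on each node first (standard nvidia-smi query):
# List GPU index, name, and total memory on the node
nvidia-smi --query-gpu=index,name,memory.total --format=csv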
Note
For optimal performance on specific hardware platforms, consider using hardware-optimized NIM model profiles. Profiles can vary depending on the number of GPUs and the hardware platform. See NIM Model Profile Optimization for detailed guidance on profile selection and configuration. Important: RTX PRO 6000 users must use NIM version 1.13.1 and the specific profile for llama-3.1-70b-instruct; this is mandatory.
Fully Local Single GPU Deployment#
A single GPU deployment recipe using non-default low memory modes and smaller LLMs is available below. It has been verified on 1xH100, 1xH200, 1xA100 (80 GB+, HBM), 1xB200, and 1xRTX PRO 6000 Blackwell SE machines.
This deployment downloads and runs the VLM, LLM, Embedding, and Reranker models locally on a single GPU.
The configuration:
Sets all services (VSS, LLM, embedding, reranking) to share GPU 0
Enables low memory mode and relaxed memory constraints for the LLM
Uses a smaller LLM model (llama-3.1-8b-instruct) suitable for single GPU deployment
Configures the VSS engine to use Cosmos-Reason1 model for vision tasks
Sets appropriate init containers to ensure services start in the correct order
Note
CV and audio related features are currently not supported in Single GPU deployment.
The following overrides file can be used to deploy the VSS Helm chart on a single GPU.
nim-llm:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.12.0
  llmModel: meta/llama-3.1-8b-instruct
  model:
    name: meta/llama-3.1-8b-instruct
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0"
  - name: NIM_LOW_MEMORY_MODE
    value: "1"
  - name: NIM_RELAX_MEM_CONSTRAINTS
    value: "1"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          image:
            pullPolicy: IfNotPresent
            repository: nvcr.io/nvidia/blueprint/vss-engine
            tag: 2.4.0
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila"
          - name: LLM_MODEL
            value: meta/llama-3.1-8b-instruct
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "0"
          - name: CA_RAG_EMBEDDINGS_DIMENSION
            value: "500"
          - name: VLM_BATCH_SIZE
            value: "32"
          - name: VLLM_GPU_MEMORY_UTILIZATION
            value: "0.3"
          - name: DISABLE_GUARDRAILS
            value: "true"
      initContainers:
      - command:
        - sh
        - -c
        - until nc -z -w 2 milvus-milvus-deployment-milvus-service 19530; do echo
          waiting for milvus; sleep 2; done
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        name: check-milvus-up
      - command:
        - sh
        - -c
        - until nc -z -w 2 neo-4-j-service 7687; do echo waiting for neo4j; sleep
          2; done
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        name: check-neo4j-up
      - args:
        - "while ! curl -s -f -o /dev/null http://nemo-embedding-embedding-deployment-embedding-service:8000/v1/health/live;\
          \ do\n echo \"Waiting for nemo-embedding...\"\n sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        imagePullPolicy: IfNotPresent
        name: check-nemo-embed-up
      - args:
        - "while ! curl -s -f -o /dev/null http://nemo-rerank-ranking-deployment-ranking-service:8000/v1/health/live;\
          \ do\n echo \"Waiting for nemo-rerank...\"\n sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        imagePullPolicy: IfNotPresent
        name: check-nemo-rerank-up
      - args:
        - "while ! curl -s -f -o /dev/null http://llm-nim-svc:8000/v1/health/live;\
          \ do\n echo \"Waiting for LLM...\"\n sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        name: check-llm-up
  llmModel: meta/llama-3.1-8b-instruct
  llmModelChat: meta/llama-3.1-8b-instruct
  configs:
    ca_rag_config.yaml:
      tools:
        summarization_llm:
          type: llm
          params:
            model: meta/llama-3.1-8b-instruct
        chat_llm:
          type: llm
          params:
            model: meta/llama-3.1-8b-instruct
        notification_llm:
          type: llm
          params:
            model: meta/llama-3.1-8b-instruct
    guardrails_config.yaml:
      models:
      - engine: nim
        model: meta/llama-3.1-8b-instruct
        parameters:
          base_url: http://llm-nim-svc:8000/v1
        type: main
      - engine: nim
        model: nvidia/llama-3.2-nv-embedqa-1b-v2
        parameters:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        type: embeddings
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '0'
          - name: NIM_MODEL_PROFILE
            value: "f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" # model profile: fp16-onnx-onnx
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '0'
          - name: NIM_MODEL_PROFILE
            value: "f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" # model profile: fp16-onnx-onnx
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
Note
Guardrails is disabled for single GPU deployment because of accuracy issues with the llama-3.1-8b-instruct model. If required, you can enable it by removing the DISABLE_GUARDRAILS environment variable from the overrides file.
To apply the overrides file while deploying the VSS Helm chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Note
For more information on the LLM NIM version and profiles, refer to NIM Model Profile Optimization.
Wait for VSS to be ready and then Launch VSS UI.
Enabling Audio#
The following overrides are required to enable audio in Summarization and Q&A:
# 4 + 3 + 1 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 5,6,7
# embedding on GPU 4
# reranking on GPU 4
# riva ASR on GPU 4
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          - name: INSTALL_PROPRIETARY_CODECS
            value: "true"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "5,6,7"
          - name: ENABLE_AUDIO
            value: "true"
          - name: ENABLE_RIVA_SERVER_READINESS_CHECK
            value: "true"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
riva:
  enabled: true
  applicationSpecs:
    riva-deployment:
      containers:
        riva-container:
          env:
          - name: NIM_TAGS_SELECTOR
            value: name=parakeet-0-6b-ctc-riva-en-us,mode=all
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0
Things to note in the above overrides file:
The overrides file assumes a specific GPU topology (documented at the beginning of the overrides file).
nvidia.com/gpu is set to 0 for all microservices, which means there is no limit on GPU allocation.
GPU allocation to each microservice is handled using the NVIDIA_VISIBLE_DEVICES environment variable.
The riva microservice is enabled and configured to use GPU 4, which is shared with the VSS, embedding, and reranking pods.
The riva microservice is configured to use the parakeet-0-6b-ctc-riva-en-us model.
Audio is enabled in VSS by setting ENABLE_AUDIO to true.
ENABLE_RIVA_SERVER_READINESS_CHECK is set to true. This enables a readiness check for the Riva ASR server at VSS startup.
Proprietary codecs are enabled in VSS by setting INSTALL_PROPRIETARY_CODECS to true. This installs additional open source and proprietary codecs; review their license terms. This is required for additional audio codec support.
Installing the additional codecs and open source packages requires root permissions. This can be done by setting the following in the overrides file to run the container as root. Alternatively, a Custom Container Image with Codecs Installed can be used.
...
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
...
Note
This has been tested with 8 x H100 GPUs.
To apply the overrides file while deploying the VSS Helm chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Wait for VSS to be ready and then Launch VSS UI.
Enabling CV Pipeline: Set-of-Marks (SOM) and Metadata#
The following overrides are required to enable the CV pipeline:
# 4 + 3 + 1 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          - name: NVIDIA_VISIBLE_DEVICES
            value: "5,6,7"
          - name: INSTALL_PROPRIETARY_CODECS
            value: "true"
          - name: DISABLE_CV_PIPELINE
            value: "false"
          - name: GDINO_INFERENCE_INTERVAL
            value: "1"
          - name: NUM_CV_CHUNKS_PER_GPU
            value: "2"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
Things to note in the above overrides file:
The overrides file assumes a specific GPU topology (documented at the beginning of the overrides file).
NUM_CV_CHUNKS_PER_GPU is set to 2. Lower this to 1 for lower-memory GPUs like L40s.
nvidia.com/gpu is set to 0 for all microservices, which means there is no limit on GPU allocation.
GPU allocation to each microservice is handled using the NVIDIA_VISIBLE_DEVICES environment variable.
The CV pipeline is enabled in VSS by setting DISABLE_CV_PIPELINE to false. This downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
Proprietary codecs are enabled in VSS by setting INSTALL_PROPRIETARY_CODECS to true. This installs additional open source and proprietary codecs; review their license terms. This is required for the Set-of-Marks overlay preview.
Installing the additional codecs and open source packages requires root permissions. This can be done by setting the following in the overrides file to run the container as root. Alternatively, a Custom Container Image with Codecs Installed can be used.
...
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
...
Note
This has been tested with 8 x H100 GPUs.
To apply the overrides file while deploying the VSS Helm chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Next, Wait for VSS to be ready and then Launch VSS UI.
Configuring CA-RAG Configuration#
To customize the CA-RAG configuration, modify the ca_rag_config.yaml section in the overrides file.
More information on the CA-RAG configuration can be found in CA-RAG Configuration.
Note: The endpoints for models are already configured to use the models deployed as part of the Helm Chart.
Here is an example of CA-RAG configuration overrides that switch the ingestion and retriever functions to vector RAG with Elasticsearch as the database.
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
  configs:
    ca_rag_config.yaml:
      functions:
        ingestion_function:
          type: vector_ingestion
          tools:
            db: elasticsearch_db
        retriever_function:
          type: vector_retrieval
          tools:
            db: elasticsearch_db
            reranker: nvidia_reranker
      tools:
        elasticsearch_db:
          type: elasticsearch
          params:
            host: ${ES_HOST}
            port: ${ES_PORT}
          tools:
            embedding: nvidia_embedding
  resources:
    limits:
      nvidia.com/gpu: 2
nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1
nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1
Multi-Node Deployment#
Multi-node deployments can be used in cases where more resources (for example, GPUs) are required than available on a single node. For example, an eight-GPU LLM, six-GPU VLM, one-GPU Embedding, and one-GPU Reranking topology can be deployed on two 8xH100 nodes.
While services can be distributed across multiple nodes, each individual service and its associated containers must run entirely on a single node that has sufficient resources. Services cannot be split across multiple nodes to utilize their combined resources.
By default, Kubernetes schedules the pods based on resource availability automatically.
To explicitly schedule a particular pod on a particular node, first list the node names, and then pass nodeSelector settings to the Helm install command. The following example schedules the VSS, embedding, and reranking services on the second node.
sudo microk8s kubectl get node
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set vss.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>" \
--set nemo-embedding.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>" \
--set nemo-rerank.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>"
This can be done for any pod by adding the nodeSelector to the overrides file.
...
<service-name>:
  nodeSelector:
    "kubernetes.io/hostname": "<Name of Node #2>"
...
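After deploying, you can verify that the pods were scheduled on the intended nodes (standard kubectl output including the node column):
# Show which node each pod is running on
sudo microk8s kubectl get pod -o wide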
Note
If you have issues with multi-node deployment, try the following:
Try setting nodeSelector on each service as shown above when deploying.
Try deleting the existing PVCs before redeploying:
sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc