Deploy Using Helm#
Before proceeding, ensure all prerequisites have been met.
Create Required Secrets#
These secrets allow your Kubernetes applications to securely access NVIDIA resources and your database without hardcoding credentials in your application code or container images.
To deploy the secrets required by the VSS Blueprint:
Note
If using microk8s, prepend the kubectl and helm commands with sudo microk8s. For example, sudo microk8s kubectl ...
For information about joining the microk8s group for admin access (so that sudo is not required), and for other microk8s setup and usage details, review https://microk8s.io/docs/getting-started.
If not using microk8s, you can use kubectl directly. For example, kubectl get pod.
# Export NGC_API_KEY
export NGC_API_KEY=<YOUR_LEGACY_NGC_API_KEY>
# Create credentials for pulling images from NGC (nvcr.io)
sudo microk8s kubectl create secret docker-registry ngc-docker-reg-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=$NGC_API_KEY
# Configure login information for Neo4j graph database
sudo microk8s kubectl create secret generic graph-db-creds-secret \
--from-literal=username=neo4j --from-literal=password=password
# Configure login information for ArangoDB graph database
# Note: Need to keep username as root for ArangoDB to work.
sudo microk8s kubectl create secret generic arango-db-creds-secret \
--from-literal=username=root --from-literal=password=password
# Configure login information for MinIO object storage
sudo microk8s kubectl create secret generic minio-creds-secret \
--from-literal=access-key=minio --from-literal=secret-key=minio123
# Configure the legacy NGC API key for downloading models from NGC
sudo microk8s kubectl create secret generic ngc-api-key-secret \
--from-literal=NGC_API_KEY=$NGC_API_KEY
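To verify that the secrets were created, you can list them (a quick optional check, assuming the default namespace is used):
# Confirm the secrets exist
sudo microk8s kubectl get secret ngc-docker-reg-secret graph-db-creds-secret arango-db-creds-secret minio-creds-secret ngc-api-key-secret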
Deploy the Helm Chart#
To deploy the VSS Blueprint Helm Chart:
# Fetch the VSS Blueprint Helm Chart
sudo microk8s helm fetch \
https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-vss-2.4.0.tgz \
--username='$oauthtoken' --password=$NGC_API_KEY
# Install the Helm Chart
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret
# For B200
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set nim-llm.profile=f17543bf1ee65e4a5c485385016927efe49cbc068a6021573d83eacb32537f76
# For H200
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set nim-llm.profile=99142c13a095af184ae20945a208a81fae8d650ac0fd91747b03148383f882cf
# For RTX Pro 6000 Blackwell
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set nim-llm.image.tag=1.13.1 \
--set nim-llm.profile=f51a862830b10eb7d0d2ba51184d176a0a37674fef85300e4922b924be304e2b
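Optionally, confirm that the release was registered (a quick check; the exact output depends on your Helm version):
# List installed Helm releases and their status
sudo microk8s helm list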
Note
Cosmos-Reason1 7b FP8 (default) is not supported on L40s. Use Cosmos-Reason1 7b FP16 instead by setting MODEL_PATH to git:https://huggingface.co/nvidia/Cosmos-Reason1-7B in the Helm overrides file as shown in Configuration Options.
Note
For more information on the LLM NIM version and profiles, refer to NIM Model Profile Optimization.
Note
When running on an L40 or L40S system, the default startup probe timeout might not be enough for the VILA model to be downloaded and its TRT-LLM engine to be built. To prevent a startup timeout, increase the startup probe failure threshold by passing --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360 to the Helm install command, or by adding the following to the overrides file:
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          startupProbe:
            failureThreshold: 360
Note
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
Audio and CV metadata are not enabled by default. To enable them, refer to Enabling Audio and Enabling CV Pipeline: Set-of-Marks (SOM) and Metadata.
Check the FAQ page for deployment failure scenarios and the corresponding troubleshooting instructions and commands.
Wait for all services to be up. This can take some time (from a few minutes up to an hour) depending on the setup and configuration. Deploying a second time onwards is typically faster because the models are cached. Ensure all pods are in Running or Completed STATUS and show 1/1 as READY. You can monitor the services using the following command:
sudo watch -n1 microk8s kubectl get pod
The watch command refreshes the output every second.
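As an alternative to watching manually, you can block until the pods report Ready (a sketch; adjust the timeout to your setup, and note that pods of one-shot jobs show Completed rather than Ready and may need to be excluded):
# Wait up to one hour for pods in the default namespace to become Ready
sudo microk8s kubectl wait pod --all --for=condition=Ready --timeout=3600s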
To ensure the VSS UI is ready and accessible, check the deployment logs using the following command:
sudo microk8s kubectl logs -l app.kubernetes.io/name=vss
Verify that the following logs are present and that you do not observe errors:
Application startup complete.
Uvicorn running on http://0.0.0.0:9000
If a lot of time has passed since VSS started, kubectl logs might no longer show the older startup logs. In this case, look for recent health check lines like:
INFO: 10.78.15.132:48016 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.78.15.132:50386 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 10.78.15.132:50388 - "GET /health/live HTTP/1.1" 200 OK
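To filter for these lines directly, you can pipe the logs through grep (a convenience sketch using standard kubectl and grep options):
# Show only startup and health-check lines from recent VSS logs
sudo microk8s kubectl logs -l app.kubernetes.io/name=vss --tail=200 | grep -E 'Uvicorn running|health/(ready|live)'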
Uninstalling the Deployment#
To uninstall the deployment, run the following command:
sudo microk8s helm uninstall vss-blueprint
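After uninstalling, you can optionally confirm that the pods have terminated and, if you also want to clear cached models, delete leftover persistent volume claims (the PVC names below are the ones referenced later on this page; verify the names on your cluster first):
# Confirm the VSS pods are gone
sudo microk8s kubectl get pod
# Optional: list and delete leftover PVCs to remove cached models
sudo microk8s kubectl get pvc
sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc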
Default Deployment Topology and Models in Use#
The default deployment topology is as follows. This is the topology you observe when checking deployment status using sudo microk8s kubectl get pod:
| Microservice/Pod | Description | Default #GPU Allocation |
|---|---|---|
| vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 |
| vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is Cosmos-Reason1) + Retrieval Pipeline (CA-RAG) | 2 |
| nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 |
| nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 |
| etcd, milvus, neo4j, minio, arango-db, elastic-search | Various databases, data stores and supporting services | NA |
Launch VSS UI#
Follow these steps to launch the VSS UI:
Find the service ports:
Run the following command to get the service ports. The output of the command can vary for your setup but it should look similar to this:
sudo microk8s kubectl get svc vss-service
# Example output:
# vss-service   NodePort   <CLUSTER_IP>   <none>   8000:32114/TCP,9000:32206/TCP   12m
Identify the NodePorts:
Using the output, identify the NodePorts:
Port 8000 corresponds to the REST API (VSS_API_ENDPOINT); in this example, it is mapped to node port 32114.
Port 9000 corresponds to the UI; in this example, it is mapped to node port 32206.
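You can also extract the NodePorts directly with a JSONPath query (a sketch; service ports 8000 and 9000 are the ones shown in the example output above):
# Print the NodePort mapped to the REST API (service port 8000)
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==8000)].nodePort}'
# Print the NodePort mapped to the UI (service port 9000)
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==9000)].nodePort}'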
Access the VSS UI:
Open your browser and navigate to http://<NODE_IP>:32206. Optionally, the VSS REST API is available at http://<NODE_IP>:32114.
Note
<NODE_IP> is the IP address of the machine where the vss-vss-deployment pod of the Helm Chart is deployed.
Test the deployment by summarizing a sample video.
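To confirm from the command line that the REST API is responding, you can query the health endpoint seen in the logs above (a quick sketch; substitute your node IP and REST API NodePort):
# Expect an HTTP 200 response when VSS is ready
curl -s -o /dev/null -w "%{http_code}\n" http://<NODE_IP>:32114/health/ready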
Continue to Configuration Options and VSS Customization when you are ready to customize VSS.
If you run into any errors, refer to the sections FAQ and Known Issues.
Configuration Options#
Some options in the Helm Chart can be configured using the Helm overrides file.
These options include:
VSS deployment time configurations. More info: VSS Deployment-Time Configuration Glossary.
Changing the default models in the VSS Helm Chart. More info: Plug-and-Play Overview.
An example of the overrides.yaml file:
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          # Update to override with custom VSS image
          # Set imagePullSecrets if the custom image is hosted on a private registry
          image:
            repository: nvcr.io/nvidia/blueprint/vss-engine
            tag: 2.4.0
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          # Specify path in case of VILA-1.5 / NVILA / Cosmos-Reason1 and custom model. Can be either
          # a NGC resource path or a local path. For custom models this
          # must be a path to the directory containing "inference.py" and
          # "manifest.yaml" files.
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          #- name: VILA_ENGINE_NGC_RESOURCE # Enable to use prebuilt engines from NGC
          #  value: "nvidia/blueprint/vss-vlm-prebuilt-engine:2.3.0-vila-1.5-40b-h100-sxm"
          # - name: DISABLE_GUARDRAILS
          #   value: "false" # "true" to disable guardrails.
          # - name: TRT_LLM_MODE
          #   value: "" # int4_awq (default), int8 or fp16. (for VILA only)
          # - name: VLM_BATCH_SIZE
          #   value: "" # Default is determined based on GPU memory. (for VILA only)
          # - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
          #   value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
          # - name: VIA_VLM_ENDPOINT
          #   value: "" # Default OpenAI API. Override to use a custom API
          # - name: VIA_VLM_API_KEY
          #   value: "" # API key to set when calling VIA_VLM_ENDPOINT. Can be set from a secret.
          # - name: OPENAI_API_VERSION
          #   value: ""
          # - name: AZURE_OPENAI_ENDPOINT
          #   value: ""
          # - name: AZURE_OPENAI_API_VERSION
          #   value: ""
          # - name: VSS_LOG_LEVEL
          #   value: "info"
          # - name: VSS_EXTRA_ARGS
          #   value: ""
          # - name: INSTALL_PROPRIETARY_CODECS
          #   value: "true" # Requires root permissions in the container.
          # - name: DISABLE_FRONTEND
          #   value: "false"
          # - name: DISABLE_CA_RAG
          #   value: "false"
          # - name: VLM_BATCH_SIZE # Applicable only to VILA-1.5 and NVILA models
          #   value: ""
          # - name: ENABLE_VIA_HEALTH_EVAL
          #   value: "true"
          # - name: ENABLE_DENSE_CAPTION
          #   value: "true"
          # - name: VSS_DISABLE_LIVESTREAM_PREVIEW
          #   value: "1"
          # - name: VSS_SKIP_INPUT_MEDIA_VERIFICATION
          #   value: "1"
          # - name: VSS_RTSP_LATENCY
          #   value: "2000"
          # - name: VSS_RTSP_TIMEOUT
          #   value: "2000"
          # - name: VLM_DEFAULT_NUM_FRAMES_PER_CHUNK
          #   value: "8"
          # - name: VLLM_GPU_MEMORY_UTILIZATION
          #   value: "0.4"
          # - name: VLM_SYSTEM_PROMPT
          #   value: "You are a helpful assistant. Answer the users question."
          # - name: ALERT_REVIEW_DEFAULT_VLM_SYSTEM_PROMPT
          #   value: "You are a helpful assistant. Answer the users question with 'yes' or 'no'."
          # - name: CONTEXT_MANAGER_CALL_TIMEOUT
          #   value: "3600"
  resources:
    limits:
      nvidia.com/gpu: 2 # Set to 8 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-1>
  # imagePullSecrets:
  # - name: <imagePullSecretName>
nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
Note
The overrides.yaml file must be created by the user; the Helm chart package does not include it.
The example overrides.yaml file above has nvidia.com/gpu limits set for each service to match the default Helm chart deployment topology.
Change these limits as needed by editing the value of nvidia.com/gpu in the resources/limits section for each of the services (for example, vss, nim-llm, nemo-embedding, and nemo-rerank).
To apply the overrides file while deploying the VSS Helm chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Note
For a list of all configuration options, refer to the VSS Deployment-Time Configuration Glossary.
Optional Deployment Topology with GPU Sharing#
You can achieve the following deployment topology by applying the Helm overrides file:
Note
This topology is supported only on B200, H100, H200, RTX PRO 6000 Blackwell SE, and A100 (80+ GB device memory) GPUs.
| Microservice/Pod | Description | Default #GPU Allocation |
|---|---|---|
| vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 (index: 0,1,2,3) |
| vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is Cosmos-Reason1) + Retrieval Pipeline (CA-RAG) | 4 (index: 4,5,6,7) |
| nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 (index: 4) |
| nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 (index: 4) |
| etcd, milvus, neo4j, minio, arango-db, elastic-search | Various databases, data stores and supporting services | NA |
Use the following Helm overrides.yaml file to change the deployment topology to use GPU sharing:
# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          - name: DISABLE_GUARDRAILS
            value: "false" # "true" to disable guardrails.
          - name: TRT_LLM_MODE
            value: "" # int4_awq (default), int8 or fp16. (for VILA only)
          - name: VLM_BATCH_SIZE
            value: "" # Default is determined based on GPU memory. (for VILA only)
          - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
            value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
          - name: VIA_VLM_ENDPOINT
            value: "" # Default OpenAI API. Override to use a custom API
          - name: VIA_VLM_API_KEY
            value: "" # API key to set when calling VIA_VLM_ENDPOINT
          - name: OPENAI_API_VERSION
            value: ""
          - name: AZURE_OPENAI_API_VERSION
            value: ""
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4,5,6,7"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Wait for VSS to be ready and then Launch VSS UI.
Note
Limitations of assigning GPUs through NVIDIA_VISIBLE_DEVICES:
It bypasses the Kubernetes device plugin resource allocation.
The Helm chart might not work in managed Kubernetes services (for example, AKS) where this environment variable is not allowed.
On Kubernetes clusters where CDI is enabled, this variable is ignored and GPUs are allocated randomly, so GPU sharing does not work as expected.
Configuring GPU Allocation#
The default Helm chart deployment topology is configured for 8xGPUs with each GPU being used by a single service.
To customize the default Helm deployment for various GPU configurations, modify the NVIDIA_VISIBLE_DEVICES environment variable for each of the services in the overrides.yaml file shown below. Additionally, nvidia.com/gpu: 0 must be set to disable GPU allocation by the GPU operator.
# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          - name: DISABLE_GUARDRAILS
            value: "false" # "true" to disable guardrails.
          - name: TRT_LLM_MODE
            value: "" # int4_awq (default), int8 or fp16. (for VILA only)
          - name: VLM_BATCH_SIZE
            value: "" # Default is determined based on GPU memory. (for VILA only)
          - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
            value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
          - name: VIA_VLM_ENDPOINT
            value: "" # Default OpenAI API. Override to use a custom API
          - name: VIA_VLM_API_KEY
            value: "" # API key to set when calling VIA_VLM_ENDPOINT
          - name: OPENAI_API_VERSION
            value: ""
          - name: AZURE_OPENAI_API_VERSION
            value: ""
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4,5,6,7"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
NVIDIA_VISIBLE_DEVICES must be set based on:
The number of GPUs available on the system.
GPU requirements for each service.
When using the VILA-1.5 VLM, VSS requires at least one 80+ GB GPU (A100, H100, H200, B200, RTX PRO 6000 Blackwell SE) or at least two 48 GB GPUs (L40s).
When using Cosmos-Reason1 or NVILA VLM or a remote VLM endpoint, VSS requires at least 1 GPU (A100, H100, H200, B200, RTX PRO 6000 Blackwell SE, L40s).
Embedding and Reranking require 1 GPU each but can share a GPU with VSS on an 80+ GB GPU.
RIVA ASR requires 1 GPU but can share a GPU with Embedding and Reranking on an 80+ GB GPU.
Check NVIDIA NIM for Large Language Models (LLMs) documentation for LLM GPU requirements.
GPUs can be shared even further by using the low memory modes and smaller VLM and LLM models as shown in Fully Local Single GPU Deployment.
If Using External Endpoints for any of the services, those services do not use GPUs and the overall GPU requirements are further reduced.
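When deciding which GPU indices to assign in NVIDIA_VISIBLE_DEVICES, it can help to list the GPUs and their memory on each node first (standard nvidia-smi query):
# List GPU index, name, and total memory on the node
nvidia-smi --query-gpu=index,name,memory.total --format=csv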
Note
For optimal performance on specific hardware platforms, consider using hardware-optimized NIM model profiles. Profiles can vary depending on the number of GPUs and the hardware platform. See NIM Model Profile Optimization for detailed guidance on profile selection and configuration. Important: RTX PRO 6000 users must use NIM version 1.13.1 and the specific profile for llama-3.1-70b-instruct; this is mandatory.
Fully Local Single GPU Deployment#
A single GPU deployment recipe using non-default low memory modes and smaller LLMs is available below. It has been verified on 1xH100, 1xH200, 1xA100 (80 GB+, HBM), 1xB200, and 1xRTX PRO 6000 Blackwell SE machines.
This deployment downloads and runs the VLM, LLM, Embedding, and Reranker models locally on a single GPU.
The configuration:
Sets all services (VSS, LLM, embedding, reranking) to share GPU 0
Enables low memory mode and relaxed memory constraints for the LLM
Uses a smaller LLM model (llama-3.1-8b-instruct) suitable for single GPU deployment
Configures the VSS engine to use Cosmos-Reason1 model for vision tasks
Sets appropriate init containers to ensure services start in the correct order
Note
CV and audio related features are currently not supported in Single GPU deployment.
The following overrides file can be used to deploy the VSS Helm chart on a single GPU.
nim-llm:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.12.0
  llmModel: meta/llama-3.1-8b-instruct
  model:
    name: meta/llama-3.1-8b-instruct
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0"
  - name: NIM_LOW_MEMORY_MODE
    value: "1"
  - name: NIM_RELAX_MEM_CONSTRAINTS
    value: "1"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          image:
            pullPolicy: IfNotPresent
            repository: nvcr.io/nvidia/blueprint/vss-engine
            tag: 2.4.0
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila"
          - name: LLM_MODEL
            value: meta/llama-3.1-8b-instruct
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "0"
          - name: CA_RAG_EMBEDDINGS_DIMENSION
            value: "500"
          - name: VLM_BATCH_SIZE
            value: "32"
          - name: VLLM_GPU_MEMORY_UTILIZATION
            value: "0.3"
          - name: DISABLE_GUARDRAILS
            value: "true"
      initContainers:
      - command:
        - sh
        - -c
        - until nc -z -w 2 milvus-milvus-deployment-milvus-service 19530; do echo
          waiting for milvus; sleep 2; done
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        name: check-milvus-up
      - command:
        - sh
        - -c
        - until nc -z -w 2 neo-4-j-service 7687; do echo waiting for neo4j; sleep
          2; done
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        name: check-neo4j-up
      - args:
        - "while ! curl -s -f -o /dev/null http://nemo-embedding-embedding-deployment-embedding-service:8000/v1/health/live;\
          \ do\n echo \"Waiting for nemo-embedding...\"\n sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        imagePullPolicy: IfNotPresent
        name: check-nemo-embed-up
      - args:
        - "while ! curl -s -f -o /dev/null http://nemo-rerank-ranking-deployment-ranking-service:8000/v1/health/live;\
          \ do\n echo \"Waiting for nemo-rerank...\"\n sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        imagePullPolicy: IfNotPresent
        name: check-nemo-rerank-up
      - args:
        - "while ! curl -s -f -o /dev/null http://llm-nim-svc:8000/v1/health/live;\
          \ do\n echo \"Waiting for LLM...\"\n sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        name: check-llm-up
  llmModel: meta/llama-3.1-8b-instruct
  llmModelChat: meta/llama-3.1-8b-instruct
  configs:
    ca_rag_config.yaml:
      tools:
        summarization_llm:
          type: llm
          params:
            model: meta/llama-3.1-8b-instruct
        chat_llm:
          type: llm
          params:
            model: meta/llama-3.1-8b-instruct
        notification_llm:
          type: llm
          params:
            model: meta/llama-3.1-8b-instruct
    guardrails_config.yaml:
      models:
      - engine: nim
        model: meta/llama-3.1-8b-instruct
        parameters:
          base_url: http://llm-nim-svc:8000/v1
        type: main
      - engine: nim
        model: nvidia/llama-3.2-nv-embedqa-1b-v2
        parameters:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        type: embeddings
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '0'
          - name: NIM_MODEL_PROFILE
            value: "f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" # model profile: fp16-onnx-onnx
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '0'
          - name: NIM_MODEL_PROFILE
            value: "f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" # model profile: fp16-onnx-onnx
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
Note
Guardrails is disabled for single GPU deployment because of accuracy issues with the llama-3.1-8b-instruct model. If required, you can enable it by removing the DISABLE_GUARDRAILS environment variable from the overrides file.
To apply the overrides file while deploying the VSS Helm chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Note
For more information on the LLM NIM version and profiles, refer to NIM Model Profile Optimization.
Wait for VSS to be ready and then Launch VSS UI.
Enabling Audio#
The following overrides are required to enable audio in Summarization and Q&A:
# 4 + 3 + 1 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 5,6,7
# embedding on GPU 4
# reranking on GPU 4
# riva ASR on GPU 4
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          - name: INSTALL_PROPRIETARY_CODECS
            value: "true"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "5,6,7"
          - name: ENABLE_AUDIO
            value: "true"
          - name: ENABLE_RIVA_SERVER_READINESS_CHECK
            value: "true"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
riva:
  enabled: true
  applicationSpecs:
    riva-deployment:
      containers:
        riva-container:
          env:
          - name: NIM_TAGS_SELECTOR
            value: name=parakeet-0-6b-ctc-riva-en-us,mode=all
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0
Things to note in the above overrides file:
The overrides file assumes a specific GPU topology (documented at the beginning of the overrides file).
nvidia.com/gpu is set to 0 for all microservices, which means there is no limit on GPU allocation.
GPU allocation to each microservice is handled using the NVIDIA_VISIBLE_DEVICES environment variable.
The riva microservice is enabled and configured to use GPU 4, which is shared with the VSS, embedding, and reranking pods.
The riva microservice is configured to use the parakeet-0-6b-ctc-riva-en-us model.
Audio is enabled in VSS by setting ENABLE_AUDIO to true.
ENABLE_RIVA_SERVER_READINESS_CHECK is set to true. This enables a readiness check for the Riva ASR server at VSS startup.
Proprietary codecs are enabled in VSS by setting INSTALL_PROPRIETARY_CODECS to true. This installs additional open source and proprietary codecs; review their license terms. This is required for additional audio codec support.
Installing the additional codecs and open source packages requires root permissions. This can be done by setting the following in the overrides file to run the container as root. Alternatively, a Custom Container Image with Codecs Installed can be used.
...
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
...
Note
This has been tested with 8 x H100 GPUs.
To apply the overrides file while deploying the VSS Helm chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Wait for VSS to be ready and then Launch VSS UI.
Enabling CV Pipeline: Set-of-Marks (SOM) and Metadata#
The following overrides are required to enable the CV pipeline:
# 4 + 3 + 1 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1 # Or "openai-compat" or "custom" or "nvila" or "vila-1.5"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic"
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
          - name: NVIDIA_VISIBLE_DEVICES
            value: "5,6,7"
          - name: INSTALL_PROPRIETARY_CODECS
            value: "true"
          - name: DISABLE_CV_PIPELINE
            value: "false"
          - name: GDINO_INFERENCE_INTERVAL
            value: "1"
          - name: NUM_CV_CHUNKS_PER_GPU
            value: "2"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
Things to note in the above overrides file:
The overrides file assumes a specific GPU topology (documented at the beginning of the overrides file).
NUM_CV_CHUNKS_PER_GPU is set to 2. Lower this to 1 for lower-memory GPUs like L40s.
nvidia.com/gpu is set to 0 for all microservices, which means there is no limit on GPU allocation.
GPU allocation to each microservice is handled using the NVIDIA_VISIBLE_DEVICES environment variable.
The CV pipeline is enabled in VSS by setting DISABLE_CV_PIPELINE to false. This downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
Proprietary codecs are enabled in VSS by setting INSTALL_PROPRIETARY_CODECS to true. This installs additional open source and proprietary codecs; review their license terms. This is required for the Set-of-Marks overlay preview.
Installing the additional codecs and open source packages requires root permissions. This can be done by setting the following in the overrides file to run the container as root. Alternatively, a Custom Container Image with Codecs Installed can be used.
...
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
...
Note
This has been tested with 8 x H100 GPUs.
To apply the overrides file while deploying the VSS Helm chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Next, Wait for VSS to be ready and then Launch VSS UI.
Configuring CA-RAG Configuration#
To customize the CA-RAG configuration, modify the ca_rag_config.yaml section in the overrides file.
More information on the CA-RAG configuration can be found in CA-RAG Configuration.
Note: The endpoints for models are already configured to use the models deployed as part of the Helm Chart.
Here is an example of CA-RAG configuration overrides that switch the ingestion and retriever functions to vector RAG with Elasticsearch as the database.
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: cosmos-reason1
          - name: LLM_MODEL
            value: meta/llama-3.1-70b-instruct
  configs:
    ca_rag_config.yaml:
      functions:
        ingestion_function:
          type: vector_ingestion
          tools:
            db: elasticsearch_db
        retriever_function:
          type: vector_retrieval
          tools:
            db: elasticsearch_db
            reranker: nvidia_reranker
      tools:
        elasticsearch_db:
          type: elasticsearch
          params:
            host: ${ES_HOST}
            port: ${ES_PORT}
          tools:
            embedding: nvidia_embedding
  resources:
    limits:
      nvidia.com/gpu: 2
nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1
nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1
Multi-Node Deployment#
Multi-node deployments can be used in cases where more resources (for example, GPUs) are required than available on a single node. For example, an eight-GPU LLM, six-GPU VLM, one-GPU Embedding, and one-GPU Reranking topology can be deployed on two 8xH100 nodes.
While services can be distributed across multiple nodes, each individual service and its associated containers must run entirely on a single node that has sufficient resources. Services cannot be split across multiple nodes to utilize their combined resources.
By default, Kubernetes schedules the pods based on resource availability automatically.
To explicitly schedule a particular pod on a particular node, first list the node names, and then pass nodeSelector settings to the Helm install command. The following example schedules the VSS, embedding, and reranking services on the second node.
sudo microk8s kubectl get node
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set vss.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>" \
--set nemo-embedding.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>" \
--set nemo-rerank.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>"
This can be done for any pod by adding the nodeSelector to the overrides file.
...
<service-name>:
  nodeSelector:
    "kubernetes.io/hostname": "<Name of Node #2>"
...
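After deploying, you can verify that the pods were scheduled on the intended nodes (standard kubectl output including the node column):
# Show which node each pod is running on
sudo microk8s kubectl get pod -o wide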
Note
If you have issues with multi-node deployment, try the following:
Try setting nodeSelector on each service as shown above when deploying.
Try deleting the existing PVCs before redeploying:
sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc