Deploy Using Helm#

Before proceeding, make sure all of the prerequisites have been met.

Create Required Secrets#

These secrets allow your Kubernetes applications to securely access NVIDIA resources and your database without hardcoding credentials in your application code or container images.

To deploy the secrets required by the VSS Blueprint:

Note

If using microk8s, prepend the kubectl commands with sudo microk8s. For example, sudo microk8s kubectl .... For information on joining the group for admin access (so you can avoid sudo) and other microk8s setup/usage details, see https://microk8s.io/docs/getting-started. If not using microk8s, you can use kubectl directly. For example, kubectl get pod.

# Export NGC_API_KEY

export NGC_API_KEY=<YOUR_LEGACY_NGC_API_KEY>

# Create credentials for pulling images from NGC (nvcr.io)

sudo microk8s kubectl create secret docker-registry ngc-docker-reg-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=$NGC_API_KEY

# Configure login information for Neo4j graph database

sudo microk8s kubectl create secret generic graph-db-creds-secret \
    --from-literal=username=neo4j --from-literal=password=password

# Configure the legacy NGC API key for downloading models from NGC

sudo microk8s kubectl create secret generic ngc-api-key-secret \
--from-literal=NGC_API_KEY=$NGC_API_KEY
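
To verify that the three secrets were created, list them by name (a quick sanity check; the names match the commands above):

sudo microk8s kubectl get secret ngc-docker-reg-secret graph-db-creds-secret ngc-api-key-secret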

Deploy the Helm Chart#

To deploy the VSS Blueprint Helm Chart:

# Fetch the VSS Blueprint Helm Chart

sudo microk8s helm fetch \
    https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-vss-2.3.0.tgz \
    --username='$oauthtoken' --password=$NGC_API_KEY

# Install the Helm Chart

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
    --set global.ngcImagePullSecretName=ngc-docker-reg-secret
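
To confirm that the release was registered, list the installed Helm releases (the release name vss-blueprint comes from the command above):

sudo microk8s helm list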

Note

When running on an L40 / L40S system, the default startup probe timeout might not be enough for the VILA model to be downloaded and its TRT-LLM engine to be built. To prevent a startup timeout, increase the threshold by passing --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360 to the helm install command or by adding the following to the overrides file.

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          startupProbe:
            failureThreshold: 360
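
For reference, the full install command with the increased startup timeout (the install command above combined with the --set flag from this note) is:

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
    --set global.ngcImagePullSecretName=ngc-docker-reg-secret \
    --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360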

Note

This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.

Note

Audio and CV metadata are not enabled by default. To enable them, see Enabling Audio and Enabling CV Pipeline: Set-Of-Marks (SOM) & Metadata.

Note

Check the FAQ page for deployment failure scenarios and the corresponding troubleshooting instructions and commands.

Wait for all services to be up. This may take some time (a few minutes to up to an hour) depending on the setup and configuration. Subsequent deployments are faster because the models are cached. Make sure all pods are in Running or Completed STATUS and show 1/1 as READY. You can monitor the services using the following command:

sudo watch -n1 microk8s kubectl get pod

watch refreshes the output every second. Wait for all pods to be in Running or Completed STATUS and to show 1/1 as READY.
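
Alternatively, you can block until the VSS pod reports Ready instead of watching manually. A minimal sketch using the same label selector as the log command below; the 60-minute timeout is an assumption, adjust it for your hardware and network:

sudo microk8s kubectl wait pod -l app.kubernetes.io/name=vss --for=condition=Ready --timeout=3600s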

To make sure the VSS UI is ready and accessible, check the logs for the deployment using the following command:

sudo microk8s kubectl logs -l app.kubernetes.io/name=vss

Make sure the following logs are present and that no errors appear:

Application startup complete.
Uvicorn running on http://0.0.0.0:9000

If a lot of time has passed since VSS started, kubectl logs might no longer show the older log lines. In that case, look for:

INFO:     10.78.15.132:48016 - "GET /health/ready HTTP/1.1" 200 OK
INFO:     10.78.15.132:50386 - "GET /health/ready HTTP/1.1" 200 OK
INFO:     10.78.15.132:50388 - "GET /health/live HTTP/1.1" 200 OK

Uninstalling the deployment#

To uninstall the deployment, run the following command:

sudo microk8s helm uninstall vss-blueprint
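
Helm uninstall can leave persistent volume claims behind (for example, PVCs created by StatefulSets), which keeps downloaded models cached for the next install. To check for and optionally remove them (the PVC names below are the defaults also mentioned in the multi-node troubleshooting note):

sudo microk8s kubectl get pvc
sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc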

Default Deployment Topology and Models in Use#

The default deployment topology is as follows.

This is the topology that you see when checking deployment status using sudo microk8s kubectl get pod.

Microservice/Pod | Description | Default #GPU Allocation
vss-blueprint-0 | The NIM LLM (llama-3.1). | 4
vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is VILA1.5) + Retrieval Pipeline (CA-RAG) | 2
nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1
nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1
etcd-etcd-deployment, milvus-milvus-deployment, neo4j-neo4j-deployment | Milvus and Neo4j databases used for vector and graph storage. | NA

Launch VSS UI#

Find the ports where the VSS REST API and UI server are running:

sudo microk8s kubectl get svc vss-service

NAME         TYPE      CLUSTER-IP    EXTERNAL-IP  PORT(S)                        AGE
vss-service  NodePort  <CLUSTER_IP>  <none>       8000:32114/TCP,9000:32206/TCP  12m

The NodePorts corresponding to ports 8000 and 9000 are for the REST API (VSS_API_ENDPOINT) and the UI, respectively. The VSS UI is available at the NodePort corresponding to 9000. In this example, the VSS UI can be accessed by opening http://<NODE_IP>:32206 in a browser, and the VSS REST API address is http://<NODE_IP>:32114.
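
You can also verify that the REST API is responding before opening the UI. A minimal check, assuming the /health/ready endpoint seen in the VSS logs is served by the REST API and using the example NodePort above:

curl http://<NODE_IP>:32114/health/ready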

Note

The <NODE_IP> is the IP address of the machine where the Helm Chart (specifically, the pod named vss-vss-deployment) is deployed.

Next, test the deployment by summarizing a sample video. Continue to Configuration Options and VSS Customization when you are ready to customize VSS.

If you run into any errors, refer to the FAQ and Known Issues sections.

Configuration Options#

Some options in the Helm Chart can be configured using the Helm overrides file.

These options include:

  1. VSS deployment time configurations. More info: VSS Deployment-Time Configuration Glossary.

  2. Changing the default models in the VSS helm chart. More info: Plug-and-Play Overview.

  3. CV Pipeline Customization.

An example of the overrides.yaml file:

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          # Update to override with custom VSS image
          # Set imagePullSecrets if the custom image is hosted on a private registry
          image: 
            repository: nvcr.io/nvidia/blueprint/vss-engine
            tag: 2.3.0
          env:
            - name: VLM_MODEL_TO_USE
              value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
            # Specify path in case of VILA-1.5 and custom model. Can be either
            # a NGC resource path or a local path. For custom models this
            # must be a path to the directory containing "inference.py" and
            # "manifest.yaml" files.
            - name: MODEL_PATH
              value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
            #- name: VILA_ENGINE_NGC_RESOURCE  # Enable to use prebuilt engines from NGC
            #  value: "nvidia/blueprint/vss-vlm-prebuilt-engine:2.3.0-vila-1.5-40b-h100-sxm"
            # - name: DISABLE_GUARDRAILS
            #   value: "false" # "true" to disable guardrails.
            # - name: TRT_LLM_MODE
            #   value: ""  # int4_awq (default), int8 or fp16. (for VILA only)
            # - name: VLM_BATCH_SIZE
            #   value: ""  # Default is determined based on GPU memory. (for VILA only)
            # - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
            #   value: ""  # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
            # - name: VIA_VLM_ENDPOINT
            #   value: ""  # Default OpenAI API. Override to use a custom API
            # - name: VIA_VLM_API_KEY
            #   value: ""  # API key to set when calling VIA_VLM_ENDPOINT. Can be set from a secret.
            # - name: OPENAI_API_VERSION
            #   value: ""
            # - name: AZURE_OPENAI_ENDPOINT
            #   value: ""
            # - name: AZURE_OPENAI_API_VERSION
            #   value: ""
            # - name: VSS_LOG_LEVEL
            #   value: "info"
            # - name: VSS_EXTRA_ARGS
            #   value: ""
            # - name: INSTALL_PROPRIETARY_CODECS
            #   value: "true"  # Requires root permissions in the container.
            # - name: DISABLE_FRONTEND
            #   value: "false"
            # - name: DISABLE_CA_RAG
            #   value: "false"
            # - name: VLM_BATCH_SIZE # Applicable only to VILA-1.5 and NVILA models
            #   value: ""
            # - name: ENABLE_VIA_HEALTH_EVAL
            #   value: "true"
            # - name: ENABLE_DENSE_CAPTION
            #   value: "true"
            # - name: VSS_DISABLE_LIVESTREAM_PREVIEW
            #   value: "1"
            # - name: VSS_SKIP_INPUT_MEDIA_VERIFICATION
            #   value: "1"
            # - name: VSS_RTSP_LATENCY
            #   value: "2000"
            # - name: VSS_RTSP_TIMEOUT
            #   value: "2000"
  resources:
    limits:
      nvidia.com/gpu: 2   # Set to 8 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-1>

  # imagePullSecrets:
  #   - name: <imagePullSecretName>

nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>

nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1  # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>

nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1  # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>

Note

The overrides.yaml file must be created. The helm chart package does not include it.

Note

The overrides.yaml file provided by default has nvidia.com/gpu limits set for each service to match the default Helm Chart deployment topology. Change these limits as needed by editing the value of nvidia.com/gpu in the resources / limits section of the yaml file for each of the services: vss, nim-llm, nemo-embedding, and nemo-rerank.

To apply the overrides file while deploying the VSS Helm Chart, run:

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
    --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
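
If the chart is already installed and you only changed the overrides file, a helm upgrade can apply the new values without uninstalling first. This is a sketch; upgrading restarts the affected pods and may trigger model re-downloads:

sudo microk8s helm upgrade vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
    --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml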

Note

For an exhaustive list of all configuration options, see VSS Deployment-Time Configuration Glossary.

Optional Deployment Topology with GPU sharing#

You can achieve the deployment topology below by applying the Helm overrides file.

Note

This is only for H100 / H200 / A100 (80+ GB device memory).

Microservice/Pod | Description | Default #GPU Allocation
vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 (index: 0,1,2,3)
vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is VILA1.5) + Retrieval Pipeline (CA-RAG) | 4 (index: 4,5,6,7)
nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 (index: 4)
nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 (index: 4)
etcd-etcd-deployment, milvus-milvus-deployment, neo4j-neo4j-deployment | Milvus and Neo4j databases used for vector and graph storage. | NA

The following Helm overrides.yaml file changes the deployment topology to use GPU sharing:

# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4

nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
          - name: DISABLE_GUARDRAILS
            value: "false" # "true" to disable guardrails.
          - name: TRT_LLM_MODE
            value: ""  # int4_awq (default), int8 or fp16. (for VILA only)
          - name: VLM_BATCH_SIZE
            value: ""  # Default is determined based on GPU memory. (for VILA only)
          - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
            value: ""  # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
          - name: VIA_VLM_ENDPOINT
            value: ""  # Default OpenAI API. Override to use a custom API
          - name: VIA_VLM_API_KEY
            value: ""  # API key to set when calling VIA_VLM_ENDPOINT
          - name: OPENAI_API_VERSION
            value: ""
          - name: AZURE_OPENAI_API_VERSION
            value: ""
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4,5,6,7"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit


nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

To apply the overrides file while deploying the VSS Helm Chart, run:

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml

Next, Wait for VSS to be ready and then Launch VSS UI.

Note

Limitations

  • This approach bypasses the Kubernetes device plugin resource allocation.

  • The Helm Chart might not work in managed Kubernetes services (e.g. AKS) where this environment variable is not allowed.

  • On Kubernetes clusters where CDI is enabled, this variable is ignored and GPUs are allocated randomly, so GPU sharing will not work as expected.
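
Because this topology assigns GPUs through NVIDIA_VISIBLE_DEVICES rather than the device plugin, it can be useful to confirm which GPUs a pod actually sees once it is running. A minimal check against the VSS pod, assuming the deployment is named vss-vss-deployment as in the topology table and that nvidia-smi is available inside the container:

sudo microk8s kubectl exec deploy/vss-vss-deployment -- nvidia-smi -L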

Configuring GPU Allocation#

As mentioned earlier, the default Helm Chart deployment topology is configured for 8 GPUs, with each GPU being used by a single service.

To customize the default Helm deployment for various GPU configurations, modify the NVIDIA_VISIBLE_DEVICES environment variable for each of the services in the overrides.yaml file as shown below. Additionally, nvidia.com/gpu: 0 must be set to disable GPU allocation by the GPU Operator.

# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4

nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
          - name: DISABLE_GUARDRAILS
            value: "false" # "true" to disable guardrails.
          - name: TRT_LLM_MODE
            value: ""  # int4_awq (default), int8 or fp16. (for VILA only)
          - name: VLM_BATCH_SIZE
            value: ""  # Default is determined based on GPU memory. (for VILA only)
          - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
            value: ""  # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
          - name: VIA_VLM_ENDPOINT
            value: ""  # Default OpenAI API. Override to use a custom API
          - name: VIA_VLM_API_KEY
            value: ""  # API key to set when calling VIA_VLM_ENDPOINT
          - name: OPENAI_API_VERSION
            value: ""
          - name: AZURE_OPENAI_API_VERSION
            value: ""
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4,5,6,7"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit


nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

NVIDIA_VISIBLE_DEVICES must be set based on:

  • The number of GPUs available on the system (a quick inventory check is sketched after this list).

  • GPU requirements for each service.

    • When using the VILA-1.5 VLM, VSS requires at least 1 GPU with 80+ GB of memory (A100, H100, H200) or at least 2 GPUs with 48 GB of memory (L40S).

    • When using the NVILA VLM or a remote VLM endpoint, VSS requires at least 1 GPU (A100, H100, H200, L40S).

    • Embedding and Reranking require 1 GPU each but can share a GPU with VSS on an 80+ GB GPU.

    • Check NVIDIA NIM for Large Language Models (LLMs) documentation for LLM GPU requirements.

  • GPUs can be shared even further by using the low memory modes and smaller VLM and LLM models as shown in Fully Local Single GPU Deployment.

  • If using external endpoints for any of the services (see Using External Endpoints), those services will not use GPUs and the overall GPU requirements are further reduced.
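
A quick way to inventory the GPUs on a node before choosing NVIDIA_VISIBLE_DEVICES values (standard nvidia-smi query options, run on the node that will host the services):

nvidia-smi --query-gpu=index,name,memory.total --format=csv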

Fully Local Single GPU Deployment#

A Single GPU Deployment recipe using non-default low memory modes and smaller LLMs, verified on 1 x H100, 1 x H200, and 1 x A100 (80 GB+, HBM) machines, is available below.

This deployment downloads and runs the VLM, LLM, Embedding, and Reranker models locally on a single GPU.

The configuration:

  • Sets all services (VSS, LLM, embedding, reranking) to share GPU 0

  • Enables low memory mode and relaxed memory constraints for the LLM

  • Uses a smaller LLM model (llama-3.1-8b-instruct) suitable for single GPU deployment

  • Configures the VSS engine to use NVILA model for vision tasks

  • Sets appropriate init containers to ensure services start in the correct order

Note

CV and audio-related features are currently not supported in the Single GPU deployment.

The following overrides file can be used to deploy the VSS Helm Chart on a single GPU.

nim-llm:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.3.3
  llmModel: meta/llama-3.1-8b-instruct
  model:
    name: meta/llama-3.1-8b-instruct
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0"
  - name: NIM_LOW_MEMORY_MODE
    value: "1"
  - name: NIM_RELAX_MEM_CONSTRAINTS
    value: "1"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          image:
            pullPolicy: IfNotPresent
            repository: nvcr.io/nvidia/blueprint/vss-engine
            tag: 2.3.0
          env:
          - name: VLM_MODEL_TO_USE
            value: nvila # Or "openai-compat" or "custom" or "nvila"
          - name: MODEL_PATH
            value: "git:https://huggingface.co/Efficient-Large-Model/NVILA-15B"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "0"
          - name: CA_RAG_EMBEDDINGS_DIMENSION
            value: "500"
      initContainers:
      - command:
        - sh
        - -c
        - until nc -z -w 2 milvus-milvus-deployment-milvus-service 19530; do echo
          waiting for milvus; sleep 2; done
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        name: check-milvus-up
      - command:
        - sh
        - -c
        - until nc -z -w 2 neo-4-j-service 7687; do echo waiting for neo4j; sleep
          2; done
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        name: check-neo4j-up
      - args:
        - "while ! curl -s -f -o /dev/null http://nemo-embedding-embedding-deployment-embedding-service:8000/v1/health/live;\
          \ do\n  echo \"Waiting for nemo-embedding...\"\n  sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        imagePullPolicy: IfNotPresent
        name: check-nemo-embed-up
      - args:
        - "while ! curl -s -f -o /dev/null http://nemo-rerank-ranking-deployment-ranking-service:8000/v1/health/live;\
          \ do\n  echo \"Waiting for nemo-rerank...\"\n  sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        imagePullPolicy: IfNotPresent
        name: check-nemo-rerank-up
      - args:
        - "while ! curl -s -f -o /dev/null http://llm-nim-svc:8000/v1/health/live;\
          \ do\n  echo \"Waiting for LLM...\"\n  sleep 2\ndone\n"
        command:
        - sh
        - -c
        image: curlimages/curl:latest
        name: check-llm-up
  llmModel: meta/llama-3.1-8b-instruct
  llmModelChat: meta/llama-3.1-8b-instruct
  configs:
    ca_rag_config.yaml:
      chat:
        llm:
          model: meta/llama-3.1-8b-instruct
      notification:
        llm:
          model: meta/llama-3.1-8b-instruct
      summarization:
        llm:
          model: meta/llama-3.1-8b-instruct
    guardrails_config.yaml:
      models:
      - engine: nim
        model: meta/llama-3.1-8b-instruct
        parameters:
          base_url: http://llm-nim-svc:8000/v1
        type: main
      - engine: nim
        model: nvidia/llama-3.2-nv-embedqa-1b-v2
        parameters:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        type: embeddings
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit


nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '0'
          - name: NIM_MODEL_PROFILE
            value: "f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" # model profile: fp16-onnx-onnx
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '0'
          - name: NIM_MODEL_PROFILE
            value: "f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" # model profile: fp16-onnx-onnx
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

To apply the overrides file while deploying the VSS Helm Chart, run:

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml

Next, Wait for VSS to be ready and then Launch VSS UI.
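
Because all of the models share a single GPU in this recipe, it can be helpful to watch GPU memory on the host while the services start up (standard nvidia-smi usage):

watch -n5 nvidia-smi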

Enabling Audio#

The following overrides are required to enable audio in Summarization and Q&A:

# 4 + 3 + 1 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 5,6,7
# embedding on GPU 4
# reranking on GPU 4
# riva ASR on GPU 4

nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
          - name: INSTALL_PROPRIETARY_CODECS
            value: "true"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "5,6,7"
          - name: ENABLE_AUDIO
            value: "true"
          - name: ENABLE_RIVA_SERVER_READINESS_CHECK
            value: "true"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit


nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

riva:
  enabled: true
  applicationSpecs:
    riva-deployment:
      containers:
        riva-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NIM_HTTP_API_PORT
            value: '9000'
          - name: NIM_GRPC_API_PORT
            value: '50051'
          - name: NIM_TAGS_SELECTOR
            value: name=parakeet-0-6b-ctc-riva-en-us,mode=all
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0

Things to note in the above overrides file:

  • The overrides file assumes a specific GPU topology (documented at the beginning of the overrides file).

  • nvidia.com/gpu is set to 0 for all microservices, which means no limit on GPU allocation.

  • GPU allocation for each microservice is handled using the NVIDIA_VISIBLE_DEVICES environment variable.

  • The riva microservice is enabled and configured to use GPU 4, which is shared with the VSS, embedding, and reranking pods.

  • The riva microservice is configured to use the parakeet-0-6b-ctc-riva-en-us model.

  • VSS has audio enabled by setting ENABLE_AUDIO to true.

  • ENABLE_RIVA_SERVER_READINESS_CHECK is set to true. This enables a readiness check for the Riva ASR server at VSS startup.

  • VSS has proprietary codecs enabled by setting INSTALL_PROPRIETARY_CODECS to true. This installs additional open source and proprietary codecs; review their license terms. This is required for additional audio codec support.

Installing the additional codecs and open source packages requires root permissions. This can be done by setting the following in the overrides file to run the container as root. Alternatively, a Custom Container Image with Codecs Installed can be used.

...
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
...

Note

This has been tested with 8 x H100 GPUs.

To apply the overrides file while deploying the VSS Helm Chart, run:

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml

Next, Wait for VSS to be ready and then Launch VSS UI.
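
To confirm that the Riva ASR pod has come up alongside the other services, you can filter the pod list by name (the exact pod name is generated by the chart):

sudo microk8s kubectl get pod | grep -i riva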

Enabling CV Pipeline: Set-Of-Marks (SOM) & Metadata#

The following overrides are required to enable the CV pipeline:

# 4 + 3 + 1 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 5,6,7
# embedding on GPU 4
# reranking on GPU 4


nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "0,1,2,3"
  - name: NIM_MAX_MODEL_LEN
    value: "128000"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit


vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
          - name: MODEL_PATH
            value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "5,6,7"
          - name: INSTALL_PROPRIETARY_CODECS
            value: "true"
          - name: DISABLE_CV_PIPELINE
            value: "false"
          - name: GDINO_INFERENCE_INTERVAL
            value: "1"
          - name: NUM_CV_CHUNKS_PER_GPU
            value: "2"
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit



nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit


nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0    # no limit

Things to note in the above overrides file:

  • The overrides file assumes a specific GPU topology (documented at the beginning of the overrides file).

  • nvidia.com/gpu is set to 0 for all microservices, which means no limit on GPU allocation.

  • GPU allocation for each microservice is handled using the NVIDIA_VISIBLE_DEVICES environment variable.

  • VSS has the CV pipeline enabled by setting DISABLE_CV_PIPELINE to false. This downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.

  • VSS has proprietary codecs enabled by setting INSTALL_PROPRIETARY_CODECS to true. This installs additional open source and proprietary codecs; review their license terms. This is required for the Set-Of-Marks overlay preview.

Installing the additional codecs and open source packages requires root permissions. This can be done by setting the following in the overrides file to run the container as root. Alternatively, a Custom Container Image with Codecs Installed can be used.

...
vss:
  applicationSpecs:
    vss-deployment:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0
...

Note

This has been tested with 8 x H100 GPUs.

To apply the overrides file while deploying the VSS Helm Chart, run:

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
    --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml

Next, Wait for VSS to be ready and then Launch VSS UI.

Multi-Node Deployment#

Multi-node deployments can be used in cases where more resources (for example, GPUs) are required than are available on a single node. For example, an 8 x LLM GPU + 6 x VLM GPU + 1 x Embedding + 1 x Reranking topology can be deployed on two 8xH100 nodes.

While services can be distributed across multiple nodes, each individual service and its associated containers must run entirely on a single node that has sufficient resources. Services cannot be split across multiple nodes to utilize their combined resources.

By default, Kubernetes schedules the pods based on resource availability automatically.

To explicitly schedule a particular pod on a particular node, first list the nodes and then pass nodeSelector overrides to the helm install command. The following example schedules the VSS, embedding, and reranking services on the second node.

sudo microk8s kubectl get node

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
   --set global.ngcImagePullSecretName=ngc-docker-reg-secret \
   --set vss.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>"  \
   --set nemo-embedding.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>"  \
   --set nemo-rerank.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>"

This can be done for any service by adding a nodeSelector to the overrides file:

...
<service-name>:
  nodeSelector:
    "kubernetes.io/hostname": "<Name of Node #2>"
...
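
After deploying, you can confirm which node each pod was scheduled on (standard kubectl output; -o wide adds the NODE column):

sudo microk8s kubectl get pod -o wide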

Note

If you have issues with multi-node deployment, try the following:

  • Try setting nodeSelector on each service as shown above when deploying.

  • Try deleting the existing PVCs (sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc) before redeploying.