Deploy the Blueprint#
Create Required Secrets#
To deploy the secrets required by the VSS Blueprint:
Note
If using microk8s, prepend the kubectl commands with sudo microk8s. For example, sudo microk8s kubectl ....
For information about joining the group for admin access (to avoid using sudo) and other microk8s setup/usage details, see https://microk8s.io/docs/getting-started.
If not using microk8s, you can use kubectl directly. For example, kubectl get pod.
# Create the NGC image pull secret
sudo microk8s kubectl create secret docker-registry ngc-docker-reg-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
# Create the neo4j db credentials secret
sudo microk8s kubectl create secret generic graph-db-creds-secret --from-literal=username=neo4j --from-literal=password=password
# Create NGC Secret
sudo microk8s kubectl create secret generic ngc-api-key-secret --from-literal=NGC_API_KEY=$NGC_API_KEY
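As a quick optional sanity check, you can confirm that the three secrets exist before continuing (standard kubectl; the secret names are the ones created above):
# Optional: confirm the secrets were created
sudo microk8s kubectl get secret ngc-docker-reg-secret graph-db-creds-secret ngc-api-key-secret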
Deploy the Helm Chart#
To deploy the VSS Blueprint Helm Chart:
# Fetch the VSS Blueprint Helm Chart
sudo microk8s helm fetch https://helm.ngc.nvidia.com/nvidia/blueprint/charts/nvidia-blueprint-vss-2.2.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY
# Install the Helm Chart
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.2.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret
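Optionally, confirm that the Helm release was created before waiting for the pods; the following is a standard helm command, and the release name matches the install command above:
# Optional: confirm the Helm release is installed
sudo microk8s helm list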
Note
When running on an L40 / L40S system, the default startup probe timeout might not be enough for the VILA model to be downloaded and its TRT-LLM engine to be built. To prevent a startup timeout, increase the startup timeout by passing --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360 to the helm install command, or by adding the following to the overrides file.
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          startupProbe:
            failureThreshold: 360
Note
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
By default, Kubernetes schedules the pods based on resource availability. To explicitly schedule particular pods on particular nodes, first list the node names and then pass nodeSelector overrides to the helm install command. The following example schedules the embedding and reranking services on the second node.
sudo microk8s kubectl get node
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.2.0.tgz \
--set global.ngcImagePullSecretName=ngc-docker-reg-secret \
--set nemo-embedding.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>" \
--set nemo-rerank.nodeSelector."kubernetes\.io/hostname"="<Name of Node #2>"
Wait for all services to be up. This may take some time (5-15 minutes or more) depending on the setup. Make sure all pods are in Running or Completed STATUS and show 1/1 as READY. You can monitor the services using the following command:
sudo watch microk8s kubectl get pod
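As an alternative to watching manually, a kubectl wait sketch such as the following can block until the pods report Ready (an optional convenience; it assumes all pods in the default namespace belong to this deployment, and the field selector that skips Completed job pods requires a reasonably recent kubectl):
# Optional: block until all (non-Completed) pods are Ready, with a 30 minute timeout
sudo microk8s kubectl wait --for=condition=Ready pod --all --field-selector=status.phase!=Succeeded --timeout=1800s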
To make sure the VSS UI is ready and accessible, check the logs of the VSS deployment pod using the following command:
sudo microk8s kubectl logs vss-vss-deployment-POD-NAME
Make sure the following log lines are present and that no errors are reported:
Application startup complete.
Uvicorn running on http://0.0.0.0:9000
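Because the pod name includes a generated suffix, a small sketch like the following can look up the full name before tailing the logs (it assumes the pod name starts with vss-vss-deployment, as in the default topology):
# Look up the full VSS pod name and follow its logs
VSS_POD=$(sudo microk8s kubectl get pod -o name | grep vss-vss-deployment)
sudo microk8s kubectl logs -f "$VSS_POD"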
Default Deployment Topology and Models in Use#
The default deployment topology is as follows. This is the topology that you see when checking deployment status using sudo microk8s kubectl get pod.
Microservice/Pod | Description | Default #GPU Allocation
---|---|---
vss-blueprint-0 | The NIM LLM (llama-3.1). | 4
vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is VILA1.5) + Retrieval Pipeline (CA RAG/LLM: llama3.1 + nemo embedding + nemo reranking) | 2
nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1
nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1
etcd-etcd-deployment, milvus-milvus-deployment, neo4j-neo4j-deployment | Milvus and Neo4j databases used for vector and graph storage. | NA
Launch VSS UI#
Find the ports where the VSS REST API and UI server are running:
sudo microk8s kubectl get svc vss-service
vss-service NodePort <CLUSTER_IP> <none> 8000:32114/TCP,9000:32206/TCP 12m
The NodePorts corresponding to ports 8000 and 9000 are for the REST API (VSS_API_ENDPOINT) and the UI respectively.
The VSS UI is available at the NodePort corresponding to 9000. In this example, the VSS UI can be accessed by opening http://<NODE_IP>:32206 in a browser. The VSS REST API address is http://<NODE_IP>:32114.
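If you prefer to read the NodePorts programmatically instead of from the service listing, a jsonpath query such as the following should work (standard kubectl; ports 8000 and 9000 are the service ports shown above):
# Print the NodePort for the REST API (port 8000) and the UI (port 9000)
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==8000)].nodePort}'; echo
sudo microk8s kubectl get svc vss-service -o jsonpath='{.spec.ports[?(@.port==9000)].nodePort}'; echo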
Note
<NODE_IP> is the IP address of the machine where the Helm Chart is deployed, that is, the node running the vss-vss-deployment pod.
Next, test the deployment by summarizing a sample video. Continue to Configuration Options and VSS Customization when you are ready to customize VSS.
If you run into any errors, refer to the FAQ and Known Issues sections.
Configuration Options#
Some options in the Helm Chart can be configured using the Helm overrides file.
An example of the overrides.yaml file:
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          image:
            repository: nvcr.io/nvidia/blueprint/vss-engine
            tag: 2.2.0 # Update to override with custom VSS image
          env:
            - name: VLM_MODEL_TO_USE
              value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
            # Specify path in case of VILA-1.5 and custom model. Can be either
            # a NGC resource path or a local path. For custom models this
            # must be a path to the directory containing "inference.py" and
            # "manifest.yaml" files.
            - name: MODEL_PATH
              value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
            - name: DISABLE_GUARDRAILS
              value: "false" # "true" to disable guardrails.
            - name: TRT_LLM_MODE
              value: "" # int4_awq (default), int8 or fp16. (for VILA only)
            - name: VLM_BATCH_SIZE
              value: "" # Default is determined based on GPU memory. (for VILA only)
            - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
              value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
            - name: VIA_VLM_ENDPOINT
              value: "" # Default OpenAI API. Override to use a custom API
            - name: VIA_VLM_API_KEY
              value: "" # API key to set when calling VIA_VLM_ENDPOINT
            - name: OPENAI_API_VERSION
              value: ""
            - name: AZURE_OPENAI_API_VERSION
              value: ""
  resources:
    limits:
      nvidia.com/gpu: 2 # Set to 8 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-1>
nim-llm:
  resources:
    limits:
      nvidia.com/gpu: 4
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
nemo-embedding:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
nemo-rerank:
  resources:
    limits:
      nvidia.com/gpu: 1 # Set to 2 for 2 x 8H100 node deployment
  # nodeSelector:
  #   kubernetes.io/hostname: <node-2>
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.2.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
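If the chart is already installed and only the overrides changed, a helm upgrade along these lines can be used instead of reinstalling (a sketch; whether an in-place upgrade is sufficient depends on which values changed, and a clean reinstall remains the safer path):
# Optional: apply changed overrides to an existing release
sudo microk8s helm upgrade vss-blueprint nvidia-blueprint-vss-2.2.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml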
Note
If you have issues with multi-node deployment, try the following:
- Set nodeSelector on each service as shown above when deploying.
- Delete existing PVCs before redeploying:
sudo microk8s kubectl delete pvc model-store-vss-blueprint-0 vss-ngc-model-cache-pvc
Optional Deployment Topology with GPU sharing#
You can achieve the deployment topology below by applying the Helm overrides file.
Note
This is only for H100 / A100 (80+ GB device memory).
Microservice/Pod | Description | Default #GPU Allocation
---|---|---
vss-blueprint-0 | The NIM LLM (llama-3.1). | 4 (index: 0,1,2,3)
vss-vss-deployment | VSS Ingestion Pipeline (VLM: default is VILA1.5) + Retrieval Pipeline (CA RAG/LLM: llama3.1 + nemo embedding + nemo reranking) | 4 (index: 4,5,6,7)
nemo-embedding-embedding-deployment | NeMo Embedding model used in Retrieval Pipeline | 1 (index: 4)
nemo-rerank-ranking-deployment | NeMo Reranking model used in Retrieval Pipeline | 1 (index: 4)
etcd-etcd-deployment, milvus-milvus-deployment, neo4j-neo4j-deployment | Milvus and Neo4j databases used for vector and graph storage. | NA
The following Helm overrides.yaml file changes the deployment topology to use GPU sharing.
# 4 + 4 with GPU Sharing topology example
# in a host with 8xH100
# nim-llm on GPU 0,1,2,3
# vss(VLM) on GPU 4,5,6,7
# embedding on GPU 4
# reranking on GPU 4
nim-llm:
  env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0,1,2,3"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
            - name: VLM_MODEL_TO_USE
              value: vila-1.5 # Or "openai-compat" or "custom" or "nvila"
            - name: MODEL_PATH
              value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
            - name: DISABLE_GUARDRAILS
              value: "false" # "true" to disable guardrails.
            - name: TRT_LLM_MODE
              value: "" # int4_awq (default), int8 or fp16. (for VILA only)
            - name: VLM_BATCH_SIZE
              value: "" # Default is determined based on GPU memory. (for VILA only)
            - name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
              value: "" # Set to use a VLM exposed as a REST API with OpenAI compatible API (e.g. gpt-4o)
            - name: VIA_VLM_ENDPOINT
              value: "" # Default OpenAI API. Override to use a custom API
            - name: VIA_VLM_API_KEY
              value: "" # API key to set when calling VIA_VLM_ENDPOINT
            - name: OPENAI_API_VERSION
              value: ""
            - name: AZURE_OPENAI_API_VERSION
              value: ""
            - name: NVIDIA_VISIBLE_DEVICES
              value: "4,5,6,7"
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: NGC_API_KEY
                  name: ngc-api-key-secret
            - name: NVIDIA_VISIBLE_DEVICES
              value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  key: NGC_API_KEY
                  name: ngc-api-key-secret
            - name: NVIDIA_VISIBLE_DEVICES
              value: '4'
  resources:
    limits:
      nvidia.com/gpu: 0 # no limit
To apply the overrides file while deploying the VSS Helm Chart, run:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.2.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
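To confirm that the GPU sharing topology took effect, you can check which GPUs each container sees, for example by running nvidia-smi inside the VSS pod (a sketch; it assumes nvidia-smi is available in the container image, which is typical for NVIDIA GPU containers, and that the pod name starts with vss-vss-deployment):
# Optional: list the GPUs visible inside the VSS container
VSS_POD=$(sudo microk8s kubectl get pod -o name | grep vss-vss-deployment)
sudo microk8s kubectl exec "$VSS_POD" -- nvidia-smi -L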
Note
Limitation
This approach bypasses the Kubernetes device plugin resource allocation.
The Helm Chart might not work in managed Kubernetes services (e.g. AKS) where this environment variable is not allowed.
On clusters where CDI is enabled, this variable is ignored and GPUs are allocated randomly, so GPU sharing won't work as expected.