VSS Customization#

How to customize VSS deployment#

There are many ways to customize the VSS blueprint before deployment:

  • Use Configuration Options to override the default deployment parameters (Helm).

  • Update the values.yaml file of different subcharts of the VSS blueprint (Helm).

  • Use various deployment scenarios provided in Deploy Using Docker Compose.

  • Modify VSS deployment-time configurations: VSS Deployment-Time Configuration Glossary.

  • Customize the VSS container image with required implementations and use the updated image in Helm Chart / Docker Compose.

  • Modify the VSS application source code, build a new container image, and use it in the VSS deployment.

API Configurable Parameters#

At runtime, the /summarize API supports the following parameters. Refer to the VSS ENDPOINT APIs Glossary for complete API details, or check the API schema at http://<VSS_API_ENDPOINT>/docs after VSS is deployed.

  • temperature, seed, top_p, max_tokens, top_k (VLM sampling parameters)

  • num_frames_per_chunk, vlm_input_width, vlm_input_height (VLM frame input configuration)

  • chunk_duration (Duration in seconds to break the video into chunks for generating captions using VLM)

  • chunk_overlap_duration (Duration in seconds for overlap between adjacent chunks)

  • summary_duration (Live stream only - Duration in seconds of streaming video to generate summary for. Summaries will be generated every summary_duration seconds)

  • prompt (VLM prompt)

  • caption_summarization_prompt and summary_aggregation_prompt (Summarization prompts)

  • enable_chat (Enable Q&A - enables graph / vector DB ingestion)

  • enable_cv_metadata, cv_pipeline_prompt (CV pipeline)

  • enable_audio

  • rag_type (graph-rag or vector-rag), rag_top_k, rag_batch_size (RAG configuration)

  • summarize_batch_size, summarize_max_tokens, summarize_temperature, summarize_top_p (Summarization LLM parameters)

  • chat_max_tokens, chat_temperature, chat_top_p (Q&A LLM parameters)

  • notification_max_tokens, notification_temperature, notification_top_p (Alert LLM Parameters)
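As a sketch of how these parameters compose into a request, the snippet below builds a /summarize body with a few of them and posts it using only the standard library. The endpoint value, video id, model name, and prompt text are illustrative assumptions, not values from a real deployment; confirm the exact field names against the API schema at http://<VSS_API_ENDPOINT>/docs.

```python
import json
import urllib.request


def build_summarize_request(video_id: str, model: str = "vila-1.5") -> dict:
    """Assemble a /summarize request body using a subset of the
    parameters listed above. The id, model, and prompt values here
    are placeholders for illustration only."""
    return {
        "id": video_id,
        "model": model,
        # VLM sampling and frame-input parameters
        "temperature": 0.4,
        "top_p": 1.0,
        "max_tokens": 512,
        "num_frames_per_chunk": 10,
        # Chunking: 60 s chunks with 10 s overlap between neighbors
        "chunk_duration": 60,
        "chunk_overlap_duration": 10,
        # Enable Q&A so captions are ingested into the graph/vector DB
        "enable_chat": True,
        "prompt": "Describe any hazardous events. Start and end "
                  "each sentence with a time stamp.",
    }


def post_summarize(endpoint: str, body: dict) -> bytes:
    """POST the request to http://<VSS_API_ENDPOINT>/summarize."""
    req = urllib.request.Request(
        f"{endpoint}/summarize",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```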

Tuning Prompts#

VLM prompts need to be specific to a use case. The prompt must call out the specific events to be detected. The summarization prompts used by CA-RAG also need to be tuned for the use case. All three prompts can be specified in the /summarize API.

Warehouse prompt configuration example:

Prompt Type

Example Prompt

Guidelines

caption

“Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, or forklift stuck. Start and end each sentence with a time stamp.”

  1. This is the prompt to VLM.

  2. Make sure you provide keywords necessary to aid image/video understanding.

  3. Call out the types of events you want the VLM to detect. Example: Anomaly like a person not wearing PPE.

  4. This prompt can be updated using VIA Gradio UI: “PROMPT (OPTIONAL)” field. See: UI application for screenshots.

  5. If you enabled CV pipeline and want to use IDs in event descriptions, add “use IDs in event description” to the prompt.

caption_summarization

“You should summarize the following events of a warehouse in the format start_time:end_time:caption. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don’t return anything else except the bullet points.”

  1. This prompt is used by CA-RAG to summarize the captions generated by the VLM.

  2. This is the first step in a two-step summarization task.

  3. Change it according to your use case.

summary_aggregation

“You are a warehouse monitoring system. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. The output should only contain bullet points. Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage, and Unauthorized Personnel.”

  1. This prompt is used by CA-RAG to generate the final summary.

  2. Change it according to your use case.

caption (prompt for JSON output)

“Find out all the irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, or forklift stuck. Fill the following JSON format with the event information: { “all_events”: [ {“event”: “<event caption>”, “start_time”: <start time of event>, “end_time”: <end time of the event>}]}. Reply only with JSON output.”

  1. This is the prompt to the VLM.

  2. You can change the JSON format to suit your use case.

cv_pipeline_prompt

“vehicle . truck”

  1. This prompt is used by the zero shot object detector (i.e. Grounding DINO) in CV Pipeline.

  2. Make sure you provide keywords necessary to detect the intended objects separated by a dot (.).

  3. You can also specify a detection confidence score threshold in the prompt after a semicolon, e.g. “vehicle . truck;0.5”.

    Note that there can’t be any space before or after the semicolon.
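The prompt format above can also be assembled programmatically. The helper below is a small illustrative sketch (not part of VSS) that joins class names with " . " and appends an optional confidence threshold after the semicolon, with no surrounding spaces:

```python
def build_cv_prompt(classes, threshold=None):
    """Build a zero-shot detector prompt: class names joined by ' . ',
    with an optional ';<threshold>' suffix (no space around ';')."""
    prompt = " . ".join(classes)
    if threshold is not None:
        prompt += f";{threshold}"
    return prompt


# build_cv_prompt(["vehicle", "truck"], 0.5) -> "vehicle . truck;0.5"
```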

Accessing Milvus Vector DB#

VSS uses Milvus vector DB to store the intermediate VLM responses per chunk before aggregating and summarizing the responses using CA-RAG.

VSS blueprint deploys a Milvus vector DB.

For helm, you can access the Milvus vector DB by updating the Milvus service to be a NodePort:

kubectl patch svc milvus-milvus-deployment-milvus-service -p '{"spec":{"type":"NodePort"}}'

kubectl get svc milvus-milvus-deployment-milvus-service # Get the Nodeport

milvus-milvus-deployment-milvus-service   NodePort   10.152.183.217   <none>        19530:30723/TCP,9091:31482/TCP   96m

Note

If using microk8s, prepend the kubectl commands with sudo microk8s. For example, sudo microk8s kubectl ....

The Milvus service can then be accessed by connecting to <NODE_IP>:30723 (the NodePort shown above). You can use standard Milvus tools like milvus_cli or the Milvus Python SDK to interact with the Milvus DB.

For docker compose, Milvus is started internally in the VSS container and is not exposed as a service.

VSS stores the per-chunk metadata and the per-chunk VLM response in the vector DB. The VLM response is stored as a plain string; it is not parsed or stored as structured data. The metadata includes the start and end times of the chunk and the chunk index. The final aggregated summarization response from CA-RAG is not stored.
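As a hedged sketch of such access using the Milvus Python SDK (pymilvus), the snippet below connects to the NodePort-exposed service, lists collections, and dumps a few rows. The collection and field names are assumptions for illustration; inspect the deployed DB for the actual schema.

```python
def milvus_uri(node_ip: str, node_port: int = 30723) -> str:
    """Build the connection URI for the NodePort-exposed Milvus service."""
    return f"http://{node_ip}:{node_port}"


def dump_vlm_captions(node_ip: str, collection: str):
    """Print per-chunk VLM responses from a Milvus collection.
    Field names below are assumed, not guaranteed by VSS."""
    # Deferred import so the sketch can be read without pymilvus installed.
    from pymilvus import MilvusClient

    client = MilvusClient(uri=milvus_uri(node_ip))
    for name in client.list_collections():
        print(name)

    rows = client.query(
        collection_name=collection,
        filter="",  # empty filter: return rows up to `limit`
        output_fields=["start_time", "end_time", "vlm_response"],  # assumed names
        limit=10,
    )
    for row in rows:
        print(row)
```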

Custom Post-Processing Functions#

The output of VLM is stored in a Milvus vector DB. To implement custom post-processing functions, you can connect to the Milvus vector DB and use the information stored in it. For details refer to Accessing Milvus Vector DB.

CA-RAG Configuration#

VSS CA-RAG can be configured using a config file.

Here’s an example configuration for Summarization:

summarization:
   enable: true
   method: "batch"
   llm:
      model: "meta/llama-3.1-70b-instruct"
      base_url: "http://localhost:8000/v1"
      max_tokens: 2048
      temperature: 0.2
      top_p: 0.7
   embedding:
      model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
      base_url: "http://localhost:8000/v1"
   params:
      batch_size: 5
   prompts:
      caption: "Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, forklift stuck, etc. Start and end each sentence with a time stamp."
      caption_summarization: "You should summarize the following events of a warehouse in the format start_time:end_time:caption. For start_time and end_time use . to separate seconds, minutes, hours. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don't return anything else except the bullet points."
      summary_aggregation: "You are a warehouse monitoring system. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. If the event_description is the same as another event_description, aggregate the captions in the format start_time1:end_time1,...,start_timek:end_timek:event_description. If any two adjacent end_time1 and start_time2 is within a few tenths of a second, merge the captions in the format start_time1:end_time2. The output should only contain bullet points.  Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage and Unauthorized Personnel"

The meaning of the attributes is as follows:

  • enable: Enables the summarization. Default: true

  • method: Can be batch or refine. Refer to Summarization for more details about each method. Default: batch

  • batch_size: For method batch, this is the batch size used for combining a batch summary. Default: 5
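To illustrate what batch_size controls in the batch method, here is a simplified sketch (not the actual CA-RAG implementation) of how per-chunk captions are grouped before each batch is summarized by the LLM:

```python
def batch_captions(captions, batch_size=5):
    """Group per-chunk captions into batches of batch_size; each batch
    is summarized by the LLM, then the batch summaries are aggregated."""
    return [captions[i:i + batch_size]
            for i in range(0, len(captions), batch_size)]


captions = [f"caption for chunk {i}" for i in range(12)]
batches = batch_captions(captions, batch_size=5)
# 12 captions with batch_size=5 -> batches of 5, 5, and 2
```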

Here’s an example configuration for Q&A:

chat:
   rag: graph-rag # graph-rag or vector-rag
   params:
      batch_size: 1
      top_k: 5
      multi_channel: true # Enable/Disable multi-stream processing.
      chat_history: false # Enable/Disable chat history.
   llm:
      model: "gpt-4o"
      max_tokens: 2048
      temperature: 0.2
      top_p: 0.7
   embedding:
      model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
      base_url: "http://localhost:8000/v1"
   reranker:
      model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
      base_url: "http://localhost:8000/v1"

Where attributes are:

  • rag: Can be graph-rag or vector-rag. Refer to Data Retrieval (QnA) for more details about each option. Default: graph-rag

  • batch_size: Number of VLM captions batched together when creating the graph.

  • top_k: Number of most relevant retrieval results used for QnA.

  • multi_channel: Enable/Disable multi-stream processing. Default: false. Only supported for graph-rag.

  • chat_history: Enable/Disable chat history. Default: true. Only supported for graph-rag.
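To illustrate what top_k controls during retrieval, here is a minimal, self-contained sketch of top-k selection by cosine similarity over chunk embeddings. The real retrieval is performed by the configured vector or graph store, not by code like this:

```python
import math


def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks whose embeddings are most
    similar to the query embedding (cosine similarity)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cos(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```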

Here’s an example configuration for Alerts:

notification:
  enable: true
  endpoint: "http://127.0.0.1:60000/via-alert-callback"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "http://<IP ADDRESS>:<PORT>/v1/"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7

Here’s an example ca_rag_config.yaml file:

---
summarization:
  enable: true
  method: "batch"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "http://<IP ADDRESS>:<PORT>/v1/"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
  embedding:
    model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
    base_url: "http://<IP ADDRESS>:<PORT>/v1/"
  params:
    batch_size: 6 # Use even batch size if speech recognition enabled.
  prompts:
    caption: "Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, forklift stuck, etc. Start and end each sentence with a time stamp."
    caption_summarization: "You should summarize the following events of a warehouse in the format start_time:end_time:caption. For start_time and end_time use . to separate seconds, minutes, hours. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don't return anything else except the bullet points."
    summary_aggregation: "You are a warehouse monitoring system. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. If the event_description is the same as another event_description, aggregate the captions in the format start_time1:end_time1,...,start_timek:end_timek:event_description. If any two adjacent end_time1 and start_time2 is within a few tenths of a second, merge the captions in the format start_time1:end_time2. The output should only contain bullet points.  Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage and Unauthorized Personnel"

chat:
  rag: graph-rag # graph-rag or vector-rag
  params:
    batch_size: 1
    top_k: 5
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "http://<IP ADDRESS>:<PORT>/v1/"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
  embedding:
    model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
    base_url: "http://<IP ADDRESS>:<PORT>/v1"
  reranker:
    model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
    base_url: "http://<IP ADDRESS>:<PORT>/v1/"

notification:
  enable: true
  endpoint: "http://127.0.0.1:60000/via-alert-callback"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "http://<IP ADDRESS>:<PORT>/v1/"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7

To modify this configuration for helm, update ca_rag_config.yaml in nvidia-blueprint-vss/charts/vss/values.yaml of the VSS Blueprint as required before deploying the Helm Chart. The endpoints are already configured to use the models deployed as part of the Helm Chart.

Overview of the steps:

  1. tar -xzf nvidia-blueprint-vss-2.3.0.tgz.

  2. Open the values.yaml file in an editor of choice: vi nvidia-blueprint-vss/charts/vss/values.yaml

  3. Find the config file content by searching for “ca_rag_config.yaml”. You will see it under the configs: section.

  4. Change the CA-RAG configurations of interest.

  5. Create the Helm Chart tarball with the updated config: tar -czf nvidia-blueprint-vss-2.3.0.tgz nvidia-blueprint-vss

  6. Deploy the new Helm Chart following instructions at Deploy Using Helm.

For docker compose, update the config.yaml file present in the various deployment scenario directories shown in Deploy Using Docker Compose.
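If you script this edit, a minimal plain-text approach might look like the sketch below. This is illustrative only; for nested or quoted keys, a YAML-aware editor is safer.

```python
import re


def set_config_value(yaml_text: str, key: str, value: str) -> str:
    """Rewrite the first scalar `key: value` line in a config text.
    A plain-text sketch; does not validate YAML structure."""
    pattern = rf"(^\s*{re.escape(key)}:\s*).*$"
    return re.sub(pattern, rf"\g<1>{value}", yaml_text,
                  count=1, flags=re.MULTILINE)


cfg = 'summarization:\n  method: "batch"\n  params:\n    batch_size: 5\n'
cfg = set_config_value(cfg, "batch_size", "6")
```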

Tuning Guardrails#

VSS supports Guardrails for user input and provides a default Guardrails configuration. VSS uses NVIDIA NeMo Guardrails to provide this functionality.

To modify Guardrails configuration for helm, update guardrails_config.yaml section in nvidia-blueprint-vss/charts/vss/values.yaml file of VSS Helm Chart. Refer to the Nemo Guardrails General instructions to update that section.

Overview of the steps:

  1. tar -xzf nvidia-blueprint-vss-2.3.0.tgz.

  2. Open the values.yaml file in an editor of choice: vi nvidia-blueprint-vss/charts/vss/values.yaml

  3. Find the config file content by searching for “guardrails_config.yaml”. You will see it under the configs: section.

  4. Change the Guardrails configurations of interest.

  5. Create the Helm Chart tarball with the updated config: tar -czf nvidia-blueprint-vss-2.3.0.tgz nvidia-blueprint-vss

  6. Deploy the new Helm Chart following instructions at Deploy Using Helm.

For docker compose, update the contents of guardrails directory present in the various deployment scenario directories shown in Deploy Using Docker Compose.

Custom Container Image with Codecs Installed#

To use a custom container image with proprietary codecs installed, you need to build a custom container image that includes the codecs.

Overview of the steps for building the container image:

  1. Create a Dockerfile that installs the proprietary codecs.

FROM nvcr.io/nvidia/blueprint/vss-engine:2.3.0
RUN bash /opt/nvidia/via/user_additional_install.sh
  2. Build the custom container image.

docker build -t <custom_image_name> -f Dockerfile .
  3. Push the custom container image to a container registry.

docker push <custom_image_name>

For deploying the new image using helm, follow the steps below:

  1. Create a new image pull secret for the custom container image repository.

sudo microk8s kubectl create secret docker-registry <secret_name> --docker-server=nvcr.io \
   --docker-username=<username> --docker-password=<password>
  2. Update the overrides file to use the custom container image.

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          image:
            repository: <custom_image_name_repo>
            tag: <custom_image_name_tag>
  imagePullSecrets:
    - name: <secret_name>
  3. Deploy the helm chart with the overrides file.

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
   --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml

For deploying the new image using docker compose, follow the steps below:

  1. Add VIA_IMAGE=<custom_image_name> to the .env file.

  2. Run docker compose up to start the VSS deployment.

VSS Source Code#

VSS source code is available at NVIDIA-AI-Blueprints/video-search-and-summarization.

You can follow the steps in the source code README to modify the source code and build a new container image.

Then follow the steps in deploying custom image.