VSS Customization#

Customize VSS Deployment#

There are many ways to customize the VSS blueprint before deployment:

  • Use Configuration Options to override the default deployment parameters (Helm).

  • Update the values.yaml file of different subcharts of the VSS blueprint (Helm).

  • Use various deployment scenarios provided in Deploy Using Docker Compose X86.

  • Modify VSS deployment time configurations: VSS Deployment-Time Configuration Glossary.

  • Customize the VSS container image with required implementations and use the updated image in Helm Chart or Docker Compose.

  • Modify the VSS application source code, build a new container image and use in the VSS deployment.

API Configurable Parameters#

At runtime, the /summarize API supports the following parameters. Refer to the VSS API Glossary for complete API details, or check the API schema at http://<VSS_API_ENDPOINT>/docs after VSS is deployed. A sample request is sketched after the list below.

  • temperature, seed, top_p, max_tokens, top_k (VLM sampling parameters)

  • num_frames_per_chunk, vlm_input_width, vlm_input_height (VLM frame input configuration)

  • chunk_duration (Duration in seconds to break the video into chunks for generating captions using VLM)

  • chunk_overlap_duration (Duration in seconds for overlap between adjacent chunks)

  • summary_duration (Live stream only - Duration in seconds of streaming video for which to generate a summary. Summaries will be generated every summary_duration seconds)

  • prompt (VLM prompt)

  • caption_summarization_prompt and summary_aggregation_prompt (Summarization prompts)

  • enable_chat (Enable Q&A - enables graph / vector DB ingestion)

  • enable_cv_metadata, cv_pipeline_prompt (CV pipeline)

  • enable_audio

  • summarize_batch_size, summarize_max_tokens, summarize_temperature, summarize_top_p (Summarization LLM parameters)

  • chat_max_tokens, chat_temperature, chat_top_p (Q&A LLM parameters)

  • notification_max_tokens, notification_temperature, notification_top_p (Alert LLM Parameters)
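
For illustration, here is a minimal sketch of a /summarize request that sets a few of these parameters from Python. The endpoint, file id, and parameter values are placeholders; consult the API schema at http://<VSS_API_ENDPOINT>/docs for the authoritative list of required fields.

import requests

# Placeholders: replace with your VSS endpoint and the id returned by the /files [POST] API.
VSS_API_ENDPOINT = "http://localhost:8100"
FILE_ID = "<video-file-id>"

payload = {
    "id": FILE_ID,
    "prompt": "Describe irregular or hazardous events in the warehouse video.",
    "chunk_duration": 10,         # seconds of video per chunk
    "num_frames_per_chunk": 10,   # frames sampled per chunk (model-specific maximums apply)
    "temperature": 0.2,           # VLM sampling parameter
    "max_tokens": 512,
    "enable_chat": True,          # also ingest into the graph/vector DB for Q&A
}

# The /docs schema is authoritative; additional fields (for example, the model name) may be required.
response = requests.post(f"{VSS_API_ENDPOINT}/summarize", json=payload)
response.raise_for_status()
print(response.json())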

Note

Frame Limits for num_frames_per_chunk Parameter

The maximum number of frames per chunk varies by VLM model:

  • Cosmos-Reason1: Maximum of 128 frames per chunk

  • VILA-1.5: Maximum of 16 frames per chunk

  • NVILA research model: Maximum of 128 frames per chunk

  • OpenAI ChatGPT-4o: Refer to the OpenAI API documentation for current limitations of their service. VSS sets a default of 128 frames per chunk, but the appropriate value depends on the model.

  • Custom models: Maximum of 128 frames per chunk

Exceeding these limits will result in API errors.

Tuning the Input Vision Token Length for Cosmos-Reason1#

The size of the input vision tokens affects the performance of the Cosmos-Reason1 VLM model. The lower the input vision token length, the faster the model can process the input; however, it may also reduce the accuracy of the model.

The Cosmos-Reason1 VLM model has a maximum vision input token length of 16384. VSS is configured to use a 2K vision token length by default, that is, 20 frames per chunk at a resolution of 532x280.

The size of the input vision tokens is controlled by the resolution of the input frames and the number of frames. The exact relation is:

vision_token_length = ceil(num_frames_per_chunk / 2) * ceil(vlm_input_width / 28) * ceil(vlm_input_height / 28)

Thus the vision token length can be controlled by the following parameters of the /summarize and /generate_vlm_captions APIs, or by setting their defaults through environment variables (see the sketch after this list):

  • num_frames_per_chunk: Number of frames per chunk (that is, per VLM call). VLM_DEFAULT_NUM_FRAMES_PER_CHUNK environment variable can be set to configure the default number of frames per chunk.

  • vlm_input_width: Width of the input frames to the VLM. VLM_INPUT_WIDTH environment variable can be set to configure the default width of the input frames to the VLM.

  • vlm_input_height: Height of the input frames to the VLM. VLM_INPUT_HEIGHT environment variable can be set to configure the default height of the input frames to the VLM.
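
As a quick check of the relation above, the following sketch computes the vision token length for a given configuration; the default of 20 frames at 532x280 comes out to roughly 2K tokens.

import math

def vision_token_length(num_frames_per_chunk: int, vlm_input_width: int, vlm_input_height: int) -> int:
    # Mirrors the relation above: ceil(frames / 2) * ceil(width / 28) * ceil(height / 28)
    return (
        math.ceil(num_frames_per_chunk / 2)
        * math.ceil(vlm_input_width / 28)
        * math.ceil(vlm_input_height / 28)
    )

print(vision_token_length(20, 532, 280))   # 1900  -> ~2K (default)
print(vision_token_length(20, 1036, 588))  # 7770  -> ~8K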

Sample configuration values for various vision token lengths:

Vision Token Length | Number of Frames per Chunk | Input Resolution
2K (default)        | 20                         | 532 x 280
4K                  | 20                         | 756 x 392
8K                  | 20                         | 1036 x 588
16K                 | 20                         | 1484 x 840 or higher

If a higher input resolution would exceed the maximum vision token length, the frames are automatically scaled down to a lower resolution to keep the vision token length within the maximum limit.

Tuning Prompts#

VLM prompts need to be specific to a use case. The prompt must call out the specific events that need to be detected. The summarization prompts used by CA-RAG also need to be tuned for the use case. The three prompts can be specified in the /summarize API, as sketched after the table below.

Warehouse prompt configuration example:

caption

  Example prompt: “Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, or forklift stuck. Start and end each sentence with a time stamp.”

  Guidelines:

  1. This is the prompt to the VLM.

  2. Make sure you provide keywords necessary to aid image/video understanding.

  3. Call out the types of events you want the VLM to detect, for example, an anomaly like a person not wearing PPE.

  4. This prompt can be updated using the VIA Gradio UI “PROMPT (OPTIONAL)” field. Refer to UI application for screenshots.

  5. If you enabled the CV pipeline and want to use IDs in event descriptions, add “use IDs in event description” to the prompt.

caption_summarization

  Example prompt: “You should summarize the following events of a warehouse in the format start_time:end_time:caption. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don’t return anything else except the bullet points.”

  Guidelines:

  1. This prompt is used by CA-RAG to summarize the captions generated by the VLM.

  2. This is the first step in a two-step summarization task.

  3. Change it according to your use case.

summary_aggregation

  Example prompt: “You are a warehouse monitoring system. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. The output should only contain bullet points. Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage, and Unauthorized Personnel.”

  Guidelines:

  1. This prompt is used by CA-RAG to generate the final summary.

  2. Change it according to your use case.

caption (prompt for JSON output)

  Example prompt: “Find out all the irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, or forklift stuck. Fill the following JSON format with the event information: { “all_events”: [ {“event”: “<event caption>”, “start_time”: <start time of event>, “end_time”: <end time of the event>}]}. Reply only with JSON output.”

  Guidelines:

  1. This is the prompt to the VLM.

  2. You can change the JSON format to suit your use case.

cv_pipeline_prompt

  Example prompt: “vehicle . truck”

  Guidelines:

  1. This prompt is used by the zero-shot object detector (that is, Grounding DINO) in the CV pipeline.

  2. Make sure you provide the keywords necessary to detect the intended objects, separated by a dot (.).

  3. You can also specify a detection confidence score threshold in the prompt after a semicolon, for example, “vehicle . truck;0.5”. There can’t be any space before or after the semicolon.
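
These prompts map directly onto the prompt, caption_summarization_prompt, and summary_aggregation_prompt fields of the /summarize request. A minimal sketch extending the earlier request example (prompt text shortened here; the id is a placeholder):

payload = {
    "id": "<video-file-id>",  # placeholder: id returned by the /files [POST] API
    "prompt": "Write a concise and clear dense caption for the provided warehouse video...",
    "caption_summarization_prompt": "You should summarize the following events of a warehouse...",
    "summary_aggregation_prompt": "You are a warehouse monitoring system...",
}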

Accessing Milvus Vector DB#

VSS uses Milvus vector DB to store the intermediate VLM responses per chunk before aggregating and summarizing the responses using CA-RAG.

VSS blueprint deploys a Milvus vector DB.

For Helm, you can access the Milvus vector DB by updating the Milvus service to be a NodePort:

kubectl patch svc milvus-milvus-deployment-milvus-service -p '{"spec":{"type":"NodePort"}}'

kubectl get svc milvus-milvus-deployment-milvus-service # Get the Nodeport

milvus-milvus-deployment-milvus-service   NodePort   10.152.183.217   <none>        19530:30723/TCP,9091:31482/TCP   96m

Note

If using microk8s, prepend the kubectl commands with sudo microk8s. For example, sudo microk8s kubectl ....

The Milvus service can then be accessed by connecting to <NODE_IP>:30723 (the NodePort shown above). You can use standard Milvus tools like milvus_cli or the Milvus Python SDK to interact with the Milvus DB.

For Docker Compose, Milvus is started internally in the VSS container and is not exposed as a service.

VSS stores the per-chunk metadata and the per-chunk VLM response in the vector DB. The VLM response is stored as a plain string; it is not parsed or stored as structured data. The metadata includes the start and end times of the chunk and the chunk index. The final aggregated summarization response from CA-RAG is not stored.
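
To inspect what is stored, here is a minimal sketch using the Milvus Python SDK (pymilvus), assuming the NodePort setup above; the collection names created by VSS are not fixed here, so list them first.

from pymilvus import Collection, connections, utility

# Placeholders: node IP and the NodePort mapped to Milvus port 19530 (30723 in the example above).
connections.connect(alias="default", host="<NODE_IP>", port="30723")

# Discover the collections VSS created.
names = utility.list_collections()
print(names)

# Inspect one collection's schema and entity (chunk) count.
collection = Collection(names[0])
print(collection.schema)
print(collection.num_entities)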

Custom Post-processing Functions#

The output of VLM is stored in a Milvus vector DB. To implement custom post-processing functions, you can connect to the Milvus vector DB and use the information stored in it. For details refer to Accessing Milvus Vector DB.

CA-RAG Configuration#

VSS CA-RAG can be configured using a config file that follows a modular structure. The configuration is organized into tools and functions.

The configuration has three main sections:

  • tools: Defines various tools like databases, LLMs, embeddings that can be used by the functions

  • functions: Defines the functional components that use the tools

  • context_manager: Specifies which functions are active and will be initialized

For example, the configuration file is organized as follows:

tools:
   ## Tool definitions
functions:
   ## Function definitions
context_manager:
   ## Context manager definitions

Here’s an example configuration for a tool:

tools:
  nvidia_embedding: # This is the embedding model
    type: embedding
    params:
      model: nvidia/llama-3.2-nv-embedqa-1b-v2
      base_url: https://integrate.api.nvidia.com/v1
      api_key: !ENV ${NVIDIA_API_KEY}

  graph_db: # This is a neo4j database
    type: neo4j
    params:
      host: !ENV ${GRAPH_DB_HOST}
      port: !ENV ${GRAPH_DB_PORT}
      username: !ENV ${GRAPH_DB_USERNAME}
      password: !ENV ${GRAPH_DB_PASSWORD}
    tools:
      embedding: nvidia_embedding

Here we define an embedding model called nvidia_embedding and a database called graph_db.

For each tool, we define its type, its parameters, and any other tools it uses.

For example, nvidia_embedding is a tool of type embedding and graph_db is a tool of type neo4j.

Tool types are defined by CA-RAG and are used to identify the type of tool. Tool names are user defined and are used to reference the tool in the functions and other tools.

For example, in the graph_db tool, we are using the nvidia_embedding tool to embed the data.

The keyword embedding is defined in the neo4j tool type implementation and will be used to reference the nvidia_embedding tool.

The !ENV prefix is used to reference environment variables.
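
The exact loader is internal to CA-RAG, but as a rough sketch of how a !ENV tag is typically resolved with PyYAML (the environment variable value below is a placeholder; the actual CA-RAG loader may differ):

import os
import re
import yaml

_ENV_PATTERN = re.compile(r"\$\{([^}]+)\}")

def _env_constructor(loader, node):
    # Replace ${VAR} occurrences in the tagged scalar with the environment variable's value.
    value = loader.construct_scalar(node)
    return _ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), ""), value)

yaml.SafeLoader.add_constructor("!ENV", _env_constructor)

os.environ["NVIDIA_API_KEY"] = "nvapi-..."  # placeholder
config = yaml.safe_load("""
tools:
  nvidia_embedding:
    type: embedding
    params:
      api_key: !ENV ${NVIDIA_API_KEY}
""")
print(config["tools"]["nvidia_embedding"]["params"]["api_key"])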

Here’s an example configuration for a function:

functions:
   retriever_function: # This is the retriever function
      type: graph_retrieval
      params:
         batch_size: 1
         multi_channel: true # Enable/Disable multi-stream processing.
         chat_history: false # Enable/Disable chat history.
      tools:
         llm: chat_llm
         db: graph_db

Here we define a function called retriever_function. It is a function of type graph_retrieval.

The params section is used to define the parameters for the function.

The tools section is used to define the tools that are used by the function.

The keyword llm is defined in the graph_retrieval function type implementation and will be used to reference the chat_llm tool in the retriever_function.

The keyword db is defined in the graph_retrieval function type implementation and will be used to reference the graph_db tool in the retriever_function.

Here’s an example configuration for a context manager:

context_manager:
   functions:
      - retriever_function
      - ## Add more functions here

Here we define the functions that will be added to the context manager.

Only the functions listed in the context manager will be loaded and available, so you can define many functions and tools but load only the ones you need.

Here is the complete CA-RAG configuration used in VSS:

tools:
  graph_db: # This is the database used for the graph-rag retrieval
    type: neo4j
    params:
      host: !ENV ${GRAPH_DB_HOST}
      port: !ENV ${GRAPH_DB_PORT}
      username: !ENV ${GRAPH_DB_USERNAME}
      password: !ENV ${GRAPH_DB_PASSWORD}
    tools:
      embedding: nvidia_embedding

  vector_db: # This is the database used for the vector-rag retrieval
    type: milvus
    params:
      host: !ENV ${MILVUS_DB_HOST}
      port: !ENV ${MILVUS_DB_PORT}
    tools:
      embedding: nvidia_embedding

  chat_llm: # This is the LLM used for the summarization, ingestion, and retriever functions
    type: llm
    params:
      model: meta/llama-3.1-70b-instruct
      base_url: https://integrate.api.nvidia.com/v1
      max_tokens: 4096
      temperature: 0.5
      top_p: 0.7
      api_key: !ENV ${NVIDIA_API_KEY}

  nvidia_embedding: # This is the embedding model
    type: embedding
    params:
      model: nvidia/llama-3.2-nv-embedqa-1b-v2
      base_url: https://integrate.api.nvidia.com/v1
      api_key: !ENV ${NVIDIA_API_KEY}

  nvidia_reranker: # This is the reranker model
    type: reranker
    params:
      model: nvidia/llama-3.2-nv-rerankqa-1b-v2
      base_url: https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking
      api_key: !ENV ${NVIDIA_API_KEY}

  notification_tool: # This is the notification tool
    type: alert_sse_notifier
    params:
      endpoint: "http://127.0.0.1:60000/via-alert-callback"

functions:
  summarization: # This is the summarization function
    type: batch_summarization
    params:
      batch_size: 6 # This is the batch size used for combining a batch summary
      batch_max_concurrency: 20
      prompts:
        caption: "Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, forklift stuck, etc. Start and end each sentence with a time stamp."
        caption_summarization: "You should summarize the following events of a warehouse in the format start_time:end_time:caption. For start_time and end_time use . to seperate seconds, minutes, hours. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don't return anything else except the bullet points."
        summary_aggregation: "You are a warehouse monitoring system. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. If the event_description is the same as another event_description, aggregate the captions in the format start_time1:end_time1,...,start_timek:end_timek:event_description. If any two adjacent end_time1 and start_time2 is within a few tenths of a second, merge the captions in the format start_time1:end_time2. The output should only contain bullet points.  Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage and Unauthorized Personnel"
    tools:
      llm: chat_llm
      db: graph_db

  ingestion_function: # This is the ingestion function
    type: graph_ingestion
    params:
      batch_size: 1 # Number of vlm captions to be batched together for creating graph.
    tools:
      llm: chat_llm
      db: graph_db

  retriever_function: # This is the retriever function
    type: graph_retrieval
    params:
      batch_size: 1
      multi_channel: true # Enable/Disable multi-stream processing.
      chat_history: false # Enable/Disable chat history.
    tools:
      llm: chat_llm
      db: graph_db

  notification: # This is the notification function
    type: notification
    params:
      events: []
    tools:
      llm: chat_llm
      notification_tool: notification_tool

context_manager: # This is the context manager
  functions: # This is the list of functions that will be initialized
    - summarization
    - ingestion_function
    - retriever_function
    - notification

In this configuration structure:

  • Tools section: Defines reusable components like databases (Neo4j, Milvus, Elasticsearch), LLMs (OpenAI, NVIDIA), embeddings, rerankers, and other services. Each tool has a type and configuration parameters.

  • Functions section: Defines the functional components that utilize the tools. Functions reference tools by name and specify which tools they use for different purposes (for example, llm, db, embedding).

  • Context manager: Specifies the functions that are active and will be initialized. Only functions listed in the context manager will be loaded and available.

Key advantages of this structure:

  • Modularity: Tools can be shared across multiple functions

  • Flexibility: Easy to swap different tools (for example, different LLMs) without changing function definitions

  • Environment configuration: Uses environment variables for sensitive information like API keys and host configurations

  • Selective initialization: Only specified functions in the context manager are loaded, improving performance

To see a complete and detailed list of tools, functions, and context manager options, refer to the Context-Aware RAG documentation.

To modify this configuration for Helm, update ca_rag_config.yaml in nvidia-blueprint-vss/charts/vss/values.yaml of the VSS Blueprint as required before deploying the Helm Chart. The endpoints are already configured to use the models deployed as part of the Helm Chart.

Overview on the steps:

  1. tar -xzf nvidia-blueprint-vss-2.4.0.tgz.

  2. Open the values.yaml file in an editor of choice: vi nvidia-blueprint-vss/charts/vss/values.yaml

  3. Find the config file content by searching for “ca_rag_config.yaml”. You will find it under the configs: section.

  4. Change the CA-RAG configurations of interest.

  5. Create the Helm Chart tarball with the updated config: tar -czf nvidia-blueprint-vss-2.4.0.tgz nvidia-blueprint-vss

  6. Deploy the new Helm Chart following instructions at Deploy Using Helm.

For more details on the CA-RAG helm overrides, refer to Configuring CA-RAG Configuration.

For Docker Compose, update the config.yaml file present in the various deployment scenario directories shown in Deploy Using Docker Compose X86.

NIM Model Profile Optimization#

For optimal performance on specific hardware platforms and GPU topologies, NVIDIA NIMs support hardware-specific model profiles that are optimized for different configurations.

NIM Version Recommendations:

  • All platforms except RTX PRO 6000: Use Llama 3.1 70B with NIM version 1.10.1

  • RTX PRO 6000 only: Use Llama 3.1 70B with NIM version 1.13.1 and apply the corresponding profile overrides from the table below

Profile Overrides for Specific Configurations:

RTX PRO 6000 Blackwell (NIM version 1.13.1):

  • RTX PRO 6000 Blackwell, 4 GPUs for LLM

    Profile ID override: f51a862830b10eb7d0d2ba51184d176a0a37674fef85300e4922b924be304e2b

    Profile name: tensorrt_llm-rtx6000_blackwell_sv-bf16-tp4-pp1-throughput-2bb5

  • RTX PRO 6000 Blackwell, 2 GPUs for LLM

    Profile ID override: 22bf424d6572fda243f890fe1a7c38fb0974c3ab27ebbbc7e2a2848d7af82bd6

    Profile name: tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp2-pp1-latency-2bb5

4-GPU Configurations (NIM version 1.10.1):

  • B200, 4 GPUs for LLM

    Profile ID override: f17543bf1ee65e4a5c485385016927efe49cbc068a6021573d83eacb32537f76

    Profile name: tensorrt_llm-b200-bf16-tp4-pp1-latency-2901:10de-4

  • H200, 4 GPUs for LLM

    Profile ID override: 99142c13a095af184ae20945a208a81fae8d650ac0fd91747b03148383f882cf

    Profile name: tensorrt_llm-h200-bf16-tp4-pp1-latency-2335:10de-4

How to Apply Profile Overrides:

Only apply profile overrides for the specific configurations listed in the tables above.

For Helm deployments, add the profile override to your overrides.yaml file:

nim-llm:
  profile: "<profile_id_from_table>"

Example for B200 4-GPU configuration:

nim-llm:
  profile: "f17543bf1ee65e4a5c485385016927efe49cbc068a6021573d83eacb32537f76"

Example for RTX PRO 6000 Blackwell 4-GPU configuration (with LLM NIM version 1.13.1):

nim-llm:
  profile: "f51a862830b10eb7d0d2ba51184d176a0a37674fef85300e4922b924be304e2b"
  image:
    tag: 1.13.1

For Docker Compose deployments, apply the profile by adding the environment variable to your Docker run command:

docker run -e NIM_MODEL_PROFILE="<profile_id_from_table>" \
    # ... other docker run parameters

Summary:

  • Multi-GPU (70B model): Use Llama 3.1 70B with NIM version 1.10.1 for all platforms except RTX PRO 6000

  • Single GPU (8B model): Use Llama 3.1 8B with NIM version 1.12.0

  • RTX PRO 6000: Use Llama 3.1 70B with NIM version 1.13.1 and apply profile overrides for optimal performance

  • B200 4-GPU: When using NIM version 1.10.1, apply the specific profile override for 4-GPU configurations

Tuning Guardrails#

VSS supports Guardrails for user input and provides a default Guardrails configuration. VSS uses NVIDIA NeMo Guardrails to provide this functionality.

To modify the Guardrails configuration for Helm, update the guardrails_config.yaml section in the nvidia-blueprint-vss/charts/vss/values.yaml file of the VSS Helm Chart. Refer to the NeMo Guardrails general instructions to update that section.

Overview on the steps:

  1. tar -xzf nvidia-blueprint-vss-2.4.0.tgz.

  2. Open the values.yaml file in an editor of choice: vi nvidia-blueprint-vss/charts/vss/values.yaml

  3. Find the config file content by searching for guardrails_config.yaml. You will find it under the configs: section.

  4. Change the Guardrails configurations of interest.

  5. Create the Helm Chart tarball with the updated config: tar -czf nvidia-blueprint-vss-2.4.0.tgz nvidia-blueprint-vss

  6. Deploy the new Helm Chart following instructions at Deploy Using Helm.

For Docker Compose, update the contents of guardrails directory present in the various deployment scenario directories shown in Deploy Using Docker Compose X86.

Custom Container Image with Codecs Installed#

To use proprietary codecs with VSS, build a custom container image that includes the codecs.

Overview of the steps for building the container image:

  1. Create a Dockerfile that installs the proprietary codecs.

    FROM nvcr.io/nvidia/blueprint/vss-engine:2.4.0
    RUN bash /opt/nvidia/via/user_additional_install.sh
    
  2. Build the custom container image.

    docker build -t <custom_image_name> -f Dockerfile .
    
  3. Push the custom container image to a container registry.

    docker push <custom_image_name>
    

For deploying the new image using Helm, follow the steps below:

  1. Create a new image pull secret for the custom container image repository.

    sudo microk8s kubectl create secret docker-registry <secret_name> --docker-server=nvcr.io \
       --docker-username=<username> --docker-password=<password>
    
  2. Update the overrides file to use the custom container image.

    vss:
      applicationSpecs:
        vss-deployment:
          containers:
            vss:
              image:
                repository: <custom_image_name_repo>
                tag: <custom_image_name_tag>
      imagePullSecrets:
      - name: <secret_name>
    
  3. Deploy the Helm chart with the overrides file.

    sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.4.0.tgz \
       --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
    

For deploying the new image using Docker Compose, follow the steps below:

  1. Add VIA_IMAGE=<custom_image_name> to the .env file.

  2. Run docker compose up to start the VSS deployment.

Load Dense Captions from File and Avoid Video Ingestion#

Note

This is an advanced use case and is an experimental feature enabled for local development and testing purposes.

Dense captions can be pre-loaded from a file to skip the video processing part (decode, VLM, ASR) of the data ingestion pipeline.

Refer to Enable Dense Caption for more details on how to enable dense caption generation during deployment.

With the environment variable ENABLE_DENSE_CAPTION set to true, VSS loads the dense caption file from the file system when it is available for the video. This support works only for local video files added using the VSS API /files [POST].

Pre-requisites:

  • File name: The dense caption file must follow this naming pattern: video-file-name.dc.json

  • Location: Must be placed in the same directory as the corresponding video file.

  • Example: If your video is named its.mp4, the dense caption file should be named its.mp4.dc.json and placed alongside the video file.

Some sample videos and their corresponding dense caption files are stored in the VSS container at /opt/nvidia/via/streams/.

File Format:

The file must contain valid JSON data with the dense caption information.

Example JSON Format:

Each row in the file corresponds to a chunk. This example shows a dense caption file with 2 chunks, each containing 10 seconds of video with 10 frames sampled per chunk. A sketch for generating such a file programmatically appears after the field documentation below.

{"vlm_response": "dense caption for chunk_0 with 10 frames", "frame_times": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0], "chunk": {"streamId": "492041cc-9735-4702-bc33-3504d9041351", "chunkIdx": 0, "file": "/opt/nvidia/via/streams/bp_preview/its.mp4", "pts_offset_ns": 0, "start_pts": 10000000000, "end_pts": 20000000000, "start_ntp": "1970-01-01T00:00:10.000Z", "end_ntp": "1970-01-01T00:00:20.000Z", "start_ntp_float": 10.0, "end_ntp_float": 20.0, "is_first": false, "is_last": false}}
{"vlm_response": "dense caption for chunk_1 with 10 frames", "frame_times": [90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0], "chunk": {"streamId": "492041cc-9735-4702-bc33-3504d9041351", "chunkIdx": 1, "file": "/opt/nvidia/via/streams/bp_preview/its.mp4", "pts_offset_ns": 0, "start_pts": 90000000000, "end_pts": 100000000000, "start_ntp": "1970-01-01T00:01:30.000Z", "end_ntp": "1970-01-01T00:01:40.000Z", "start_ntp_float": 90.0, "end_ntp_float": 100.0, "is_first": false, "is_last": false}}

JSON Field Documentation:

  • vlm_response (string): The dense caption text generated by the Vision Language Model (VLM) for the video chunk. Contains descriptive text about what is happening in the video frames. Example: "dense caption for chunk_0 with 10 frames"

  • frame_times (array[float]): Array of timestamps (in seconds) for each frame that was processed by the VLM. Each timestamp represents when a frame was sampled from the video chunk. Example: [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]

  • chunk (object): Metadata object containing information about the video chunk being processed. Its fields are listed below.

Fields of the chunk object:

  • streamId (string): Unique identifier for the video stream being processed. Used to track and associate multiple chunks from the same video. Example: "492041cc-9735-4702-bc33-3504d9041351"

  • chunkIdx (integer): Index number of the current chunk within the video stream. Chunks are numbered sequentially starting from 0. Example: 0, 1, 2, etc.

  • file (string): Full path to the video file being processed within the VSS container. Example: "/opt/nvidia/via/streams/bp_preview/its.mp4"

  • pts_offset_ns (integer): Presentation timestamp offset in nanoseconds. Used for video synchronization. Example: 0

  • start_pts (integer): Start presentation timestamp in nanoseconds for the chunk. Example: 10000000000

  • end_pts (integer): End presentation timestamp in nanoseconds for the chunk. Example: 20000000000

  • start_ntp (string): Start timestamp in Network Time Protocol (NTP) format for the chunk. Example: "1970-01-01T00:00:10.000Z"

  • end_ntp (string): End timestamp in Network Time Protocol (NTP) format for the chunk. Example: "1970-01-01T00:00:20.000Z"

  • start_ntp_float (float): Start timestamp as a float value in seconds for easier processing. Example: 10.0

  • end_ntp_float (float): End timestamp as a float value in seconds for easier processing. Example: 20.0

  • is_first (boolean): Indicates whether this is the first chunk in the video stream. Example: false

  • is_last (boolean): Indicates whether this is the last chunk in the video stream. Example: false
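
As an illustration, here is a minimal sketch that writes a two-chunk dense caption file alongside a video. The video path, captions, stream id, and chunk timings are placeholders; adjust them to match how you want the chunks interpreted.

import json
import uuid

video_path = "/opt/nvidia/via/streams/its.mp4"  # placeholder: path of your video inside the container
chunk_duration_s = 10
stream_id = str(uuid.uuid4())

chunks = []
for idx, start_s in enumerate([10.0, 90.0]):    # placeholder chunk start times in seconds
    end_s = start_s + chunk_duration_s
    chunks.append({
        "vlm_response": f"dense caption for chunk_{idx} with 10 frames",  # placeholder caption
        "frame_times": [start_s + i for i in range(10)],
        "chunk": {
            "streamId": stream_id,
            "chunkIdx": idx,
            "file": video_path,
            "pts_offset_ns": 0,
            "start_pts": int(start_s * 1e9),
            "end_pts": int(end_s * 1e9),
            "start_ntp": f"1970-01-01T00:{int(start_s // 60):02d}:{start_s % 60:06.3f}Z",
            "end_ntp": f"1970-01-01T00:{int(end_s // 60):02d}:{end_s % 60:06.3f}Z",
            "start_ntp_float": start_s,
            "end_ntp_float": end_s,
            "is_first": idx == 0,
            "is_last": idx == 1,
        },
    })

# One JSON object per line, saved as <video-file-name>.dc.json next to the video.
with open(video_path + ".dc.json", "w") as f:
    for chunk in chunks:
        f.write(json.dumps(chunk) + "\n")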

VSS Source Code#

VSS source code is available at NVIDIA-AI-Blueprints/video-search-and-summarization.

You can follow the steps in the source code README to modify the source code and build a new container image.

Then follow the steps in deploying custom image to use the new image in your deployment.