CV Pipeline (Alpha)#

The CV pipeline in VSS is responsible for generating CV metadata for videos and live streams. CV metadata consists of detailed information about the objects present in the video, such as object position, mask, and tracking ID. The CV pipeline generates this metadata using an object detector and a tracker. The tracker also generates and tracks object masks along with the object positions.
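For illustration, a single CV metadata record for one detected object in one frame might look like the sketch below. The field names and structure are hypothetical and only reflect the kinds of information described above (position, mask, tracking ID); they are not the exact schema produced by VSS.

```python
# Hypothetical example of a per-object CV metadata record for one frame.
# Field names and values are illustrative only, not the actual VSS schema.
cv_metadata_record = {
    "frame_timestamp": 12.4,        # seconds from the start of the video
    "object_id": 20,                # tracking ID assigned by the tracker
    "class": "car",                 # label matched by the zero-shot detector
    "confidence": 0.62,             # detection confidence score
    "bbox": [412, 188, 530, 271],   # object position as [x1, y1, x2, y2] pixels
    "mask_rle": "kTj31c0",          # segmentation mask (e.g., run-length encoded)
}
```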

The CV metadata generated by the CV pipeline is utilized to improve the accuracy of Video Search and Summarization in two ways:

  1. The CV metadata is used by the data processing pipeline to generate inputs for the VLM with overlaid object IDs, masks, etc. This helps improve the accuracy of the VLM and enables Set-of-Marks prompting.

  2. The CV metadata is attached to the VLM-generated dense captions and passed to the retrieval pipeline for further processing and indexing.


CV Pipeline Architecture

Please refer to Enabling CV Pipeline: Set-Of-Marks (SOM) & Metadata with Helm / Docker Compose for instructions on enabling the CV pipeline during VSS initialization. Once initialized, users can choose to enable or disable the CV pipeline for individual summarization requests through the UI (see UI Application).

The CV pipeline also supports customization. Please refer to CV Pipeline Customization for instructions.

CV pipeline processing#

  • Video files are processed as chunks, where each chunk represents some portion of the video file. The optimal chunk size is decided based on the number of GPU resources available and the duration of the video. Processing of individual chunks is distributed across the GPUs in parallel for better performance.

  • Each chunk is processed by the CV pipeline, which involves object detection by a zero-shot object detector such as Grounding DINO, followed by tracking of the detected objects using the Mask Tracker.

    • The zero-shot object detector identifies objects based on text prompts. You can specify multiple object classes to detect by separating them with dots, for example: vehicle . truck. Additionally, you can specify a detection confidence score threshold in the prompt after a semicolon, for example: vehicle . truck;0.5. Note that there cannot be any space before or after the semicolon (see the prompt-format sketch after this list).

    • The Mask Tracker tracks the objects and also generates and tracks their masks.

  • Once the processing of chunks is complete and CV metadata is generated for all the chunks, the metadata from all the chunks is fused together using the metadata fusion module to produce the CV metadata for the complete input video (a simplified fusion sketch follows this list).

  • The fused CV metadata is then passed to the data processing pipeline, which generates the CV-metadata-overlaid input frames for the VLM (see the overlay sketch after this list).

  • The fused CV metadata is also combined with the VLM dense captions and passed to the retrieval pipeline for further processing and indexing. Here’s an example of a video summarization and Q&A response using the CV metadata:

    Emergency Response and Traffic Blockage (90.0-109.5 seconds)
    
    A red car (20) and a yellow car (21) are stopped in the intersection, appearing to have been involved in a collision.
    A police car (22) approaches the intersection and stops near the red and yellow cars, responding to the incident with flashing lights.
    A collision occurs at the intersection involving three vehicles: the red car (20), the yellow car (21), and the police car (22).
    The vehicles remain stationary and in contact with each other throughout the sequence.
    
    Q&A Example with CV Metadata
  • In the case of live streams, since the processing is real time, the CV metadata is generated in the data processing pipeline itself instead of a separate CV pipeline.
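The snippet below is a minimal sketch of the detector prompt format described above. The `build_detector_prompt` helper is hypothetical (not a VSS API); it only illustrates the documented rules: object classes separated by dots and an optional confidence threshold after a semicolon with no surrounding spaces.

```python
def build_detector_prompt(classes, threshold=None):
    """Hypothetical helper that assembles a text prompt for the zero-shot
    detector following the format documented above (not a VSS API)."""
    # Object classes are separated by " . "
    prompt = " . ".join(classes)
    if threshold is not None:
        # The threshold is appended after a semicolon, with no space before or after it
        prompt += f";{threshold}"
    return prompt

# Example: detect vehicles and trucks with a 0.5 confidence threshold
print(build_detector_prompt(["vehicle", "truck"], threshold=0.5))
# -> "vehicle . truck;0.5"
```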
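As a rough illustration of the metadata fusion step, the sketch below concatenates hypothetical per-chunk metadata in chunk order and remaps chunk-local tracking IDs to globally unique IDs. This is an assumption about the general idea only; the actual fusion module may use a more sophisticated strategy (for example, re-associating tracks across chunk boundaries).

```python
def fuse_chunk_metadata(chunks):
    """Naive fusion sketch: merge per-chunk object records into one list for
    the whole video with globally unique tracking IDs. Illustrative only."""
    fused = []
    next_global_id = 0
    for chunk in sorted(chunks, key=lambda c: c["start_time"]):
        id_map = {}  # chunk-local tracking ID -> global tracking ID
        for record in chunk["objects"]:
            local_id = record["object_id"]
            if local_id not in id_map:
                id_map[local_id] = next_global_id
                next_global_id += 1
            fused.append({**record, "object_id": id_map[local_id]})
    return fused
```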
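Finally, here is a minimal sketch of what overlaying CV metadata on an input frame could look like, using OpenCV to draw the bounding box and object ID in a Set-of-Marks style. The record fields follow the hypothetical metadata example earlier in this section; the actual overlay produced by the data processing pipeline may differ.

```python
import cv2  # assumes opencv-python is installed

def overlay_metadata(frame, records):
    """Draw bounding boxes and object IDs from CV metadata records onto a frame.
    Illustrative sketch only, not the actual VSS overlay implementation."""
    for rec in records:
        x1, y1, x2, y2 = rec["bbox"]
        cv2.rectangle(frame, (x1, y1), (x2, y2), color=(0, 255, 0), thickness=2)
        cv2.putText(frame, str(rec["object_id"]), (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```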

Note

The CV pipeline is currently supported for video files and live streams only. Images are not supported.

Note

The object detector (Grounding DINO) uses the Swin-Tiny backbone by default, which prioritizes speed over accuracy. It is expected that you may see inaccurate detection results, especially for uncommon object classes like “forklift”.

To improve the accuracy, you can try the following:

  1. Adjust the detection confidence threshold in your prompt. For example, if you see false positives, you can increase the threshold (e.g., vehicle . truck;0.5). Note that the default threshold is 0.3.

  2. Switch to a more accurate backbone like SwinB (see CV Pipeline Customization for instructions).