# Release Notes

These Release Notes describe the key features, software enhancements and improvements, and known issues for the VSS release product package.
## VSS 2.3.0
These are the VSS 2.3.0 Release Notes.
### Key Features and Enhancements
- Support for audio in summarization and Q&A.
- Support for preprocessing a video to generate Set-of-Marks (SOM) prompting and additional CV metadata for better accuracy.
- Multi-stream support for Q&A.
- Gradio UI improvements.
- Additional runtime parameters that can be configured through the `/summarize` API (see the example after this list):
  - `summarize_top_p`, `summarize_temperature`, `summarize_max_tokens`: LLM sampling parameters for summarization.
  - `chat_top_p`, `chat_temperature`, `chat_max_tokens`: LLM sampling parameters for Q&A.
  - `notification_top_p`, `notification_temperature`, `notification_max_tokens`: LLM sampling parameters for alerts/event detection.

  More info here: API Documentation.
- New API `/alerts/recent` to get recent alerts for all live streams (see the example after this list).
- Stability improvements.
- Single GPU deployment.
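For illustration, the sketch below exercises the new sampling parameters and the `/alerts/recent` endpoint with `curl`. Only the parameter names and endpoint paths come from this release; the host/port, the `id` field, the request body shape, and the sample values are assumptions for this sketch and should be verified against the API Documentation.

```bash
# Hypothetical host/port and request body -- verify against the API Documentation.
# Summarization request overriding the new LLM sampling parameters.
curl -s http://localhost:8100/summarize \
  -H "Content-Type: application/json" \
  -d '{
        "id": "<asset-id>",
        "summarize_top_p": 0.7,
        "summarize_temperature": 0.2,
        "summarize_max_tokens": 512,
        "chat_top_p": 0.7,
        "chat_temperature": 0.2,
        "chat_max_tokens": 256,
        "notification_top_p": 0.7,
        "notification_temperature": 0.2,
        "notification_max_tokens": 256
      }'

# Fetch recent alerts for all live streams through the new endpoint.
curl -s http://localhost:8100/alerts/recent
```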
## VSS 2.2.0

These are the VSS 2.2.0 Release Notes. This release is an engineering release that introduces some new features. It also includes several fixes and additional changes over previous VSS releases.
### Key Features and Enhancements
- Enhanced multi-stream / concurrent mode support.
- GraphRAG performance improvements.
- Support for the NVILA research model. More info here: Configuring for NVILA model.
- Additional runtime parameters that can be configured through the `/summarize` API (see the example after this list):
  - `vlm_input_width`, `vlm_input_height`: Configure the input resolution of the frames passed to the VLM.
  - `num_frames_per_chunk`: Configure the number of frames to sample from each chunk.
  - `summarize_batch_size`: LLM batch size for summarization.
  - `rag_type`: Choose between "graph-rag" and "vector-rag".
  - `rag_top_k`: Number of top rerank results to use during Q&A.
  - `rag_batch_size`: Number of VLM captions to be batched together when creating the graph.

  More info here: API Documentation.
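As above, a hedged `curl` sketch of how these parameters might be passed to the `/summarize` API. Only the parameter names are taken from this release; the host/port, the `id` field, the body shape, and the sample values are assumptions; see the API Documentation for the authoritative schema.

```bash
# Hypothetical host/port, body shape, and values -- verify against the API Documentation.
curl -s http://localhost:8100/summarize \
  -H "Content-Type: application/json" \
  -d '{
        "id": "<asset-id>",
        "vlm_input_width": 448,
        "vlm_input_height": 448,
        "num_frames_per_chunk": 10,
        "summarize_batch_size": 5,
        "rag_type": "vector-rag",
        "rag_top_k": 5,
        "rag_batch_size": 1
      }'
```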
### Compatibility

The TensorRT version in the VSS container has been upgraded, requiring new TensorRT engines to be built for the VILA-1.5 model. Make sure to remove any stale TensorRT engines for VILA-1.5.

For Helm deployments, this can be done by deleting the model cache PVC:

```bash
sudo microk8s kubectl delete pvc vss-ngc-model-cache-pvc
```
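To confirm the stale cache is actually gone before redeploying (assuming the same microk8s-based Helm deployment as above), the remaining PVCs can be listed:

```bash
# vss-ngc-model-cache-pvc should no longer appear in this list;
# it should be re-created, and fresh engines built, on the next deployment.
sudo microk8s kubectl get pvc
```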
### Known Issues
Multi-session Q&A is not currently supported. Users should try chat only on a single file or live stream at a time. Trying chat on multiple files and/or live streams may lead to incorrect replies. This does not affect summarization.
The Gradio UI sometimes becomes unresponsive. This can manifest in ways such as a live stream not being deleted even after clicking the Delete Live Stream button. The VSS REST API can be used as an alternative for live stream deletion in this case, as sketched below.
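A minimal sketch of deleting a live stream through the REST API; the port and the `/live-stream` endpoint paths are assumptions here and should be confirmed against the API Documentation:

```bash
# Endpoint paths and port are assumptions -- check the API Documentation.
# List live streams to find the ID of the stuck stream, then delete it.
curl -s http://localhost:8100/live-stream
curl -s -X DELETE http://localhost:8100/live-stream/<stream-id>
```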
Sometimes, deleting a live stream does not work.
The current VSS release also supports the NVILA research model as the VLM. However, optimizations for the model are under development; currently, only FP16 precision is supported.
Models are trained on specific data/use cases, so testing them on other inputs might give incorrect results.
VLM model accuracy: Timestamps returned are sometimes inaccurate, and the model can hallucinate on certain questions. Prompt tuning might be required.
Summarization accuracy: Summarization accuracy is heavily dependent on VLM accuracy. Also, the default configs have been tuned for the warehouse use case. Users can supply custom VLM and summarization prompts to the `/summarize` API.

The following harmless warning might be seen during VSS application execution and can be safely ignored:
```
GLib (gthread-posix.c): Unexpected error from C library during 'pthread_setspecific': Invalid argument. Aborting
```
Due to a browser limitation, loading multiple Gradio sessions in the same browser may cause the sessions to get stuck or appear slow.
Guardrails might not reject some prompts that are expected to be rejected. This could be because the prompt is relevant in other contexts, or because topics in the prompt are not configured to be rejected. You can try tuning the guardrails configuration if required.
OpenAI connection errors or 429 (too many requests) errors might sometimes be seen if too many requests are sent to the GPT-4V or GPT-4o VLMs. This can be due to low TPM/RPM limits associated with the OpenAI account.
CA-RAG summarization might return a truncated summary response. This is due to the max_tokens setting; try increasing it in the CA-RAG config file.
Helm deployment: The VSS deployment pod may fail with the error `LLM call Exception: llm-nim-svc`. In spite of having an init container wait for the LLM pod to come up, the VSS deployment can error out for an unknown reason, like below:

```
2024-11-27 17:51:44,763 ERROR Failed to load VIA stream handler - LLM Call Exception: HTTPConnectionPool(host='llm-nim-svc', port=8000): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c9d0ad6c0>: Failed to establish a new connection: [Errno 111] Connection refused'))
```

If this happens, please wait an additional few minutes; a pod restart fixes the issue.
Users can monitor this using:

```bash
sudo watch microk8s kubectl get pod
```
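If the pod does not recover on its own, it can be restarted manually by deleting it (Kubernetes re-creates pods managed by a Deployment); the pod name below is a placeholder to be read from the `get pod` output:

```bash
# Deleting the pod triggers a restart; copy the actual name from `kubectl get pod`.
sudo microk8s kubectl delete pod <vss-deployment-pod-name>
```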
The Gradio UI might be slow to load thumbnails and video previews for longer videos. This becomes especially noticeable over slower network connections.
Deleting RTSP streams can sometimes hang. This is because rtspsrc indefinitely retries TCP transport after a UDP timeout when the timeout property is set; once the TCP link is established, pipeline teardown hangs. This is a GStreamer issue: https://gitlab.freedesktop.org/gstreamer/gstreamer/-/issues/1570. A workaround is to export VSS_RTSP_TIMEOUT=0, which disables TCP transport after UDP timeout, as shown below. However, this could cause streaming to not work at all when the network is poor.
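For a Docker-based deployment, the workaround is just an environment variable set before starting VSS (for Helm, it would instead be added to the pod environment; the exact mechanism depends on the deployment):

```bash
# Disable retrying with TCP transport after a UDP timeout (see the GStreamer issue above).
# Note: with a poor network connection, streaming may then fail entirely.
export VSS_RTSP_TIMEOUT=0
```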