VLM Video Summarization#
Overview#
Video recording systems collect vast amounts of data in which events of interest are often sparse. An event may persist for an extended period, yet its duration alone does not determine how much attention it deserves in a summary. These characteristics make efficient yet effective video summarization an important part of the overall usefulness of an AI-based video system. Generative AI provides an accurate, generalizable technique for video summarization, driven by natural language interfaces, that the industry has actively investigated. The video summarization microservice addresses these functional and design requirements and can be used out of the box. Its design and functionality are modelled after NVIDIA's Video Search and Summarization (VSS) Agent Blueprint, released for Tesla GPUs.
Using the video summarization service is a two-step process that is API-compatible with the VSS Blueprint: the user first uploads a file through the files API, which returns a handle, and then launches summarization by invoking the summarize API.
The video summarization microservice is based on the NanoLLM framework for the NVIDIA Jetson platform and employs a few key strategies:
Event Embeddings and Clustering: For detected events, frame embeddings are generated with SigLIP and grouped by a DBSCAN-based clustering algorithm, which clusters similar events to support more sophisticated detection.
Video Content Retrieval: Embeddings are stored in a CUDA-accelerated vector database, enabling video content retrieval so users can search for specific events with natural language queries.
Video Summarization: Video summaries are created with the VILA1.5 model, which takes multiple frames as input and generates captions. These captions are then combined with GPT-4 to produce concise, representative text summaries of events.
The system runs efficiently at the edge on continuous recordings thanks to several optimizations:
Quantization Techniques: INT4 MLC and AWQ-quantized VLMs accelerate inference.
KV Cache and CUDA-accelerated Optimization: A KV cache optimizes the LLM token prefill process, enabling efficient VLM input processing.
Faiss_lite: CUDA-accelerated vector indexing for fast retrieval.
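The clustering and retrieval strategies above can be sketched in Python. This is a minimal illustration, not the microservice's implementation: the embedding dimension, the synthetic embeddings, the DBSCAN parameters, and the plain cosine-similarity search are all assumptions standing in for SigLIP embeddings and the CUDA-accelerated Faiss_lite index.

```python
# Sketch: grouping frame embeddings into events with DBSCAN, then retrieving
# the best-matching frame for a query embedding by cosine similarity.
# All values here are synthetic; a real SigLIP embedding is e.g. 768-d.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
DIM = 8  # illustrative; much smaller than a real embedding

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Fake frame embeddings: two tight event clusters plus one stray frame.
event_a = normalize(rng.normal(loc=1.0, scale=0.05, size=(5, DIM)))
event_b = normalize(rng.normal(loc=-1.0, scale=0.05, size=(4, DIM)))
stray = normalize(rng.normal(size=(1, DIM)))
frames = np.vstack([event_a, event_b, stray])

# Group similar frames into events. Cosine distance pairs naturally with
# normalized embeddings; eps and min_samples are illustrative values.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(frames)
print("cluster labels:", labels)

# Retrieval: a text query embedded by the same encoder (faked here) is
# scored against all stored frame embeddings by cosine similarity.
query = normalize(rng.normal(loc=1.0, scale=0.05, size=(DIM,)))
scores = frames @ query
best = int(np.argmax(scores))
print("best-matching frame index:", best)
```

In the actual service the index lives in a vector database so that queries scale to long continuous recordings, but the scoring idea is the same.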
Running Video Summarization#
Start the Redis platform service
$ sudo systemctl start jetson-redis
Create the config file
Create a JSON config file with the following contents. openai_api_key should be updated to your OpenAI API key; this key is necessary for the summarization service to work properly.

{
    "api_server_port": 19000,
    "redis_host": "localhost",
    "redis_port": 6379,
    "redis_stream": "test",
    "redis_output_interval": 60,
    "log_level": "INFO",
    "jetson_ip": "localhost",
    "video_port": 81,
    "max_images": 8,
    "ingress_port": 30080,
    "streamer_port": 31000,
    "segment_time": 20,
    "openai_api_key": "key"
}
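If you prefer not to hard-code the API key, the config file can be generated from Python, pulling the key from the environment. This is an optional convenience sketch; the OPENAI_API_KEY environment variable and the output file name are assumptions, while the keys and values mirror the example config above.

```python
# Sketch: write the summarization config from Python, injecting the OpenAI
# key from an (assumed) OPENAI_API_KEY environment variable instead of
# committing it to the file by hand.
import json
import os

config = {
    "api_server_port": 19000,
    "redis_host": "localhost",
    "redis_port": 6379,
    "redis_stream": "test",
    "redis_output_interval": 60,
    "log_level": "INFO",
    "jetson_ip": "localhost",
    "video_port": 81,
    "max_images": 8,
    "ingress_port": 30080,
    "streamer_port": 31000,
    "segment_time": 20,
    # falls back to the placeholder from the example above
    "openai_api_key": os.environ.get("OPENAI_API_KEY", "key"),
}

with open("main_config.json", "w") as f:
    json.dump(config, f, indent=4)
```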
Pull and start the summarization container
$ sudo docker pull nvcr.io/nvidia/jps/vlm_summarization:2.0.9
$ sudo docker run -itd --runtime nvidia --network host \
    -v ./config/main_config.json:/configs/main_config.json \
    -v /data/vsm-videos/:/data/videos/ \
    -v /data/vsm-cache/:/root/.cache/huggingface/ \
    -e CONFIG_PATH="/configs/main_config.json" \
    nvcr.io/nvidia/jps/vlm_summarization:2.0.9
Note that the docker run command may need to be modified to match the name and location of the config file created in the previous step. Similarly, the /data/videos/ and /data/vsm-cache/ mount locations may need to be changed depending on where input videos are stored on your system and where you'd like the cache stored.

View the container logs to determine when the service is fully up. The service is ready once you see output similar to the following; this may take some time, as model files need to be pulled first.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:19000 (Press CTRL+C to quit)
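Rather than tailing container logs, you can poll the API port until it responds. The sketch below is an assumption-laden convenience, not part of the service: the probed URL reuses the /summarize endpoint and port 19000 from the config above, and the probe is injected so the wait loop can be tested or swapped for any other health check.

```python
# Sketch: wait for the summarization API to come up. A connection refusal
# means the server is still starting; any HTTP response (even an error
# status) means the server process is accepting requests.
import time
import urllib.error
import urllib.request

def http_probe(url="http://localhost:19000/summarize"):
    """Return True once the API server answers at all."""
    try:
        urllib.request.urlopen(url, timeout=2)
        return True
    except urllib.error.HTTPError:
        return True   # server is up, even if it rejects this request
    except (urllib.error.URLError, OSError):
        return False  # connection refused: still starting

def wait_until_ready(probe=http_probe, timeout=600.0, interval=5.0):
    """Poll until probe() succeeds or timeout (seconds) elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

The generous default timeout accounts for the model downloads that happen on first start.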
Upload Video Files
Upload video files using the following curl command:
$ curl -X POST "http://localhost:19000/files?filepath=/data/videos/demo.mp4"
The filepath parameter refers to the file's location inside the container (i.e. under the /data/videos/ mount). The API returns a handle (ID) for the uploaded file, which is used in the next step.
Request Summarization
Request a summarization of the video using:
$ curl -X POST "http://localhost:19000/summarize" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"stream": false, "id": "01fc1241-27a2-4f91-a771-bcb65d0846ba", "model": "vila-1.5", "chunk_duration": 20}'
Change the id value to the ID returned by the file upload step. This process can take some time and will vary greatly with the length of the input video. The model can be either vila-1.5 or gpt-4o.

Fetch Summarization Results
Retrieve the summarization results with:
$ curl -X GET "http://localhost:19000/summarize?stream=false&id=01fc1241-27a2-4f91-a771-bcb65d0846ba"
Change the id value to the ID returned by the file upload step. This request will return an error until the video has been fully processed.
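The whole upload, summarize, and fetch flow above can be scripted. The following standard-library sketch mirrors the curl commands shown earlier; the response field name for the returned ID, the polling interval, and the demo file path are assumptions and may need adjusting against the actual API responses.

```python
# Sketch of the full client flow against the summarization microservice:
# upload a file, request a summary, then poll until the result is ready.
# Endpoints and payload fields mirror the curl examples in this document.
import json
import time
import urllib.error
import urllib.parse
import urllib.request

BASE = "http://localhost:19000"

def upload_url(filepath):
    """POST target for the files API; filepath is the path inside the container."""
    return BASE + "/files?" + urllib.parse.urlencode({"filepath": filepath})

def summarize_payload(file_id, model="vila-1.5", chunk_duration=20):
    """JSON body for the summarize API, matching the curl example."""
    return json.dumps({
        "stream": False,
        "id": file_id,
        "model": model,
        "chunk_duration": chunk_duration,
    }).encode()

def post(url, body=None):
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def fetch_summary(file_id, poll_seconds=30):
    """GET /summarize, retrying while the video is still being processed
    (the endpoint errors until processing is complete)."""
    url = BASE + "/summarize?" + urllib.parse.urlencode(
        {"stream": "false", "id": file_id})
    while True:
        try:
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
        except urllib.error.HTTPError:
            time.sleep(poll_seconds)

# Example usage (requires the service to be running):
#   handle = post(upload_url("/data/videos/demo.mp4"))  # returns the file handle
#   file_id = handle["id"]                              # field name assumed
#   post(BASE + "/summarize", summarize_payload(file_id))
#   print(fetch_summary(file_id))
```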