Typical Usage#
This section describes a typical use of VSS.
Deployment#
Deployment starts with setting up the environment and then deploying VSS using either the Helm Chart or the Docker Compose method.
VSS provides many options to customize the deployment to the user’s needs, such as toggling features like audio, the CV pipeline, and Guardrails on and off, switching between the various VLM and LLM models, and configuring the deployment topology based on the available system resources.
For more details, refer to:
File / Live Stream Ingestion & Summarization & Alerts#
Once VSS is deployed, you can start summarizing files and live streams. This is the first operation that must be performed for any source, since it enables the ingestion of the source into VSS.
This can be done using the Gradio UI, the reference Python CLI, or the REST API programmatically.
First, add a file or live stream to VSS:
For the Gradio UI, select a preloaded example, upload a file, or enter a live stream URL
For the reference CLI, use the Add File or Add Live Stream commands
For the REST API, use the /files [POST] or /live-stream [POST] endpoints (a minimal example follows the list of supported formats below)
Supported formats include:
H.264 / H.265 video codecs
OPUS / AAC audio codecs
mp4 / mkv container formats
Certain formats require installing additional codecs.
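Below is a minimal sketch of adding a local file through the REST API using Python. The base URL, multipart field names, and response fields are assumptions made for illustration; refer to the VSS API reference for the exact request schema.

```python
# Minimal sketch: add a local video file to VSS via the /files [POST] endpoint.
# VSS_URL, the multipart field names, and the response layout are assumptions.
import requests

VSS_URL = "http://localhost:8100"  # assumed VSS REST API endpoint

with open("warehouse.mp4", "rb") as f:
    resp = requests.post(
        f"{VSS_URL}/files",
        files={"file": ("warehouse.mp4", f)},  # video file to ingest
        data={"purpose": "vision"},            # assumed form field
    )
resp.raise_for_status()
file_id = resp.json()["id"]  # assumed response field; used by later calls
print("Added file:", file_id)

# Live streams are added similarly via the /live-stream [POST] endpoint with
# the RTSP URL of the stream (field names depend on the API schema).
```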
Next, summarize the source. The summarize API is the most important API, since it configures how the source is ingested; its various parameters can be configured as described below.
When the summarize API is called, VLM captions will be generated for chunks of the input source as configured using the chunk_duration (CHUNK SIZE in UI) and chunk_overlap_duration parameters.
A higher chunk size will result in faster ingestion since fewer chunks need to be processed, while a smaller chunk size may result in more accurate captions as well as better detection of fast events.
The VLM’s prompt (PROMPT in UI) must be tuned for the use case to ensure accurate captions. Other VLM parameters like temperature, top_p, max_tokens, top_k, vlm_input_width, vlm_input_height, and num_frames_per_chunk can be tuned for the accuracy / speed trade-off.
Optionally, the CV pipeline (Set-of-Marks prompting for the VLM) and audio transcription can be enabled. These require additional models to be configured and increase the compute requirements and ingestion latency. They can be enabled using:
For the Gradio UI, the Enable Audio checkbox, Enable CV checkbox and the CV Pipeline Prompt input
For the reference CLI, configure as part of Summarization Command
For the REST API, configure as part of the /summarize [POST] endpoint (see the request sketch below)
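A minimal request sketch for this step is shown below, assuming that the parameters described above map directly to JSON fields of the /summarize [POST] body and that VSS is reachable locally; the exact schema, field names, and defaults may differ.

```python
# Minimal sketch: start summarization of a previously added file via
# /summarize [POST]. Values are illustrative; field names follow the
# parameters described above, but the exact request schema is an assumption.
import requests

VSS_URL = "http://localhost:8100"            # assumed VSS REST API endpoint
file_id = "<id returned by /files [POST]>"

summarize_request = {
    "id": file_id,                           # source to ingest and summarize
    "prompt": "Describe forklift activity and any safety violations.",  # VLM prompt
    # Chunking: larger chunks ingest faster, smaller chunks catch fast events.
    "chunk_duration": 60,                    # seconds (CHUNK SIZE in UI)
    "chunk_overlap_duration": 10,            # seconds
    # VLM parameters for the accuracy / speed trade-off.
    "temperature": 0.4,
    "top_p": 1.0,
    "top_k": 100,
    "max_tokens": 512,
    "num_frames_per_chunk": 10,
    "vlm_input_width": 704,                  # illustrative values
    "vlm_input_height": 448,
    # Optional features (increase compute requirements and ingestion latency).
    "enable_audio": False,                   # audio transcription
    "enable_cv_metadata": False,             # CV pipeline (field name assumed)
    "cv_pipeline_prompt": "person . forklift;0.5",  # illustrative only
}

resp = requests.post(f"{VSS_URL}/summarize", json=summarize_request)
resp.raise_for_status()
print(resp.json())
```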
As soon as the ingestion pipeline generates captions for a chunk, the captions are passed to the retrieval pipeline along with each chunk’s metadata for indexing and summarization.
If alerts are enabled, the retrieval pipeline will detect any configured events with the help of the configured LLM. This results in higher load on the LLM. Alerts can be configured via:
For the Gradio UI, go to the Alerts tab
For the reference CLI, configure as part of Summarization Command or use Add Live Stream Alert (Live Streams only)
For the REST API, configure as part of the /summarize [POST] endpoint or use the /alerts [POST] endpoint (Live Streams only); a sketch follows this list
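For live streams, an alert can be registered through the /alerts [POST] endpoint as sketched below. The field names (name, events, liveStreamId) are assumptions and should be checked against the API reference.

```python
# Minimal sketch: register an alert for a live stream via /alerts [POST].
# Field names are assumptions; the configured LLM evaluates the chunk
# captions against the listed events.
import requests

VSS_URL = "http://localhost:8100"             # assumed VSS REST API endpoint
stream_id = "<id returned by /live-stream [POST]>"

alert_request = {
    "name": "fire-alert",
    "events": ["fire", "smoke"],              # events to watch for
    "liveStreamId": stream_id,                # field name assumed
}

resp = requests.post(f"{VSS_URL}/alerts", json=alert_request)
resp.raise_for_status()
print("Alert registered:", resp.json())
```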
If chat is enabled, the retrieval pipeline, with the help of the configured LLM and embedding models, will ingest the VLM captions into the vector DB and/or the graph DB for Q&A. Various Q&A related parameters such as enable_chat, rag_type (graph-rag or vector-rag), rag_batch_size, rag_top_k, chat_max_tokens, chat_temperature, and chat_top_p can be configured.
Using graph-rag (the default) will result in higher Q&A accuracy at the cost of higher LLM load and higher latency.
Various chat related parameters can be configured using:
For the Gradio UI, the Enable Chat and Enable Chat History checkboxes in addition to the various parameters in the Parameters dialog.
For the reference CLI, use the various chat related arguments of the Summarization Command
For the REST API, use the various chat related parameters with the /summarize [POST] endpoint, as illustrated below
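As a rough illustration, the chat related fields below could be included in the /summarize [POST] request body; the values are examples only and the exact schema is an assumption.

```python
# Chat / Q&A related fields to include in the /summarize [POST] request body.
# Values are illustrative; field names follow the parameters listed above.
chat_params = {
    "enable_chat": True,
    "rag_type": "graph-rag",     # or "vector-rag"; graph-rag is the default
    "rag_batch_size": 1,
    "rag_top_k": 5,
    "chat_max_tokens": 512,
    "chat_temperature": 0.2,
    "chat_top_p": 0.7,
}
```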
Summarization parameters can also be configured. These include summary_duration (SUMMARY DURATION in UI), the LLM summarization prompts caption_summarization_prompt and summary_aggregation_prompt, as well as the LLM sampling parameters summarize_batch_size, summarize_max_tokens, summarize_temperature, and summarize_top_p. Refer to Tuning Prompts for more details on the prompts.
For files, the entire file is summarized at once. For live streams, summaries are generated every summary_duration seconds.
Various summarization related parameters can be configured using:
For the Gradio UI, the Parameters dialog and the SUMMARY DURATION input (Live Streams only)
For the reference CLI, use the various summarization related arguments of the Summarization Command
For the REST API, use the various summarization related parameters with the /summarize [POST] endpoint (see the sketch below)
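The sketch below illustrates the summarization related fields of the /summarize [POST] request body; the prompts and values are placeholders and the exact schema is an assumption (see Tuning Prompts for guidance on writing the prompts).

```python
# Summarization related fields to include in the /summarize [POST] request
# body. Prompts and values are placeholders; field names follow the
# parameters listed above.
summarization_params = {
    "caption_summarization_prompt": "Condense the chunk captions into concise bullet points.",
    "summary_aggregation_prompt": "Aggregate the bullet points into a final report.",
    "summarize_batch_size": 5,
    "summarize_max_tokens": 512,
    "summarize_temperature": 0.2,
    "summarize_top_p": 0.7,
    # Live streams only: generate a summary every 300 seconds (SUMMARY DURATION in UI).
    "summary_duration": 300,
}
```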
Once all the parameters are configured, the summarize API can be called to start the summarization process.
For the Gradio UI, click on the Summarize button
For the reference CLI, use the Summarization Command
For the REST API, use /summarize [POST] endpoint
Chat - Q & A#
If chat is enabled as part of summarization, Q&A can be performed once file summarization is complete or, for live streams, once at least one summary has been generated. For live streams, information about newer data from the stream is added to the graph / vector DB as summaries are generated.
Q&A can be performed using:
For the Gradio UI, the Chat tab
For the reference CLI, use the Chat Command
For the REST API, use the /chat/completions [POST] endpoint, as sketched below
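A minimal Q&A sketch using the REST API is shown below, assuming an OpenAI-style request and response layout and an id field that selects the ingested source; the exact schema may differ.

```python
# Minimal sketch: ask a question about an ingested source via
# /chat/completions [POST]. The "id" field and the OpenAI-style response
# parsing are assumptions.
import requests

VSS_URL = "http://localhost:8100"             # assumed VSS REST API endpoint

chat_request = {
    "id": "<file or live stream id>",         # source to ask about
    "messages": [
        {"role": "user", "content": "Were any safety violations observed?"}
    ],
}

resp = requests.post(f"{VSS_URL}/chat/completions", json=chat_request)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```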
Clean up#
Once a file or live stream is no longer needed, it can be deleted using:
For the Gradio UI, click on the Delete button
For the reference CLI, use Delete File or Delete Live Stream command
For the REST API, use the /files/{file_id} [DELETE] or /live-stream/{stream_id} [DELETE] endpoint (example below)
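For example, a file can be deleted through the REST API as sketched below (base URL assumed).

```python
# Minimal sketch: delete a file once it is no longer needed.
import requests

VSS_URL = "http://localhost:8100"        # assumed VSS REST API endpoint
file_id = "<id of a previously added file>"

resp = requests.delete(f"{VSS_URL}/files/{file_id}")
resp.raise_for_status()

# For live streams, use the /live-stream/{stream_id} [DELETE] endpoint instead.
```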
It is recommended to delete the file / live stream once it is no longer needed to free up resources and prevent the graph DB from growing too large.
Multiple Streams / Concurrent Requests#
VSS supports multiple streams and concurrent requests. Use cases include processing multiple short video files or analyzing and processing multiple live camera feeds.
Clients can call any of the APIs including Summarization (POST /summarize) and Q&A (POST /chat/completions) in parallel for different files / live streams from different threads or processes. Clients do not have to worry about queuing and synchronization since the VSS backend takes care of queuing and scheduling the requests. VSS, with the help of CA-RAG, is also responsible for maintaining a separate context for each source.
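As a rough sketch, a client could issue summarization requests for several already-added files in parallel; the request body and base URL are assumptions carried over from the earlier sketches.

```python
# Minimal sketch: summarize several already-added files in parallel. VSS
# queues and schedules the requests internally, so the client only needs to
# issue them concurrently.
from concurrent.futures import ThreadPoolExecutor
import requests

VSS_URL = "http://localhost:8100"        # assumed VSS REST API endpoint
file_ids = ["<file id 1>", "<file id 2>", "<file id 3>"]

def summarize(file_id: str) -> dict:
    resp = requests.post(
        f"{VSS_URL}/summarize",
        json={"id": file_id, "prompt": "Describe notable events.", "chunk_duration": 60},
    )
    resp.raise_for_status()
    return resp.json()

with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(summarize, file_ids):
        print(result)
```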
For more details, refer to: