Context-Aware RAG#

[Figure: Context-Aware RAG architecture]

The VSS Context-Aware RAG is responsible for video search and summarization based on the dense captions generated by the data processing pipeline. It implements the following data pipelines:

  • Data Ingestion:

    • Responsible for receiving and processing incoming documents.

    • Uses the context manager to add documents, process them via batching, and prepare them for subsequent retrieval.

  • Data Retrieval:

    • Focuses on extracting the relevant context in response to user queries.

    • Leverages Graph-RAG and Vector-RAG functions to deliver precise, context-aware answers.

Context-Aware RAG starts its own process and event loop so as not to block the main process and to increase performance.
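
A minimal sketch of this pattern is shown below, assuming a hypothetical request queue and handler; none of these names are part of the VSS API.

import asyncio
import multiprocessing as mp

def rag_worker(request_queue):
    """Entry point of the separate Context-Aware RAG process (illustrative only)."""

    async def serve():
        loop = asyncio.get_running_loop()
        pending = set()
        while True:
            # Blocking queue reads run in a thread so the event loop stays responsive.
            request = await loop.run_in_executor(None, request_queue.get)
            if request is None:            # sentinel used here to stop the worker
                break
            # Schedule ingestion or retrieval work without blocking other requests.
            pending.add(asyncio.create_task(handle_request(request)))
        await asyncio.gather(*pending)

    asyncio.run(serve())

async def handle_request(request):
    await asyncio.sleep(0)                 # placeholder for ingestion / QnA work

if __name__ == "__main__":
    queue = mp.Queue()
    proc = mp.Process(target=rag_worker, args=(queue,), daemon=True)
    proc.start()
    queue.put({"type": "add_doc", "doc_index": 0, "doc": "A red sedan enters the intersection."})
    queue.put(None)                        # shut the worker down
    proc.join()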

Data Ingestion#

  • Parallel and Asynchronous Ingestion:

    • Documents are ingested in parallel and asynchronously processed to avoid blocking the main process.

    • Documents can be added with doc_index and doc_meta.

      • doc_index is used to uniquely identify the document (doc_0, doc_1, doc_2, etc.).

      • doc_meta is used to store additional metadata about the document (for example, {"stream_id": 0, "timestamp": 1716393600}).

    • Processing is done in the separate Context-Aware RAG process.

    • Documents can arrive in any order.

  • Batcher:

    • The batcher groups documents into fixed-size, duplicate-checked batches, assigning each document to batch doc_id // batch_size (see the sketch after this list).

    • When a batch fills up, downstream processing can be triggered immediately (e.g., graph extraction).

  • Example Batching Process (with batch_size = 2):

    • If doc_4 arrives first, it’s placed in batch 2 (doc_id 4 // batch_size 2 = 2).

    • When doc_0 arrives, it’s placed in batch 0 (doc_id 0 // batch_size 2 = 0).

    • When doc_1 arrives, it completes batch 0 (doc_id 1 // batch_size 2 = 0).

    • Batch 0 is now full and triggers asynchronous processing:

      • Partial graph construction begins for documents 0 and 1.

    • This process continues until all documents arrive.

    • Once all batches are processed, the final lexical graph is constructed.
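
A minimal sketch of this batching logic is shown below; the class and callback names are illustrative and not the actual Context-Aware RAG implementation.

from typing import Callable, Dict, List

class Batcher:
    def __init__(self, batch_size: int, on_batch_full: Callable[[int, List[str]], None]):
        self.batch_size = batch_size
        self.on_batch_full = on_batch_full
        self._batches: Dict[int, List] = {}      # batch index -> document slots

    def add(self, doc_index: int, doc: str) -> None:
        batch_id = doc_index // self.batch_size   # e.g. doc_4 -> batch 2 when batch_size = 2
        slot = doc_index % self.batch_size
        batch = self._batches.setdefault(batch_id, [None] * self.batch_size)
        if batch[slot] is not None:               # duplicate check: ignore re-added documents
            return
        batch[slot] = doc
        if all(d is not None for d in batch):     # batch is full
            self.on_batch_full(batch_id, batch)   # e.g. trigger partial graph extraction

# Documents may arrive in any order; batch 0 triggers as soon as doc_0 and doc_1 are in.
batcher = Batcher(batch_size=2,
                  on_batch_full=lambda b, docs: print(f"batch {b} ready: {docs}"))
batcher.add(4, "caption for doc_4")   # placed in batch 2, waiting for doc_5
batcher.add(0, "caption for doc_0")   # placed in batch 0
batcher.add(1, "caption for doc_1")   # completes batch 0 and fires the callback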

Summarization#

CA-RAG provides the following method for summarizing content.

  • Batch: This method performs summarization in two stages:

    • Batching: Groups documents into batches and generates a summary for each batch.

    • Aggregation: Combines batch summaries using a secondary prompt (summary_aggregation).

    This method is ideal for handling long videos; a minimal sketch of the two-stage flow follows.
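
The sketch below assumes a generic llm callable; the prompt arguments echo the configuration names above, but the function itself is illustrative rather than the actual CA-RAG implementation.

from typing import Callable, List

def batch_summarize(captions: List[str], batch_size: int, llm: Callable[[str], str],
                    batch_prompt: str, aggregation_prompt: str) -> str:
    # Stage 1 (Batching): summarize each fixed-size group of captions.
    batch_summaries = []
    for i in range(0, len(captions), batch_size):
        chunk = "\n".join(captions[i:i + batch_size])
        batch_summaries.append(llm(f"{batch_prompt}\n{chunk}"))

    # Stage 2 (Aggregation): combine the batch summaries into one final summary
    # using the secondary (summary_aggregation) prompt.
    combined = "\n".join(batch_summaries)
    return llm(f"{aggregation_prompt}\n{combined}")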

Data Retrieval (QnA)#

CA-RAG supports Question-Answering (QnA) functionality via VectorRAG and GraphRAG. Multi-stream support is available only for GraphRAG. GraphRAG or VectorRAG can be configured through CA-RAG Configuration.

  • VectorRAG

    • Captions generated by the Vision-Language Model (VLM), along with their embeddings, are stored in Milvus DB.

    • Embeddings can be created using any embedding NIM.

    • For a query, the top five most similar chunks are retrieved, re-ranked using any reranker NIM, and passed to a Large Language Model (LLM) NIM to generate the final answer (see the sketch after this list).

  • GraphRAG

    • Graph Extraction: Entities and relationships are extracted from VLM captions using an LLM and stored in a GraphDB. Captions and their embeddings, generated with any embedding NIM, are also linked to these entities.

    • Graph Retrieval: For a given query, relevant entities, relationships, and captions are retrieved from the GraphDB and passed to an LLM NIM to generate the final answer.

    • Multi-stream Support:

      • CA-RAG supports multi-stream processing, allowing users to process multiple live streams or files concurrently.

      • For multi-stream processing, the stream-id is stored with each caption and entity. This allows retrieving the captions and the corresponding entities and relationships for a specific stream.

      • To enable multi-stream processing, set the multi-channel parameter to true in the config/config.yaml file. By default, it is set to false.
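
A minimal sketch of the VectorRAG retrieval flow referenced above is shown below. The collection name, field names, and helper functions are assumptions for illustration; only the pymilvus search call reflects a real client API.

from pymilvus import Collection, connections

def embed_query(text: str) -> list:
    raise NotImplementedError   # placeholder for a call to an embedding NIM

def rerank(query: str, chunks: list) -> list:
    raise NotImplementedError   # placeholder for a call to a reranker NIM

def generate_answer(query: str, context: list) -> str:
    raise NotImplementedError   # placeholder for a call to an LLM NIM

def answer_question(query: str) -> str:
    connections.connect(host="localhost", port="19530")    # Milvus DB holding VLM captions
    captions = Collection("vlm_captions")                   # hypothetical collection name

    # Retrieve the top five most similar caption chunks for the query embedding.
    hits = captions.search(
        data=[embed_query(query)],
        anns_field="embedding",
        param={"metric_type": "IP"},
        limit=5,
        output_fields=["caption"],
    )
    chunks = [hit.entity.get("caption") for hit in hits[0]]

    # Re-rank the retrieved chunks and let the LLM produce the final answer.
    return generate_answer(query, rerank(query, chunks))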

Alerts#

The Alerts feature provides event-based notifications. For each VLM caption, an LLM analyzes the caption and generates alerts based on event criteria defined in natural language (a minimal evaluation sketch appears at the end of this section).

For example, to configure alerts for a traffic video to detect accidents, the criteria can be defined in natural language in the UI:

incident: accident on the road;
response: first responders arrive for help;

When an alert is detected, a notification is sent to the user via the VSS notification system. Here is an example of an alert notification:

Alert Name: incident
Detected Events: accident on the road
Time: 80 seconds
Details: 2025-03-15 12:07:39 PM: The scene depicts an intersection with painted
stop lines and directional arrows on the road surface. A red sedan and a yellow
sedan are involved in a collision within the intersection. The red sedan appears to
be impacting the yellow sedan on its front passenger side.
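
The sketch below illustrates how per-caption alert evaluation against natural-language criteria might look, assuming a generic llm callable; it is illustrative and not the VSS implementation.

from typing import Callable, Dict, List

def check_alerts(caption: str, timestamp: str, criteria: Dict[str, str],
                 llm: Callable[[str], str]) -> List[dict]:
    """Ask an LLM whether a VLM caption matches any natural-language event criterion."""
    alerts = []
    for name, event in criteria.items():
        prompt = (f"Event criterion: {event}\n"
                  f"Caption: {caption}\n"
                  "Does the caption describe this event? Answer yes or no.")
        if llm(prompt).strip().lower().startswith("yes"):
            alerts.append({"alert": name, "event": event,
                           "time": timestamp, "details": caption})
    return alerts

# Criteria mirroring the example above.
criteria = {"incident": "accident on the road",
            "response": "first responders arrive for help"}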