Vision Microservices

The Vision Microservices extract rich visual features and semantic embeddings from video and generate natural-language descriptions of its content. Each is a standalone, GPU-accelerated microservice with documented REST and MCP interfaces, and each can be deployed independently or composed into an agent workflow.

Object Detection and Tracking - Real-time object detection, classification, and multi-object tracking on single or multi-camera streams.
Real-Time VLM - Vision Language Models that generate captions, detect incidents, and identify anomalies in video streams.
Video Embedding Generation - Semantic embeddings from video, images, and live RTSP streams for search and similarity matching.
Video Summarization - Analysis and summarization of extended video recordings through chunking and aggregation of dense captions.
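The chunk-and-aggregate pattern named under Video Summarization can be sketched in a few lines of Python. This is only an illustration of the general idea, not the microservice's actual implementation or API: the function names (`split_into_chunks`, `summarize`), the fixed-length non-overlapping windows, and the placeholder captioner and aggregator are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def split_into_chunks(duration_s: float, chunk_s: float) -> List[Tuple[float, float]]:
    """Return consecutive (start, end) windows covering [0, duration_s].

    Hypothetical helper: assumes fixed-length, non-overlapping chunks.
    The last chunk is shortened so it ends exactly at duration_s.
    """
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def summarize(
    duration_s: float,
    chunk_s: float,
    caption_chunk: Callable[[float, float], str],
    aggregate: Callable[[List[str]], str],
) -> str:
    """Caption each chunk independently, then aggregate the captions."""
    captions = [caption_chunk(s, e) for s, e in split_into_chunks(duration_s, chunk_s)]
    return aggregate(captions)


# Placeholder stand-ins (assumptions): a real deployment would call a
# captioning model per chunk and a summarizer over the dense captions.
def fake_caption(start: float, end: float) -> str:
    return f"[{start:.0f}s-{end:.0f}s] caption placeholder"


def join_captions(captions: List[str]) -> str:
    return "\n".join(captions)


summary = summarize(150.0, 60.0, fake_caption, join_captions)
print(summary)
```

Passing the captioner and aggregator as callables keeps the chunking logic separate from whichever models produce and condense the captions, so either stage can be swapped without touching the other.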