Vision Microservices

The Vision Microservices extract rich visual features, generate semantic embeddings, and derive natural-language understanding from video. Each one is a standalone, GPU-accelerated microservice with documented REST and MCP interfaces, and each can be deployed independently or composed into an agent workflow.

  • Object Detection and Tracking - Real-time object detection, classification, and multi-object tracking on single or multi-camera streams.

  • Real-Time VLM - Vision Language Models that generate captions, detect incidents, and identify anomalies in video streams.

  • Video Embedding Generation - Semantic embeddings from video, images, and live RTSP streams for search and similarity matching.

  • Video Summarization - Analysis and summarization of extended video recordings through chunking and aggregation of dense captions.
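The chunk-and-aggregate pattern mentioned for Video Summarization can be illustrated with a minimal Python sketch. This is not the microservice's API: `chunk_spans`, `summarize`, and the `caption_chunk` and `aggregate` callables are hypothetical names introduced here, and the fixed-length chunking policy is an assumption. In a real deployment the captioner would presumably be a VLM call and the aggregator a language-model call.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]


def chunk_spans(duration_s: float, chunk_s: float) -> List[Span]:
    """Split [0, duration_s) into consecutive spans of at most chunk_s seconds."""
    duration_s = float(duration_s)
    chunk_s = float(chunk_s)
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    spans: List[Span] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


def summarize(
    duration_s: float,
    chunk_s: float,
    caption_chunk: Callable[[float, float], str],  # hypothetical: caption one chunk
    aggregate: Callable[[List[str]], str],         # hypothetical: merge the captions
) -> str:
    """Caption each chunk independently, then aggregate the captions into one summary."""
    captions = [caption_chunk(start, end) for start, end in chunk_spans(duration_s, chunk_s)]
    return aggregate(captions)


if __name__ == "__main__":
    # Stub captioner and aggregator stand in for model calls.
    stub_caption = lambda s, e: f"[{s:.0f}-{e:.0f}s] caption"
    print(summarize(150, 60, stub_caption, " ".join))
```

Captioning each chunk independently keeps the per-call input bounded regardless of recording length, which is why long recordings are typically handled this way; the trade-off is that context spanning chunk boundaries has to be recovered in the aggregation step.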