Model customization overview#

The Video Search and Summarization (VSS) blueprint uses embedding models in two places: object-level embeddings from the Real-Time Computer Vision (RT-CV) microservice for detection crops and text queries, and video-level embeddings from the Real-Time Embedding microservice for clip- and stream-based semantic search. Recent TAO Toolkit releases support fine-tuning the same foundation models that these services use, so you can specialize embeddings for your domain and swap the artifacts into your deployment.

Industry blueprint deployments (for example Warehouse Operations) also ship computer vision perception models (Sparse4D, RT-DETR) that you can fine-tune with TAO and integrate into the Perception microservice. See Sparse4D and RT-DETR.

This section maps which components accept customized weights, where to fine-tune (documentation and NGC model cards), and how to integrate fine-tuned models. For low-level service settings (environment variables, DeepStream config keys, APIs), see the linked microservice guides.

Note

Changing embedding dimension or similarity geometry usually requires re-indexing stored vectors (for example Elasticsearch indices used by search workflows) and revisiting similarity thresholds in analytics. Plan a full pipeline validation after swapping models.
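To make the re-indexing concern concrete, here is a minimal sketch of rebuilding an Elasticsearch-style `dense_vector` mapping after a model swap changes the embedding dimension. The index layout, field names (`embedding`, `clip_id`), and the 512/768 dimensions are hypothetical; the real mapping comes from your search workflow.

```python
# Sketch: a dense_vector mapping is fixed at index-creation time, so a
# dimension change forces a new index plus re-embedding of the source clips.
# Field names and dimensions below are illustrative, not from the blueprint.
def make_dense_vector_mapping(dims: int, similarity: str = "cosine") -> dict:
    """Build an index mapping for a dense_vector field with the given dimension."""
    return {
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": similarity,
                },
                "clip_id": {"type": "keyword"},
            }
        }
    }

# An index created for 512-d vectors cannot store 768-d vectors from a
# fine-tuned model; create a fresh index and re-embed the stored content.
old_mapping = make_dense_vector_mapping(512)
new_mapping = make_dense_vector_mapping(768)
```

The `similarity` setting is also part of the mapping, which is why a change in similarity geometry (not just dimension) triggers the same re-indexing and threshold-revalidation work.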

Summary#

| VSS component | Embedding role | Supported model families | Primary integration surface |
|---|---|---|---|
| Object Detection and Tracking (RT-CV) | Object crop embeddings and optional text embeddings (ReID, text-to-image alignment) | RADIO-CLIP, SigLIP v2 | DeepStream INI: vision encoder + text embedder (ONNX / TensorRT); see ReID and Embeddings on that page |
| Real-Time Embedding (RT-Embedding) | Video and text embeddings for semantic search and Kafka-published clip features | Cosmos-Embed1 | Container environment: MODEL_PATH, optional MODEL_IMPLEMENTATION_PATH / Triton repo scripts; see Customizations on that page |
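As an illustration of the RT-Embedding integration surface, the fragment below shows how a fine-tuned checkpoint might be wired in through the container environment. Only the variable names `MODEL_PATH` and `MODEL_IMPLEMENTATION_PATH` come from the table above; the service name and paths are placeholders, and the exact contract is defined on the RT-Embedding Customizations page.

```yaml
# Hypothetical docker-compose fragment; paths and service name are
# placeholders, not values from the blueprint documentation.
services:
  rt-embedding:
    environment:
      # Point the service at the fine-tuned Cosmos-Embed1 artifacts.
      MODEL_PATH: /models/cosmos-embed1-finetuned
      # Optional: custom model implementation / Triton repo scripts.
      MODEL_IMPLEMENTATION_PATH: /models/custom_impl
```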

Per-model guides#

Agent and search workflows#

Developer profiles and agent tools that call embedding endpoints or query Elasticsearch (see VSS-Agent-Profiles) assume embedding spaces that are compatible with the indexed documents. After swapping RT-CV or RT-Embedding models, confirm that:

  • API payloads still match expected vector sizes.

  • Search and fusion logic (for example multi-embedding or attribute search) still uses comparable similarity scales.
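Both checks above can be automated as a small smoke test. The sketch below assumes a hypothetical expected index dimension (`EXPECTED_DIMS`) and plain list-of-float vectors; adapt it to your actual API payloads and index mapping.

```python
import math

# Sketch of post-swap validation; EXPECTED_DIMS is an assumption that must
# match the dimension your search index was created with.
EXPECTED_DIMS = 768

def check_vector(vec: list[float], expected_dims: int = EXPECTED_DIMS) -> None:
    """Fail fast if a returned embedding cannot be indexed or compared."""
    assert len(vec) == expected_dims, (
        f"got {len(vec)}-d vector, index expects {expected_dims}-d"
    )

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: bounded in [-1, 1] regardless of vector norms,
    so it stays comparable across models that only rescale their outputs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Even at an unchanged dimension, a new model can shift the similarity
# distribution; spot-check known-similar and known-dissimilar pairs before
# trusting previously tuned thresholds in fusion or attribute search.
```

Running such checks against a handful of known query/document pairs after each model swap catches dimension mismatches and threshold drift before they surface as silently degraded search results.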

For LLM and VLM configuration (not embedding fine-tuning), see Configure the LLM and Configure the VLM.