Overview
TAO Toolkit supports fine-tuning of embedding models that align or process more than one modality, such as images, video, and text. These models power applications including image-text retrieval, zero-shot classification, video retrieval, semantic deduplication, and embedding-based similarity search.
CLIP
A dual-encoder image-text model for zero-shot classification, retrieval, and embedding extraction.
Cosmos-Embed1
A dual-encoder video-text embedding model for text-to-video and video-to-video retrieval, semantic deduplication, and targeted dataset filtering.
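Both models are dual encoders: each modality is mapped to a vector in a shared embedding space, and downstream tasks reduce to comparing vectors. The following is a minimal sketch of that idea in plain Python, not TAO Toolkit code. The helper names (`rank_by_similarity`, `find_near_duplicates`), the toy three-dimensional vectors, and the 0.95 deduplication threshold are all illustrative assumptions. In practice the embeddings would come from a fine-tuned model and have hundreds of dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_by_similarity(query, candidates):
    """Rank named candidate embeddings by similarity to a query embedding.

    With text prompts as candidates and an image as the query, this acts as
    zero-shot classification; with a text query and video candidates, it acts
    as text-to-video retrieval.
    """
    scores = [(name, cosine_similarity(query, emb)) for name, emb in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def find_near_duplicates(embeddings, threshold=0.95):
    """Return pairs of items whose similarity meets the (assumed) threshold."""
    names = list(embeddings)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            sim = cosine_similarity(embeddings[names[i]], embeddings[names[j]])
            if sim >= threshold:
                pairs.append((names[i], names[j]))
    return pairs


# Toy data standing in for model outputs (hypothetical values).
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity(image_embedding, label_embeddings)
best_label = ranking[0][0]

clip_embeddings = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.99, 0.05, 0.0],
    "clip_c": [0.0, 1.0, 0.0],
}
duplicate_pairs = find_near_duplicates(clip_embeddings)
```

Here `best_label` is the text prompt closest to the image embedding, and `duplicate_pairs` lists the clips that are similar enough to be treated as semantic duplicates under the assumed threshold. The pairwise loop is quadratic in the number of items, so a large dataset would typically need an approximate nearest-neighbor index instead.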