Overview

TAO Toolkit supports fine-tuning of embedding models that align or process more than one modality, such as images, video, and text. These models power applications including image-text retrieval, zero-shot classification, video retrieval, semantic deduplication, and embedding-based similarity search.
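To make the idea of embedding-based similarity search concrete, here is a minimal sketch of ranking text candidates against one image embedding by cosine similarity, which is the usual way dual-encoder outputs are compared for zero-shot classification and retrieval. It does not use any TAO Toolkit API: the function names, the three-dimensional toy vectors, and the text prompts are all hypothetical stand-ins for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_by_similarity(query, candidates):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query, emb))
              for label, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real model would produce
# much higher-dimensional vectors from its image and text encoders.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

ranking = rank_by_similarity(image_embedding, text_embeddings)
best_label = ranking[0][0]  # the highest-scoring prompt acts as the zero-shot prediction
print(best_label)
```

The same ranking step applies to retrieval in either direction: swap the query for a text embedding and the candidates for image or video embeddings.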

CLIP

A dual-encoder image-text model for zero-shot classification, retrieval, and embedding extraction.

Cosmos-Embed1

A dual-encoder video-text embedding model for text-to-video and video-to-video retrieval, semantic deduplication, and targeted dataset filtering.
