Multimodal Model Fine-Tuning#
TAO Toolkit supports fine-tuning of multimodal models that align or process more than one modality, such as images, video, and text. These models power applications including image-text retrieval, zero-shot classification, video retrieval, semantic deduplication, and embedding-based similarity search.
You can invoke all multimodal fine-tuning tasks using the TAO Launcher:

```shell
tao model <model-name> <action> -e /path/to/spec.yaml [overrides]
```
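If you are driving the launcher from automation, it can help to build the command programmatically. The sketch below is a minimal, hypothetical helper: the model name `nvclip`, the `train` action, and the `train.num_epochs=10` override are illustrative assumptions, not guaranteed to match your installed TAO version.

```python
# Hypothetical sketch: composing a TAO Launcher command programmatically.
# The model name, action, and override key used below are assumptions for
# illustration; check your TAO version's docs for the exact values.
def build_tao_command(model, action, spec_path, overrides=()):
    """Build the argv list for `tao model <model> <action> -e <spec> [overrides]`."""
    cmd = ["tao", "model", model, action, "-e", spec_path]
    cmd += list(overrides)
    return cmd

cmd = build_tao_command("nvclip", "train", "/workspace/specs/train.yaml",
                        ["train.num_epochs=10"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once you have verified the model and action names against the launcher's help output.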
Supported Models#
| Model | Description |
|---|---|
| NV-CLIP | A dual-encoder image-text model for zero-shot classification, retrieval, and embedding extraction. |
| Cosmos-Embed1 | A dual-encoder video-text embedding model for text-to-video retrieval, video-to-video retrieval, semantic deduplication, and targeted filtering of video datasets. |
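Several of the use cases above (retrieval, semantic deduplication, similarity search) reduce to ranking database embeddings by cosine similarity against a query embedding. The following is a minimal sketch of that step using toy vectors in place of real model outputs; the function name and the example data are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of embedding-based similarity search: rank database
# embeddings by cosine similarity to a query embedding. The 3-d vectors
# here are toy stand-ins for the embeddings a dual-encoder model produces.
def rank_by_cosine(query, database):
    """Return (indices sorted best-first, cosine scores per database row)."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                  # cosine similarity of each row to query
    return np.argsort(-scores), scores

query = np.array([1.0, 0.0, 0.0])
database = np.array([
    [0.9, 0.1, 0.0],    # nearly aligned with the query
    [0.0, 1.0, 0.0],    # orthogonal
    [-1.0, 0.0, 0.0],   # opposite direction
])
order, scores = rank_by_cosine(query, database)
print(order)
```

In practice the query would be a text embedding and the database rows image or video embeddings produced by the corresponding encoder; the ranking logic is unchanged.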