Multimodal Model Fine-Tuning#
TAO Toolkit supports fine-tuning of multimodal models that align or process more than one modality, such as images, video, and text. These models power applications including image-text retrieval, zero-shot classification, video retrieval, semantic deduplication, and embedding-based similarity search.
You can invoke all multimodal fine-tuning tasks using the TAO Launcher:

```shell
tao model <model-name> <action> -e /path/to/spec.yaml [overrides]
```
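If you are driving the launcher from automation, it can help to build the command programmatically. The sketch below is a minimal, hypothetical helper: the model name `nvclip`, the `train` action, and the `train.num_epochs=10` override are illustrative assumptions, not guaranteed to match your installed TAO version.

```python
# Hypothetical sketch: composing a TAO Launcher command programmatically.
# The model name, action, and override key used below are assumptions for
# illustration; check your TAO version's docs for the exact values.
def build_tao_command(model, action, spec_path, overrides=()):
    """Build the argv list for `tao model <model> <action> -e <spec> [overrides]`."""
    cmd = ["tao", "model", model, action, "-e", spec_path]
    cmd += list(overrides)
    return cmd

cmd = build_tao_command("nvclip", "train", "/workspace/specs/train.yaml",
                        ["train.num_epochs=10"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once you have verified the model and action names against the launcher's help output.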
Supported Models#
| Model | Description |
|---|---|
| NV-CLIP | A dual-encoder image-text model for zero-shot classification, retrieval, and embedding extraction. |
| Cosmos-Embed1 | A dual-encoder video-text embedding model for text-to-video retrieval, video-to-video retrieval, semantic deduplication, and targeted filtering of video datasets. |
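Several of the use cases above (retrieval, semantic deduplication, similarity search) reduce to ranking database embeddings by cosine similarity against a query embedding. The following is a minimal sketch of that step using toy vectors in place of real model outputs; the function name and the example data are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of embedding-based similarity search: rank database
# embeddings by cosine similarity to a query embedding. The 3-d vectors
# here are toy stand-ins for the embeddings a dual-encoder model produces.
def rank_by_cosine(query, database):
    """Return (indices sorted best-first, cosine scores per database row)."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                  # cosine similarity of each row to query
    return np.argsort(-scores), scores

query = np.array([1.0, 0.0, 0.0])
database = np.array([
    [0.9, 0.1, 0.0],    # nearly aligned with the query
    [0.0, 1.0, 0.0],    # orthogonal
    [-1.0, 0.0, 0.0],   # opposite direction
])
order, scores = rank_by_cosine(query, database)
print(order)
```

In practice the query would be a text embedding and the database rows image or video embeddings produced by the corresponding encoder; the ranking logic is unchanged.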