Vision-Language Model (VLM) Fine-Tuning#
TAO supports fine-tuning of Vision-Language Models (VLMs), which can understand and process both visual and textual information. VLM fine-tuning enables you to adapt pretrained multimodal models to your specific use cases, such as video understanding, visual question answering, and multimodal content generation.
The VLM fine-tuning pipeline in TAO is designed to work with state-of-the-art vision-language models, and provides:
Multimodal understanding: Trains models that can process both visual (images/videos) and textual inputs.
Flexible data formats: Supports various annotation formats, including LLaVA format.
AutoML integration: Automates hyperparameter optimization for optimal model performance.
Scalable training: Supports multi-GPU and distributed training capabilities.
Agent-driven workflow: The TAO agent launches fine-tuning through the
tao-skillsplugin.
Supported Models#
TAO currently supports the following VLM architecture:
Cosmos-Reason: A state-of-the-art video-language model for video understanding tasks.
Key Features#
Multimodal data processing: Handle datasets containing both visual content (images and videos) and corresponding text annotations.
Advanced training techniques: Leverages techniques like LoRA (Low-Rank Adaptation) for efficient fine-tuning of large models.
AutoML support: Automatically optimizes hyperparameters, including learning rates, batch sizes, and training policies.
Cloud integration: Supports seamless integration with cloud storage services (AWS, Azure) for dataset management and result storage.
Getting Started#
To get started with VLM fine-tuning:
Set up the agent: Follow the TAO getting started guide to install the
tao-skillsplugin and export the credentials your chosen compute backend needs.Prepare data: Organize your multimodal dataset in a supported format (refer to the model page below for specifics).
Drive training from the agent: Describe the run in plain English: the model, dataset URI, key hyperparameters, and backend. The agent resolves the specification keys from the model skill and dispatches via the TAO Execution SDK.
Monitor training: Ask the agent for status, logs, or metrics on the running job.