# Vision Language Models
This directory contains documentation for Vision Language Models (VLMs) supported by Megatron Bridge. These models combine vision and language capabilities for multimodal AI applications.
## Available Models
Megatron Bridge supports the following VLM families:
| Model | Description |
|---|---|
| Gemma 3 VL | Google Gemma 3 Vision Language model |
| Nemotron Nano V2 VL | NVIDIA Nemotron Nano V2 Vision Language model |
| Qwen2.5 VL | Alibaba Cloud Qwen2.5 Vision Language model |
| Qwen3 VL | Alibaba Cloud Qwen3 Vision Language model |
## Vision Language Model Features
VLMs typically support:

- **Image Understanding** - Processing and understanding visual inputs
- **Multimodal Fusion** - Combining vision and language representations
- **Vision-Language Tasks** - Image captioning, visual question answering, and more (see the sketch after this list)
- **Cross-Modal Learning** - Learning relationships between visual and textual data
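To make these task categories concrete, the sketch below runs a single visual question answering query. It uses the Hugging Face `transformers` API directly rather than Megatron Bridge, purely to illustrate the multimodal input/output pattern; the checkpoint name and image URL are placeholders.

```python
# Minimal VQA sketch using the Hugging Face transformers API (not Megatron
# Bridge). Checkpoint name and image URL are placeholders for illustration.
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed checkpoint name
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch an example image (placeholder URL).
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)

# A chat turn that interleaves an image with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What animal is in this picture?"},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate an answer conditioned on both the image and the question.
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```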
Ready to explore? Choose a model from the table above or return to the main documentation.