# Vision Language Models

This directory contains documentation for Vision Language Models (VLMs) supported by Megatron Bridge. These models combine vision and language capabilities for multimodal AI applications.

## Available Models

Megatron Bridge supports the following VLM families:

| Model | Documentation | Description |
|-------|---------------|-------------|
| Gemma 3 VL | [gemma3-vl.md](gemma3-vl.md) | Google Gemma 3 Vision Language model |
| Nemotron Nano V2 VL | [nemotron-nano-v2-vl.md](nemotron-nano-v2-vl.md) | NVIDIA Nemotron Nano V2 Vision Language model |
| Qwen2.5 VL | [qwen2.5-vl.md](qwen2.5-vl.md) | Alibaba Cloud Qwen2.5 Vision Language model |
| Qwen3 VL | [qwen3-vl.md](qwen3-vl.md) | Alibaba Cloud Qwen3 Vision Language model |

## Quick Navigation

### I want to

- 🔍 **Find a specific VLM model** → Browse the model list above or use the index page
- 🔄 **Convert models between formats** → Each model page includes Hugging Face ↔ Megatron Bridge conversion examples; a minimal sketch also follows this list
- 🚀 **Get started with training** → See the Training Documentation for training guides
- 📚 **Understand VLM architecture** → Each model page documents its vision-language architecture features
- 🔧 **Add support for a new VLM** → Refer to Adding New Models
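
To give a quick sense of the conversion flow, here is a minimal sketch. It assumes the `AutoBridge` entry point shown in Megatron Bridge's examples; the model ID, method names, and output path are illustrative assumptions, so consult each model's page for the authoritative commands.

```python
# Minimal conversion sketch. Assumptions: the AutoBridge API surface, the
# model ID, and the output path shown here are illustrative, not verified
# against a specific Megatron Bridge release.
from megatron.bridge import AutoBridge

# Import a Hugging Face VLM checkpoint into Megatron Bridge
bridge = AutoBridge.from_hf_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Obtain a Megatron-side model provider for training or fine-tuning
provider = bridge.to_megatron_provider()

# ... train or fine-tune with Megatron ...

# Export the weights back to Hugging Face format
bridge.save_hf_pretrained("./qwen2.5-vl-hf-export")
```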

## Vision Language Model Features

VLMs typically support:

- **Image Understanding** - Processing and understanding visual inputs
- **Multimodal Fusion** - Combining vision and language representations (see the sketch below)
- **Vision-Language Tasks** - Image captioning, visual question answering, and more
- **Cross-Modal Learning** - Learning relationships between visual and textual data
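
To make the fusion idea concrete, the sketch below shows the pattern these VLM families broadly share: a vision encoder produces image embeddings that are spliced into the language model's token-embedding sequence at image-placeholder positions. The tensor shapes, placeholder token ID, and variable names are illustrative assumptions, not the exact Megatron Bridge internals.

```python
import torch

# Illustrative fusion sketch (not Megatron Bridge internals): vision-encoder
# outputs replace image-placeholder token embeddings before the sequence is
# fed to the language model. All shapes and IDs below are assumptions.
IMAGE_PLACEHOLDER_ID = 151655  # hypothetical special-token ID
hidden = 4096                  # hypothetical LM hidden size

input_ids = torch.tensor([[101, IMAGE_PLACEHOLDER_ID, IMAGE_PLACEHOLDER_ID, 2009, 102]])
text_embeds = torch.randn(1, input_ids.shape[1], hidden)  # stand-in for the LM embedding lookup
image_embeds = torch.randn(2, hidden)                     # stand-in for vision encoder + projector output

# Splice one image embedding into each placeholder position
mask = input_ids == IMAGE_PLACEHOLDER_ID  # (1, seq_len) boolean mask
fused = text_embeds.clone()
fused[mask] = image_embeds

# `fused` now interleaves text and image representations and is processed
# by the language model's transformer layers as a single sequence.
```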


Ready to explore? Choose a model from the list above or return to the main documentation.