# Vision Language Models

This directory contains documentation for Vision Language Models (VLMs) supported by Megatron Bridge. These models combine vision and language capabilities for multimodal AI applications.

## Available Models

Megatron Bridge supports the following VLM families:

| Model | Documentation | Description |
|-------|---------------|-------------|
| Gemma 3 VL | [gemma3-vl.md](gemma3-vl.md) | Google Gemma 3 Vision Language model |
| Nemotron Nano V2 VL | [nemotron-nano-v2-vl.md](nemotron-nano-v2-vl.md) | NVIDIA Nemotron Nano V2 Vision Language model |
| Qwen2.5 VL | [qwen2.5-vl.md](qwen2.5-vl.md) | Alibaba Cloud Qwen2.5 Vision Language model |
| Qwen3 VL | [qwen3-vl.md](qwen3-vl.md) | Alibaba Cloud Qwen3 Vision Language model |

## Quick Navigation

### I want to

- 🔍 **Find a specific VLM model** → Browse the model list above or use the index page
- 🔄 **Convert models between formats** → Each model page includes Hugging Face ↔ Megatron Bridge conversion examples; a minimal sketch also follows this list
- 🚀 **Get started with training** → See the Training Documentation for training guides
- 📚 **Understand VLM architecture** → Each model page documents its vision-language architecture features
- 🔧 **Add support for a new VLM** → Refer to Adding New Models
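
To give a quick sense of the conversion flow, here is a minimal sketch. It assumes the `AutoBridge` entry point shown in Megatron Bridge's examples; the model ID, method names, and output path are illustrative assumptions, so consult each model's page for the authoritative commands.

```python
# Minimal conversion sketch. Assumptions: the AutoBridge API surface, the
# model ID, and the output path shown here are illustrative, not verified
# against a specific Megatron Bridge release.
from megatron.bridge import AutoBridge

# Import a Hugging Face VLM checkpoint into Megatron Bridge
bridge = AutoBridge.from_hf_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Obtain a Megatron-side model provider for training or fine-tuning
provider = bridge.to_megatron_provider()

# ... train or fine-tune with Megatron ...

# Export the weights back to Hugging Face format
bridge.save_hf_pretrained("./qwen2.5-vl-hf-export")
```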

## Vision Language Model Features

VLMs typically support:

- **Image Understanding** - Processing and understanding visual inputs
- **Multimodal Fusion** - Combining vision and language representations (see the sketch below)
- **Vision-Language Tasks** - Image captioning, visual question answering, and more
- **Cross-Modal Learning** - Learning relationships between visual and textual data
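
To make the fusion idea concrete, the sketch below shows the pattern these VLM families broadly share: a vision encoder produces image embeddings that are spliced into the language model's token-embedding sequence at image-placeholder positions. The tensor shapes, placeholder token ID, and variable names are illustrative assumptions, not the exact Megatron Bridge internals.

```python
import torch

# Illustrative fusion sketch (not Megatron Bridge internals): vision-encoder
# outputs replace image-placeholder token embeddings before the sequence is
# fed to the language model. All shapes and IDs below are assumptions.
IMAGE_PLACEHOLDER_ID = 151655  # hypothetical special-token ID
hidden = 4096                  # hypothetical LM hidden size

input_ids = torch.tensor([[101, IMAGE_PLACEHOLDER_ID, IMAGE_PLACEHOLDER_ID, 2009, 102]])
text_embeds = torch.randn(1, input_ids.shape[1], hidden)  # stand-in for the LM embedding lookup
image_embeds = torch.randn(2, hidden)                     # stand-in for vision encoder + projector output

# Splice one image embedding into each placeholder position
mask = input_ids == IMAGE_PLACEHOLDER_ID  # (1, seq_len) boolean mask
fused = text_embeds.clone()
fused[mask] = image_embeds

# `fused` now interleaves text and image representations and is processed
# by the language model's transformer layers as a single sequence.
```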


Ready to explore? Choose a model from the list above or return to the main documentation.