Supported Models

This directory contains family-organized documentation for models supported by Megatron Bridge. Each model page covers supported variants, Hugging Face <-> Megatron Bridge conversion, training recipe links, and model-specific notes.

Family Index

Quick Navigation

I want to

Find model-specific docs -> Browse the family index above or use the navigation for the model’s family.

Convert models between formats -> See the Bridge Guide for Hugging Face <-> Megatron conversion basics. Model pages include model-specific commands where available.

Get started with training -> See Training Documentation for training guides and Recipe Usage for pre-configured training recipes.

Add support for a new model -> Refer to Adding New Models.

Model Documentation Structure

Each model documentation page typically includes:

  1. Model Overview - Architecture and key features

  2. Available Variants - Supported model sizes and configurations

  3. Conversion Examples - Converting between Hugging Face and Megatron formats

  4. Training Recipes - Links to training configurations and examples

  5. Architecture Details - Model-specific features and configurations

Model Support Overview

Decoder-Only and Hybrid Backbones

  • Bailing, DeepSeek, Falcon, Gemma, GLM, GPT-OSS, Kimi, Llama, MiniMax, Mistral, Moonlight, Nemotron, OLMoE, Qwen, Sarvam, StepFun, and Xiaomi-MiMo

  • MoE and hybrid variants including Bailing, DeepSeek, GLM, GPT-OSS, MiniMax, Nemotron-3, OLMoE, Qwen3-MoE, Qwen3-Next, and Sarvam

Multimodal Variants

  • Gemma 3 VL and Gemma 4 VL

  • GLM-4.5V

  • Kimi-K2.5-VL

  • Ministral 3

  • Nemotron Nano V2 VL and Nemotron-3 Nano Omni

  • Qwen2-Audio, Qwen2.5-VL, Qwen2.5-Omni, Qwen3-VL, Qwen3.5 / 3.6, Qwen3-Omni, and Qwen3-ASR

Conversion Support

Each model page states which of the two conversion directions it supports:

  • Hugging Face -> Megatron Bridge: Load pretrained weights for training

  • Megatron Bridge -> Hugging Face: Export trained models for deployment

Conversion features:

  • Automatic architecture detection

  • Parallelism-aware conversion (TP/PP/VPP/CP/EP)

  • Streaming and memory-efficient transfers

  • Verification mechanisms for conversion accuracy

Refer to the Bridge Guide for detailed conversion instructions.
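
To make "parallelism-aware conversion" and "verification" more concrete, here is a minimal, library-free sketch of the general idea for tensor parallelism (TP): a full weight matrix is split column-wise into one shard per TP rank, then gathered back and compared against the original. This is an illustration only, not Megatron Bridge's implementation. The helper names (`shard_columns`, `gather_columns`, `max_abs_diff`) are hypothetical, and which dimension a real conversion splits depends on the layer type.

```python
from typing import List

Matrix = List[List[float]]


def shard_columns(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Split a weight matrix column-wise into tp_size equal shards (hypothetical helper)."""
    n_cols = len(weight[0])
    if n_cols % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    width = n_cols // tp_size
    # Shard r holds columns [r * width, (r + 1) * width) of every row.
    return [
        [row[r * width:(r + 1) * width] for row in weight]
        for r in range(tp_size)
    ]


def gather_columns(shards: List[Matrix]) -> Matrix:
    """Concatenate column shards, in rank order, back into the full matrix."""
    n_rows = len(shards[0])
    return [
        [value for shard in shards for value in shard[i]]
        for i in range(n_rows)
    ]


def max_abs_diff(a: Matrix, b: Matrix) -> float:
    """Largest element-wise absolute difference, used as a simple round-trip check."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# A toy 2 x 4 "full" weight, as it might appear in an unsharded checkpoint.
full = [[float(4 * i + j) for j in range(4)] for i in range(2)]

# Shard for TP=2, then gather again, as an export back to a single file would.
shards = shard_columns(full, tp_size=2)
restored = gather_columns(shards)

# A lossless round trip should reproduce the original weights exactly.
assert max_abs_diff(full, restored) == 0.0
```

The same shard, gather, and compare pattern is a reasonable mental model for the other parallelism dimensions listed above, although the actual mapping rules are model-specific and are described in the Bridge Guide.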