Language Models#

Megatron Core supports the following language model architectures for large-scale training.

Converting HuggingFace Models#

Use Megatron Bridge to convert HuggingFace models to Megatron format. Megatron Bridge is the official standalone converter and supports an extensive list of models, including LLaMA, Mistral, Mixtral, Qwen, DeepSeek, Gemma, Phi, Nemotron, and many more.

See the Megatron Bridge supported models list for the complete and up-to-date list.
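
The loading path generally looks like the sketch below. This is a minimal example assuming the AutoBridge entry point described in the Megatron Bridge documentation; exact method names may differ between releases, so consult the Megatron Bridge docs for the authoritative interface.

```python
# Minimal sketch of loading a HuggingFace checkpoint through Megatron Bridge.
# Assumes the AutoBridge API from the Megatron Bridge project; method names
# may differ between releases.
from megatron.bridge import AutoBridge

# Any supported HuggingFace model ID works here; LLaMA is shown as an example.
bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")

# Produce a Megatron Core model provider from the HuggingFace checkpoint,
# which can then be used for training or inference with Megatron Core.
provider = bridge.to_megatron_provider()
```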

Decoder-Only Models#

| Model | Description | Key Features |
|---------|-------------|--------------|
| GPT | Generative Pre-trained Transformer | Standard autoregressive LM, foundational architecture |
| LLaMA | Meta’s LLaMA family | Efficient architecture with RoPE, SwiGLU, RMSNorm |
| Mistral | Mistral AI models | Sliding window attention, efficient inference |
| Mixtral | Sparse Mixture-of-Experts | 8x7B MoE architecture for efficient scaling |
| Qwen | Alibaba’s Qwen series | HuggingFace integration, multilingual support |
| Mamba | State Space Model | Subquadratic sequence-length scaling, efficient long context |

Encoder-Only Models#

| Model | Description | Key Features |
|-------|-------------|--------------|
| BERT | Bidirectional Encoder Representations | Masked language modeling, classification tasks |

Encoder-Decoder Models#

| Model | Description | Key Features |
|-------|-------------|--------------|
| T5 | Text-to-Text Transfer Transformer | Unified text-to-text framework, sequence-to-sequence |

Retrieval-Augmented Models#

| Model | Description | Key Features |
|-------|-------------|--------------|
| RETRO | Retrieval-Enhanced Transformer | Retrieval-augmented generation, knowledge grounding |

Example Scripts#

Training examples for these models can be found in the examples/ directory:

  • examples/gpt3/ - GPT-3 training scripts

  • examples/llama/ - LLaMA training scripts

  • examples/mixtral/ - Mixtral MoE training scripts

  • examples/mamba/ - Mamba training scripts

  • examples/bert/ - BERT training scripts

  • examples/t5/ - T5 training scripts

  • examples/retro/ - RETRO training scripts

Model Implementation#

All language models are built using Megatron Core’s composable transformer blocks (see the construction sketch after this list), enabling:

  • Flexible parallelism strategies: tensor (TP), pipeline (PP), data (DP), expert (EP), and context (CP) parallelism

  • Mixed precision training (FP16, BF16, FP8)

  • Distributed checkpointing

  • Efficient memory management
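
For reference, constructing a decoder-only model from these composable blocks looks roughly like the sketch below, following the Megatron Core quickstart pattern: the model-parallel state is initialized first, TransformerConfig carries the architecture hyperparameters, and GPTModel assembles the layers from a layer spec. The sizes here are illustrative placeholders, and argument names may vary slightly between Megatron Core releases.

```python
# Minimal sketch of building a small GPT model from Megatron Core's composable
# transformer blocks. Sizes are tiny placeholders; argument names may vary
# slightly between releases.
import os

import torch
from megatron.core import parallel_state
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.transformer.transformer_config import TransformerConfig

# Assumes the script is launched with torchrun so the usual distributed
# environment variables (RANK, WORLD_SIZE, MASTER_ADDR, ...) are set.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
torch.distributed.init_process_group(backend="nccl")

# Megatron Core layers expect the model-parallel groups to exist, even for a
# single-GPU run (TP=1, PP=1).
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)

# Architecture hyperparameters live in TransformerConfig.
config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
)

# GPTModel composes the embedding, transformer block, and output layer
# according to the layer spec.
gpt_model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=512,
)
```

The same pattern extends to the other architectures listed above; the example scripts in examples/ show the full training loops, larger configurations, and multi-GPU parallelism settings.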