Megatron Bridge Documentation#

Welcome to the Megatron Bridge documentation! This guide helps you navigate the documentation to find exactly what you need for converting, training, and working with large language models (LLMs) and vision language models (VLMs).

πŸš€ Quick Start Paths#

I want to#

πŸƒβ€β™‚οΈ Get started with model conversion β†’ Start with Bridge Guide for Hugging Face ↔ Megatron conversion

⚑ Understand parallelisms and performance β†’ Jump to Parallelisms Guide and Performance Guide

πŸš€ Start training a model β†’ See Training Documentation for comprehensive training guides

πŸ“š Find model documentation β†’ Browse Supported Models for LLMs or Vision Language Models for VLMs

πŸ”§ Migrate from NeMo 2 or Megatron-LM β†’ Check NeMo 2 Migration Guide or Megatron-LM Migration Guide

πŸ“Š Use training recipes β†’ Read Recipe Usage for pre-configured training recipes

πŸ”Œ Add support for a new model β†’ Refer to Adding New Models

πŸ“‹ Check version information β†’ See Releases Documentation for versions, changelog, and known issues


πŸ‘₯ Documentation by Role#

For ML Engineers & Researchers#

Start with the Bridge Guide for Hugging Face ↔ Megatron conversion, then browse Large Language Models and Vision Language Models for model-specific documentation.

For Training Engineers#

Head to the Training Documentation, the Parallelisms Guide, and the Performance Guide.

For Model Developers#

See Adding New Models to extend model support and Bridge Tech Details to understand the bridge internals.

For DevOps & Platform Teams#

Check the Releases Documentation for software versions, the changelog, and known issues.


πŸ“š Complete Documentation Index#

Getting Started#

| Document | Purpose | When to Read |
|---|---|---|
| Bridge Guide | Hugging Face ↔ Megatron conversion guide | First time converting models |
| Bridge Tech Details | Technical details of the bridge system | Understanding bridge internals |
| Parallelisms Guide | Data and model parallelism strategies | Setting up distributed training |
| Performance Summary | Quick performance reference | Looking up benchmark numbers at a glance |
| Performance Guide | Comprehensive performance optimization | Optimizing training performance |

Model Support#

| Document | Purpose | When to Read |
|---|---|---|
| Large Language Models | Documentation for supported LLMs | Working with LLMs |
| Vision Language Models | Documentation for supported VLMs | Working with VLMs |
| Adding New Models | Guide for adding model support | Extending model support |

Training and Customization#

| Document | Purpose | When to Read |
|---|---|---|
| Training Documentation | Comprehensive training guides | Setting up and customizing training |
| Configuration Container Overview | Central training configuration | Understanding training configuration |
| Entry Points | Training entry points and execution | Understanding training flow |
| Training Loop Settings | Training loop parameters | Configuring training parameters |
| Optimizer & Scheduler | Optimization configuration | Setting up optimizers |
| Mixed Precision | Mixed precision training | Reducing memory usage |
| PEFT | Parameter-efficient fine-tuning | Fine-tuning with limited resources |
| Checkpointing | Checkpoint management | Saving and resuming training |
| Logging | Logging and monitoring | Monitoring training progress |
| Profiling | Performance profiling | Identifying bottlenecks |
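
To see how these configuration topics fit together, here is an illustrative config fragment. This is a sketch only: the field names below are assumptions modeled on common Megatron-style options, not the actual schema — the Configuration Container Overview documents the real fields.

```yaml
# Illustrative only — field names are assumptions, not the real ConfigContainer schema.
model:
  tensor_model_parallel_size: 2    # see the Parallelisms Guide
  pipeline_model_parallel_size: 1
train:
  train_iters: 10000               # see Training Loop Settings
  global_batch_size: 256
optimizer:
  lr: 3.0e-4                       # see Optimizer & Scheduler
  lr_decay_style: cosine
mixed_precision: bf16              # see Mixed Precision
checkpoint:
  save_interval: 1000              # see Checkpointing
logging:
  log_interval: 10                 # see Logging
```

Each top-level block corresponds to one of the documents in the table above, which is why they are typically read together when setting up a run.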

Recipes and Workflows#

| Document | Purpose | When to Read |
|---|---|---|
| Recipe Usage | Using pre-configured training recipes | Quick training setup |
| Bridge RL Integration | Reinforcement learning integration | RL training workflows |

Migration Guides#

| Document | Purpose | When to Read |
|---|---|---|
| NeMo 2 Migration Guide | Migrating from NeMo 2 | Upgrading from NeMo 2 |
| Megatron-LM Migration Guide | Migrating from Megatron-LM | Upgrading from Megatron-LM |

Reference#

| Document | Purpose | When to Read |
|---|---|---|
| API Documentation | Complete API reference | Building integrations |
| Releases Documentation | Version history and known issues | Checking versions, troubleshooting |
| Documentation Guide | Contributing to documentation | Writing or improving docs |


πŸ—ΊοΈ Common Reading Paths#

πŸ†• First-Time Users#

  1. Bridge Guide (10 min - understand conversion)

  2. Parallelisms Guide (15 min - understand distributed training)

  3. Training Documentation (choose your training path)

  4. Recipe Usage (5 min - use pre-configured recipes)

πŸ”§ Setting Up Training#

  1. Training Documentation (overview of training system)

  2. Configuration Container Overview (understand configuration)

  3. Entry Points (how training starts)

  4. Training Loop Settings (configure parameters)

  5. Logging (set up monitoring)

⚑ Performance Optimization#

  1. Performance Guide (comprehensive optimization strategies)

  2. Performance Summary (quick reference)

  3. Mixed Precision (reduce memory usage)

  4. Communication Overlap (optimize distributed training)

  5. Activation Recomputation (reduce memory footprint)

  6. Profiling (identify bottlenecks)

πŸ”„ Model Conversion Workflow#

  1. Bridge Guide (conversion basics)

  2. Bridge Tech Details (technical details)

  3. Supported Models or Vision Language Models (model-specific guides)

  4. Adding New Models (extend support)

πŸ”§ Customization and Extension#

  1. Training Documentation (training customization)

  2. PEFT (parameter-efficient fine-tuning)

  3. Distillation (knowledge distillation)

  4. Adding New Models (add model support)

  5. Bridge RL Integration (RL workflows)

πŸ“¦ Migration Paths#

  1. NeMo 2 Migration Guide (from NeMo 2)

  2. Megatron-LM Migration Guide (from Megatron-LM)

  3. Training Documentation (new training system)


πŸ“ Directory Structure#

Main Documentation#

  • Guides - Core guides for parallelisms, performance, recipes, and migration

  • Bridge Documentation - Hugging Face ↔ Megatron conversion guides

  • Model Documentation - Supported model families and architectures

Subdirectories#

models/#

  • llm/ - Large Language Model documentation

    • Individual model guides (Qwen, LLaMA, Mistral, etc.)

    • Conversion examples and training recipes

  • vlm/ - Vision Language Model documentation

    • VLM model guides (Qwen VL, Gemma VL, etc.)

    • Multimodal model support

training/#

  • Configuration - ConfigContainer, entry points, training loop settings

  • Optimization - Optimizer, scheduler, mixed precision, communication overlap

  • Performance - Attention optimizations, activation recomputation, CPU offloading

  • Monitoring - Logging, profiling, checkpointing, resiliency

  • Advanced - PEFT, packed sequences, distillation

releases/#

  • Software Versions - Current versions and dependencies

  • Changelog - Release history and changes

  • Known Issues - Bugs, limitations, and workarounds


πŸ”— How Documents Connect#

```mermaid
graph TD
    A[README.md<br/>Start Here] --> B[Bridge Guide<br/>Model Conversion]
    A --> C[Training Docs<br/>Training Setup]
    A --> D[Models<br/>Model Support]

    B --> E[Bridge Tech Details<br/>Technical Deep Dive]
    B --> F[Supported Models<br/>Model-Specific Guides]

    C --> G[Config Container<br/>Configuration]
    C --> H[Performance Guide<br/>Optimization]
    C --> I[Parallelisms<br/>Distributed Training]

    G --> J[Training Loop<br/>Training Parameters]
    G --> K[Optimizer & Scheduler<br/>Optimization Setup]

    H --> L[Mixed Precision<br/>Memory Efficiency]
    H --> M[Communication Overlap<br/>Performance]

    I --> N[Data Parallelism<br/>DDP]
    I --> O[Model Parallelism<br/>TP/PP/VPP]

    D --> P[LLM Models<br/>Language Models]
    D --> Q[VLM Models<br/>Vision Language Models]

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style H fill:#fce4ec
    style I fill:#e0f2f1
```

🀝 Getting Help#

  • GitHub Issues: Report bugs or request features

  • Documentation Issues: Found something unclear? Let us know!

  • Community: Join discussions and share experiences


Ready to get started? Choose your path above or dive into the Bridge Guide for model conversion! πŸš€