# Megatron Bridge Documentation
Welcome to the Megatron Bridge documentation! This guide helps you navigate the documentation to find exactly what you need for training, converting, and working with large language models (LLMs) and vision language models (VLMs).
## Quick Start Paths

### I want to…

- **Get started with model conversion** → Start with the Bridge Guide for Hugging Face → Megatron conversion
- **Understand parallelisms and performance** → Jump to the Parallelisms Guide and Performance Guide
- **Start training a model** → See the Training Documentation for comprehensive training guides
- **Find model documentation** → Browse Supported Models for LLMs or Vision Language Models for VLMs
- **Migrate from NeMo 2 or Megatron-LM** → Check the NeMo 2 Migration Guide or Megatron-LM Migration Guide
- **Use training recipes** → Read Recipe Usage for pre-configured training recipes
- **Add support for a new model** → Refer to Adding New Models
- **Check version information** → See the Releases Documentation for versions, changelog, and known issues
## Documentation by Role

### For ML Engineers & Researchers

- **Start here:** Bridge Guide → Training Documentation
- **Deep dive:** Performance Guide → Training Optimization Guides
- **Model support:** Supported Models → Adding New Models

### For Training Engineers

- **Start here:** Training Documentation → Configuration Container Overview
- **Performance:** Performance Guide → Performance Summary
- **Parallelisms:** Parallelisms Guide → Training Optimization

### For Model Developers

- **Start here:** Bridge Guide → Bridge Tech Details
- **Model support:** Adding New Models → Model Documentation
- **Integration:** Bridge RL Integration

### For DevOps & Platform Teams

- **Start here:** Releases Documentation → Software Versions
- **Troubleshooting:** Known Issues
- **API reference:** API Documentation
## Complete Documentation Index

### Getting Started

| Document | Purpose | When to Read |
|---|---|---|
| Bridge Guide | Hugging Face → Megatron conversion guide | First time converting models |
| Bridge Tech Details | Technical details of the bridge system | Understanding bridge internals |
| Parallelisms Guide | Data and model parallelism strategies | Setting up distributed training |
| Performance Summary | Quick performance reference | Quick performance lookup |
| Performance Guide | Comprehensive performance optimization | Optimizing training performance |
### Model Support

| Document | Purpose | When to Read |
|---|---|---|
| Supported Models | LLM model documentation | Working with LLM models |
| Vision Language Models | VLM model documentation | Working with VLM models |
| Adding New Models | Guide for adding model support | Extending model support |
### Training and Customization

| Document | Purpose | When to Read |
|---|---|---|
| Training Documentation | Comprehensive training guides | Setting up and customizing training |
| Configuration Container Overview | Central training configuration | Understanding training configuration |
| Entry Points | Training entry points and execution | Understanding training flow |
| Training Loop Settings | Training loop parameters | Configuring training parameters |
| Optimizer & Scheduler | Optimization configuration | Setting up optimizers |
| Mixed Precision | Mixed precision training | Reducing memory usage |
| PEFT | Parameter-efficient fine-tuning | Fine-tuning with limited resources |
| Checkpointing | Checkpoint management | Saving and resuming training |
| Logging | Logging and monitoring | Monitoring training progress |
| Profiling | Performance profiling | Identifying bottlenecks |
### Recipes and Workflows

| Document | Purpose | When to Read |
|---|---|---|
| Recipe Usage | Using pre-configured training recipes | Quick training setup |
| Bridge RL Integration | Reinforcement learning integration | RL training workflows |
### Migration Guides

| Document | Purpose | When to Read |
|---|---|---|
| NeMo 2 Migration Guide | Migrating from NeMo 2 | Upgrading from NeMo 2 |
| Megatron-LM Migration Guide | Migrating from Megatron-LM | Upgrading from Megatron-LM |
### Reference

| Document | Purpose | When to Read |
|---|---|---|
| API Documentation | Complete API reference | Building integrations |
| Releases Documentation | Version history and known issues | Checking versions, troubleshooting |
| Contributing Guide | Contributing to documentation | Contributing docs |
## Common Reading Paths

### First-Time Users

1. Bridge Guide (10 min - understand conversion)
2. Parallelisms Guide (15 min - understand distributed training)
3. Training Documentation (choose your training path)
4. Recipe Usage (5 min - use pre-configured recipes)
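The parallelism step above boils down to one sizing rule: in Megatron-style training, the total GPU count factors into tensor-, pipeline-, context-, and data-parallel groups. A minimal plain-Python sketch of that arithmetic (variable names are ours, not Megatron Bridge API names; see the Parallelisms Guide for the real configuration knobs):

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Derive the data-parallel size from the total GPU count.

    GPUs factor as world_size = TP * PP * CP * DP, so the
    data-parallel size is whatever is left over after the
    model-parallel dimensions are fixed.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel

# Example: 64 GPUs with TP=8 and PP=2 leaves 4-way data parallelism.
print(data_parallel_size(64, tp=8, pp=2))  # -> 4
```

If the division does not come out even, the launch fails, which is why recipes pin these three numbers together.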
### Setting Up Training

1. Training Documentation (overview of training system)
2. Configuration Container Overview (understand configuration)
3. Entry Points (how training starts)
4. Training Loop Settings (configure parameters)
5. Logging (set up monitoring)
### Performance Optimization

1. Performance Guide (comprehensive optimization strategies)
2. Performance Summary (quick reference)
3. Mixed Precision (reduce memory usage)
4. Communication Overlap (optimize distributed training)
5. Activation Recomputation (reduce memory footprint)
6. Profiling (identify bottlenecks)
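To make the memory steps in this path concrete: with Adam and bf16 mixed precision (bf16 parameters and gradients plus fp32 master weights and two fp32 optimizer moments), a common rule of thumb is roughly 16 bytes of model state per parameter, before any activation memory. A back-of-envelope sketch under that assumption, not Megatron Bridge code:

```python
def model_state_gib(num_params: float,
                    bytes_param: int = 2,    # bf16 parameter copy
                    bytes_grad: int = 2,     # bf16 gradient
                    bytes_master: int = 4,   # fp32 master weights
                    bytes_moments: int = 8   # fp32 Adam m and v
                    ) -> float:
    """Approximate Adam + mixed-precision model-state memory in GiB.

    Tally: 2 (param) + 2 (grad) + 4 (master) + 8 (moments)
    = 16 bytes per parameter, excluding activations and buffers.
    """
    per_param = bytes_param + bytes_grad + bytes_master + bytes_moments
    return num_params * per_param / 2**30

# An 8B-parameter model carries on the order of 119 GiB of model
# state before activations -- hence sharding, mixed precision, and
# activation recomputation.
print(round(model_state_gib(8e9), 1))  # -> 119.2
```

Numbers like this are why the Mixed Precision and Activation Recomputation guides sit next to the parallelism material: no single GPU holds the full state.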
### Model Conversion Workflow

1. Bridge Guide (conversion basics)
2. Bridge Tech Details (technical details)
3. Supported Models or Vision Language Models (model-specific guides)
4. Adding New Models (extend support)
### Customization and Extension

1. Training Documentation (training customization)
2. PEFT (parameter-efficient fine-tuning)
3. Distillation (knowledge distillation)
4. Adding New Models (add model support)
5. Bridge RL Integration (RL workflows)
### Migration Paths

1. NeMo 2 Migration Guide (from NeMo 2)
2. Megatron-LM Migration Guide (from Megatron-LM)
3. Training Documentation (new training system)
## Directory Structure

### Main Documentation

- Guides - Core guides for parallelisms, performance, recipes, and migration
- Bridge Documentation - Hugging Face → Megatron conversion guides
- Model Documentation - Supported model families and architectures

### Subdirectories

#### `models/`

#### `training/`

- Configuration - ConfigContainer, entry points, training loop settings
- Optimization - Optimizer, scheduler, mixed precision, communication overlap
- Performance - Attention optimizations, activation recomputation, CPU offloading
- Monitoring - Logging, profiling, checkpointing, resiliency
- Advanced - PEFT, packed sequences, distillation

#### `releases/`

- Software Versions - Current versions and dependencies
- Changelog - Release history and changes
- Known Issues - Bugs, limitations, and workarounds
## How Documents Connect

```mermaid
graph TD
    A[README.md<br/>Start Here] --> B[Bridge Guide<br/>Model Conversion]
    A --> C[Training Docs<br/>Training Setup]
    A --> D[Models<br/>Model Support]
    B --> E[Bridge Tech Details<br/>Technical Deep Dive]
    B --> F[Supported Models<br/>Model-Specific Guides]
    C --> G[Config Container<br/>Configuration]
    C --> H[Performance Guide<br/>Optimization]
    C --> I[Parallelisms<br/>Distributed Training]
    G --> J[Training Loop<br/>Training Parameters]
    G --> K[Optimizer & Scheduler<br/>Optimization Setup]
    H --> L[Mixed Precision<br/>Memory Efficiency]
    H --> M[Communication Overlap<br/>Performance]
    I --> N[Data Parallelism<br/>DDP]
    I --> O[Model Parallelism<br/>TP/PP/VPP]
    D --> P[LLM Models<br/>Language Models]
    D --> Q[VLM Models<br/>Vision Language Models]
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style H fill:#fce4ec
    style I fill:#e0f2f1
```
## Getting Help

- **GitHub Issues**: Report bugs or request features
- **Documentation Issues**: Found something unclear? Let us know!
- **Community**: Join discussions and share experiences
## Additional Resources

- Examples - Code examples and tutorials
- Contributing Guide - How to contribute to the project
- API Documentation - Complete API reference

Ready to get started? Choose your path above or dive into the Bridge Guide for model conversion!