Megatron Bridge Documentation
Welcome to the Megatron Bridge documentation. This guide helps you find what you need for training, converting, and working with language, vision-language, audio, and multimodal models.
Quick Start Paths
I want to
Get started with model conversion → Start with Bridge Guide for Hugging Face ↔ Megatron conversion
Understand parallelisms and performance → Jump to Parallelisms Guide and Performance Guide
Start training a model → See Training Documentation for comprehensive training guides
Find model documentation → Browse Supported Models, organized by family and model variant
Migrate from NeMo 2 or Megatron-LM → Check NeMo 2 Migration Guide or Megatron-LM Migration Guide
Use training recipes → Read Recipe Usage for pre-configured training recipes
Add support for a new model → Refer to Adding New Models
Check version information → See Releases Documentation for versions, changelog, and known issues
Documentation by Role
For ML Engineers & Researchers
Start here: Bridge Guide → Training Documentation
Deep dive: Performance Guide → Training Optimization Guides
Model support: Supported Models → Adding New Models
For Training Engineers
Start here: Training Documentation → Configuration Container Overview
Performance: Performance Guide → Performance Summary
Parallelisms: Parallelisms Guide → Training Optimization
For Model Developers
Start here: Bridge Guide → Bridge Tech Details
Model support: Adding New Models → Model Documentation
Integration: Bridge RL Integration
For DevOps & Platform Teams
Start here: Releases Documentation → Software Versions
Troubleshooting: Known Issues
API Reference: API Documentation
Complete Documentation Index
Getting Started
| Document | Purpose | When to Read |
|---|---|---|
| Bridge Guide | Hugging Face ↔ Megatron conversion guide | First time converting models |
| Bridge Tech Details | Technical details of the bridge system | Understanding bridge internals |
| Parallelisms Guide | Data and model parallelism strategies | Setting up distributed training |
| Performance Summary | Quick performance reference | Quick performance lookup |
| Performance Guide | Comprehensive performance optimization | Optimizing training performance |
Model Support
| Document | Purpose | When to Read |
|---|---|---|
| Supported Models | Family-organized model documentation | Working with supported model variants |
| Adding New Models | Guide for adding model support | Extending model support |
Training and Customization
| Document | Purpose | When to Read |
|---|---|---|
| Training Documentation | Comprehensive training guides | Setting up and customizing training |
| Configuration Container Overview | Central training configuration | Understanding training configuration |
| Entry Points | Training entry points and execution | Understanding training flow |
| Training Loop Settings | Training loop parameters | Configuring training parameters |
| Optimizer & Scheduler | Optimization configuration | Setting up optimizers |
| Mixed Precision | Mixed precision training | Reducing memory usage |
| PEFT | Parameter-efficient fine-tuning | Fine-tuning with limited resources |
| Checkpointing | Checkpoint management | Saving and resuming training |
| Logging | Logging and monitoring | Monitoring training progress |
| Profiling | Performance profiling | Identifying bottlenecks |
Recipes and Workflows
| Document | Purpose | When to Read |
|---|---|---|
| Recipe Usage | Using pre-configured training recipes | Quick training setup |
| Bridge RL Integration | Reinforcement learning integration | RL training workflows |
Migration Guides
| Document | Purpose | When to Read |
|---|---|---|
| NeMo 2 Migration Guide | Migrating from NeMo 2 | Upgrading from NeMo 2 |
| Megatron-LM Migration Guide | Migrating from Megatron-LM | Upgrading from Megatron-LM |
Reference
| Document | Purpose | When to Read |
|---|---|---|
| API Documentation | Complete API reference | Building integrations |
| Releases Documentation | Version history and known issues | Checking versions, troubleshooting |
|  | Contributing to documentation | Contributing docs |
Common Reading Paths
First-Time Users
Bridge Guide (10 min - understand conversion)
Parallelisms Guide (15 min - understand distributed training)
Training Documentation (choose your training path)
Recipe Usage (5 min - use pre-configured recipes)
Setting Up Training
Training Documentation (overview of training system)
Configuration Container Overview (understand configuration)
Entry Points (how training starts)
Training Loop Settings (configure parameters)
Logging (set up monitoring)
Performance Optimization
Performance Guide (comprehensive optimization strategies)
Performance Summary (quick reference)
Mixed Precision (reduce memory usage)
Communication Overlap (optimize distributed training)
Activation Recomputation (reduce memory footprint)
Profiling (identify bottlenecks)
Model Conversion Workflow
Bridge Guide (conversion basics)
Bridge Tech Details (technical details)
Supported Models (model-specific guides)
Adding New Models (extend support)
Customization and Extension
Training Documentation (training customization)
PEFT (parameter-efficient fine-tuning)
Adding New Models (add model support)
Bridge RL Integration (RL workflows)
Model Optimization (ModelOpt)
Model Optimization Overview (quantization, distillation, pruning)
Quantization (PTQ and QAT)
Distillation (knowledge distillation)
Pruning (Minitron-based pruning)
Migration Paths
NeMo 2 Migration Guide (from NeMo 2)
Megatron-LM Migration Guide (from Megatron-LM)
Training Documentation (new training system)
Directory Structure
Main Documentation
Guides - Core guides for parallelisms, performance, recipes, and migration
Bridge Documentation - Hugging Face ↔ Megatron conversion guides
Model Documentation - Supported model families and architectures
Subdirectories
models/
training/
Configuration - ConfigContainer, entry points, training loop settings
Optimization - Optimizer, scheduler, mixed precision, communication overlap
Performance - Attention optimizations, activation recomputation, CPU offloading
Monitoring - Logging, profiling, checkpointing, resiliency
Advanced - PEFT, packed sequences
modelopt/
Quantization - PTQ and QAT workflows (FP8, INT8, INT4)
Distillation - Knowledge distillation from teacher to student
Pruning - Minitron-based depth and width pruning
releases/
Software Versions - Current versions and dependencies
Changelog - Release history and changes
Known Issues - Bugs, limitations, and workarounds
How Documents Connect
graph TD
A[README.md<br/>Start Here] --> B[Bridge Guide<br/>Model Conversion]
A --> C[Training Docs<br/>Training Setup]
A --> D[Models<br/>Model Support]
B --> E[Bridge Tech Details<br/>Technical Deep Dive]
B --> F[Supported Models<br/>Model-Specific Guides]
C --> G[Config Container<br/>Configuration]
C --> H[Performance Guide<br/>Optimization]
C --> I[Parallelisms<br/>Distributed Training]
G --> J[Training Loop<br/>Training Parameters]
G --> K[Optimizer & Scheduler<br/>Optimization Setup]
H --> L[Mixed Precision<br/>Memory Efficiency]
H --> M[Communication Overlap<br/>Performance]
I --> N[Data Parallelism<br/>DDP]
I --> O[Model Parallelism<br/>TP/PP/VPP]
D --> P[Family Index<br/>Model Variants]
D --> Q[Model Pages<br/>Recipes and Conversion]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
style H fill:#fce4ec
style I fill:#e0f2f1
Getting Help
GitHub Issues: Report bugs or request features
Documentation Issues: Found something unclear? Let us know!
Community: Join discussions and share experiences
Additional Resources
Examples - Code examples and tutorials
Contributing Guide - How to contribute to the project
API Documentation - Complete API reference
Ready to get started? Choose your path above or dive into the Bridge Guide for model conversion!