Advanced Features

Guides for Megatron Core training features.

CUDA Graph
  Overview
  CUDA Graph — Local Implementation (--cuda-graph-impl local)
  CUDA Graph — Transformer Engine Implementation (--cuda-graph-impl transformer_engine)
  Full-Iteration Training CUDA Graph (--cuda-graph-impl full_iteration)
  Common Configuration Examples
  Additional Notes
  Migration Guide

Fine-Grained Activation Offloading
  Features
  Usage
  Max inflight offloads
  Compatible With Fine-Grained Recomputation

Mixture of Experts
  Multi-Token Prediction (MTP)
  Multi-Latent Attention
  Design Document: MoE Router Replay Feature
  Megatron Core MoE
  What’s New
  Overview of MCore MoE
  Supported Features and Architectures
  Quick Start Guide
  Best Practices to achieve high performance on MoE training
  Feature Documentation
  Training Optimizations
  MoE Arguments Reference
  Examples
  Contributing
  Support
  Citation

Context Parallel Package
  Context Parallelism Overview
  Context Parallelism Benefits
  Enabling Context Parallelism

Megatron-FSDP
  ✨ Overview
  🚀 Quick Start
  Megatron-FSDP Feature Guide & API

Distributed Optimizer
  Data Flow
  Sharding Scheme
  Key Steps

Optimizer CPU Offload
  How to use?
  Configuration Recommendations

MoE Paged Stash
  Prerequisites
  Configuration
  Tuning (paged stashing only)
  What moe_expert_rank_capacity_factor and moe_paged_stash_buffer_size_factor_cuda mean
  Choosing moe_expert_rank_capacity_factor and stash buffer scales

Custom Pipeline Model Parallel Layout

Tokenizers
  Overview
  Key Features
  Basic Usage
  Advanced Usage
  Integration with Megatron-LM
  Supported Tokenizer Libraries
  Common Tokenizer Types
  Recommendations
  Next Steps

Megatron Energon
  Overview
  Installation
  Key Features
  Basic Usage
  Multimodal Example
  Dataset Blending
  Configuration
  Integration with Megatron-LM
  Resources
  Next Steps

Megatron RL
  Overview
  Key Features
  Architecture
  Use Cases
  Resources