Changelog#
NVIDIA Megatron-Bridge 0.2.0#
- Model Collection Support
  - LLM
    - HuggingFace conversion + training recipes:
      - GPT-OSS
      - Qwen3 Next
      - Nemotron-H
      - Nemotron Nano v2
      - Moonlight
      - OLMoE
      - GLM 4.5
      - Gemma 3
    - HuggingFace conversion support:
      - Llama Nemotron
      - Mistral
      - Gemma
      - Gemma 2
  - VLM
    - Nemotron Nano v2 VL
    - Qwen3 VL
    - Qwen2.5 VL
    - Gemma 3 VL
- Performance
  - Megatron-Bridge support for new benchmarks:
    - Benchmarks for the GB300 system (same workloads as the GB200 system)
    - GPT-OSS 120B
    - Qwen3-Next 80B-A3B
    - Support for linear attention on Blackwell (Gated Delta Networks)
    - Pre-training with NVFP4 precision: Llama3 8B, Llama3 70B, Llama3.1 405B
  - Megatron-Bridge support for benchmarks previously available only in NeMo 2.0:
    - Nemotron-H 56B
    - Fine-tuning (SFT and LoRA): Llama3 8B and Llama3 70B
  - HybridEP: DeepSeek V3 benchmarks on GB200 and GB300 systems now use HybridEP
  - CUDA Graphs (see the capture/replay sketch below this list)
    - Full-model-iteration CUDA graph used for dense models: Llama3 8B, Llama3 70B, Llama3.1 405B
    - Fine-grained, Transformer-component-specific CUDA graphs used for MoE models
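For reference, the CUDA-graph items above build on PyTorch's capture/replay mechanism. The following is a minimal, self-contained sketch of that mechanism only, using a toy model and a forward-only capture; it is not Megatron-Bridge's full-iteration or per-component integration.

```python
import torch

# Toy model and a static input buffer; a real full-iteration graph also
# captures the backward pass and optimizer step with static buffers.
model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream so lazy initialization happens outside capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one forward pass into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the static input buffer, then replay the graph.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output.sum().item())
```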
- NVIDIA Model Optimizer integration
  - Knowledge distillation
  - Post-training quantization (PTQ) export
  - Quantization-aware training (QAT) (see the fake-quantization sketch below this list)
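The QAT item above goes through NVIDIA Model Optimizer; its API is not reproduced here. As a library-agnostic illustration of what quantization-aware training does, the sketch below applies symmetric per-tensor fake quantization with a straight-through estimator. All names are illustrative, not Model Optimizer's.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().amax().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    dequant = q * scale
    # Forward uses the quantize-dequantize value; backward passes the
    # gradient straight through to x.
    return x + (dequant - x).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer that trains against fake-quantized weights and activations."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(
            fake_quantize(x), fake_quantize(self.weight), self.bias
        )
```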
- Enhanced LoRA support
  - Support for expert layers
  - Support for merging adapters for export to HuggingFace (see the merge sketch below this list)
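Adapter merging folds the low-rank update back into the base weight so the exported HuggingFace checkpoint needs no adapter at inference time. A minimal sketch of the underlying arithmetic, W' = W + (alpha / r) * B A; the function and tensor names are illustrative, not Megatron-Bridge's.

```python
import torch

def merge_lora(weight: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor,
               alpha: float, rank: int) -> torch.Tensor:
    """Fold a LoRA update into the base weight: W' = W + (alpha / rank) * B @ A.

    weight: [out_features, in_features]
    lora_a: [rank, in_features]
    lora_b: [out_features, rank]
    """
    return weight + (alpha / rank) * (lora_b @ lora_a)

# Example with illustrative shapes.
w = torch.randn(4096, 4096)
a = torch.randn(16, 4096)
b = torch.zeros(4096, 16)   # B is typically zero-initialized, so W' == W at the start
merged = merge_lora(w, a, b, alpha=32.0, rank=16)
```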
- Fine-tuning dataset improvements: OpenAI messages format conversion and chat template support (see the chat-template sketch below this list)
- Integration with NVIDIA-DLFW-Inspect for tensor statistics collection and monitoring
- Support for sample-based training
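The fine-tuning dataset items above concern data in the OpenAI messages format and rendering it through a tokenizer's chat template. Below is a minimal sketch using the Hugging Face tokenizers API; the model id is an arbitrary example, and this is not Megatron-Bridge's dataset pipeline.

```python
from transformers import AutoTokenizer

# Any HF model that ships a chat template works; this id is just an example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# One conversation in the OpenAI messages format.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does a chat template do?"},
    {"role": "assistant", "content": "It renders structured messages into the model's prompt format."},
]

# Render to text, or tokenize directly for training.
text = tokenizer.apply_chat_template(messages, tokenize=False)
token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
print(text)
```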
NVIDIA Megatron-Bridge 0.1.0rc4#
- Fix docs build
- Update performance scripts
NVIDIA Megatron-Bridge 0.1.0rc3#
- Model Collection Support
  - Llama
  - Qwen 2, Qwen 3, Qwen 3 MoE
  - DeepSeek
  - Mamba
- MoE LLM
  - Changed the model to dropless routing with balanced gating (see the routing sketch below this list)
  - Fusion of operators in the router function
  - Global permutation fusion with the A2A dispatcher
  - EP A2A communication overlapped with computation in both 1F1B pipelining and non-pipelined training
  - Precision-aware optimizer update to support BF16 states
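For context on the dropless, balanced-gating change above: dropless routing dispatches every token to its top-k experts with no capacity-based dropping, and balance is encouraged through an auxiliary load-balancing loss. The sketch below shows simplified top-k routing with a Switch-Transformer-style auxiliary loss; it is not Megatron Core's router code, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def topk_router(logits: torch.Tensor, k: int = 2):
    """logits: [num_tokens, num_experts] router scores."""
    num_experts = logits.size(-1)
    probs = F.softmax(logits, dim=-1)

    # Dropless: every token keeps its top-k experts; nothing is capacity-dropped.
    topk_probs, topk_idx = probs.topk(k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    # Auxiliary load-balancing loss: fraction of tokens routed to each expert
    # times the mean router probability for that expert, summed over experts.
    dispatch_frac = F.one_hot(topk_idx, num_experts).float().sum(dim=1).mean(dim=0)
    mean_prob = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(dispatch_frac * mean_prob)

    return topk_idx, topk_probs, aux_loss

idx, weights, aux = topk_router(torch.randn(1024, 8), k=2)
```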
- Megatron FSDP
  - Migration from MCore FSDP to Megatron FSDP
  - Fusion of the weight-gradient copy to the reduce-scatter communication buffer into the WGRAD GEMM
  - Removed redundant optimizer operations
  - Use of ZeRO-1 (optimizer and master-parameter sharding) in the replica domain of hybrid FSDP to further lower memory usage
  - IB SHARP support for the IB all-reduce of hybrid FSDP, available as a patch with NCCL 2.28
- MXFP8
  - Improved activation-gradient all-gather overlap performance via userbuffers
  - Parameter all-gather overlapped with computation while sharing the communication buffer with reduce-scatter
  - Fusion of MXFP8 scaling-factor swizzling kernels
  - Use of PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead
- Others
  - Full-iteration CUDA graph for dense models without pipelining
  - Fusion of activation and cast (currently tensor-wise scaling only)
  - SwiGLU input stored in FP8 to save activation memory (see the SwiGLU sketch below this list)
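For reference on the SwiGLU item above: SwiGLU splits the up-projection output into a gate half and a linear half and multiplies silu(gate) by the linear half. The plain-PyTorch reference below shows only that math; the FP8 caching of the SwiGLU input for the backward pass happens inside the fused kernels and is not shown.

```python
import torch
import torch.nn.functional as F

def swiglu(x: torch.Tensor) -> torch.Tensor:
    """x holds the concatenated [gate, linear] projection; split on the last dim.

    This input tensor is what gets saved for the backward pass, so storing it
    in FP8 roughly halves that part of the MLP's activation memory versus BF16.
    """
    gate, linear = x.chunk(2, dim=-1)
    return F.silu(gate) * linear

# Example: projection produces 2 * 11008 features, output keeps 11008.
x = torch.randn(4, 2 * 11008)
y = swiglu(x)   # shape [4, 11008]
```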
NVIDIA Megatron-Bridge 0.1.0a0#
- Llama and Qwen
  - Pretraining/SFT
  - PEFT
  - Recipe structure with examples for plain Python and NeMo Run usage