Changelog#
NVIDIA Megatron-Bridge 0.2.0#
- Model Collection Support
  - LLM
    - HuggingFace conversion + training recipes:
      - GPT-OSS
      - Qwen3 Next
      - Nemotron-H
      - Nemotron Nano v2
      - Moonlight
      - OLMoE
      - GLM 4.5
      - Gemma 3
    - HuggingFace conversion support:
      - Llama Nemotron
      - Mistral
      - Gemma
      - Gemma 2
  - VLM
    - Nemotron Nano v2 VL
    - Qwen3 VL
    - Qwen2.5 VL
    - Gemma 3 VL
- Performance
  - Megatron-Bridge support for new benchmarks:
    - Benchmarks for the GB300 system (same workloads as the GB200 system)
    - GPT-OSS 120B
    - Qwen3-Next 80B-A3B
    - Support for linear attention on Blackwell (Gated Delta Networks)
    - Pre-training with NVFP4 precision: Llama3 8B, Llama3 70B, Llama3.1 405B
  - Megatron-Bridge support for benchmarks previously available only in NeMo 2.0:
    - Nemotron-H 56B
    - Fine-tuning (SFT and LoRA): Llama3 8B and Llama3 70B
  - HybridEP: DeepSeek V3 benchmarks on GB200 and GB300 systems now use HybridEP
  - CUDA Graphs (see the capture/replay sketch below this list)
    - Full-model-iteration CUDA graph used for dense models: Llama3 8B, Llama3 70B, Llama3.1 405B
    - Fine-grained, Transformer-component-specific CUDA graphs used for MoE models
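For reference, the CUDA-graph items above build on PyTorch's capture/replay mechanism. The following is a minimal, self-contained sketch of that mechanism only, using a toy model and a forward-only capture; it is not Megatron-Bridge's full-iteration or per-component integration.

```python
import torch

# Toy model and a static input buffer; a real full-iteration graph also
# captures the backward pass and optimizer step with static buffers.
model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream so lazy initialization happens outside capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one forward pass into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the static input buffer, then replay the graph.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output.sum().item())
```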
- NVIDIA Model Optimizer integration
  - Knowledge distillation
  - Post-training quantization (PTQ) export
  - Quantization-aware training (QAT) (see the fake-quantization sketch below this list)
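The QAT item above goes through NVIDIA Model Optimizer; its API is not reproduced here. As a library-agnostic illustration of what quantization-aware training does, the sketch below applies symmetric per-tensor fake quantization with a straight-through estimator. All names are illustrative, not Model Optimizer's.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().amax().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    dequant = q * scale
    # Forward uses the quantize-dequantize value; backward passes the
    # gradient straight through to x.
    return x + (dequant - x).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer that trains against fake-quantized weights and activations."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(
            fake_quantize(x), fake_quantize(self.weight), self.bias
        )
```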
- Enhanced LoRA support
  - Support for expert layers
  - Support for merging adapters for export to HuggingFace (see the merge sketch below this list)
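Adapter merging folds the low-rank update back into the base weight so the exported HuggingFace checkpoint needs no adapter at inference time. A minimal sketch of the underlying arithmetic, W' = W + (alpha / r) * B A; the function and tensor names are illustrative, not Megatron-Bridge's.

```python
import torch

def merge_lora(weight: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor,
               alpha: float, rank: int) -> torch.Tensor:
    """Fold a LoRA update into the base weight: W' = W + (alpha / rank) * B @ A.

    weight: [out_features, in_features]
    lora_a: [rank, in_features]
    lora_b: [out_features, rank]
    """
    return weight + (alpha / rank) * (lora_b @ lora_a)

# Example with illustrative shapes.
w = torch.randn(4096, 4096)
a = torch.randn(16, 4096)
b = torch.zeros(4096, 16)   # B is typically zero-initialized, so W' == W at the start
merged = merge_lora(w, a, b, alpha=32.0, rank=16)
```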
- Fine-tuning dataset improvements: OpenAI messages format conversion and chat template support (see the chat-template sketch below this list)
- Integration with NVIDIA-DLFW-Inspect for tensor statistics collection and monitoring
- Support for sample-based training
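The fine-tuning dataset items above concern data in the OpenAI messages format and rendering it through a tokenizer's chat template. Below is a minimal sketch using the Hugging Face tokenizers API; the model id is an arbitrary example, and this is not Megatron-Bridge's dataset pipeline.

```python
from transformers import AutoTokenizer

# Any HF model that ships a chat template works; this id is just an example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# One conversation in the OpenAI messages format.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does a chat template do?"},
    {"role": "assistant", "content": "It renders structured messages into the model's prompt format."},
]

# Render to text, or tokenize directly for training.
text = tokenizer.apply_chat_template(messages, tokenize=False)
token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
print(text)
```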
NVIDIA Megatron-Bridge 0.1.0rc4#
- Fix docs build
- Update performance scripts
NVIDIA Megatron-Bridge 0.1.0rc3#
- Model Collection Support
  - Llama
  - Qwen 2, Qwen 3, Qwen 3 MoE
  - DeepSeek
  - Mamba
- MoE LLM
  - Changed the model to dropless routing with balanced gating (see the routing sketch below this list)
  - Fusion of operators in the router function
  - Global permutation fusion with the A2A dispatcher
  - EP A2A communication overlapped with computation in both 1F1B pipelining and non-pipelined training
  - Precision-aware optimizer update to support BF16 states
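For context on the dropless, balanced-gating change above: dropless routing dispatches every token to its top-k experts with no capacity-based dropping, and balance is encouraged through an auxiliary load-balancing loss. The sketch below shows simplified top-k routing with a Switch-Transformer-style auxiliary loss; it is not Megatron Core's router code, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def topk_router(logits: torch.Tensor, k: int = 2):
    """logits: [num_tokens, num_experts] router scores."""
    num_experts = logits.size(-1)
    probs = F.softmax(logits, dim=-1)

    # Dropless: every token keeps its top-k experts; nothing is capacity-dropped.
    topk_probs, topk_idx = probs.topk(k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    # Auxiliary load-balancing loss: fraction of tokens routed to each expert
    # times the mean router probability for that expert, summed over experts.
    dispatch_frac = F.one_hot(topk_idx, num_experts).float().sum(dim=1).mean(dim=0)
    mean_prob = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(dispatch_frac * mean_prob)

    return topk_idx, topk_probs, aux_loss

idx, weights, aux = topk_router(torch.randn(1024, 8), k=2)
```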
- Megatron FSDP
  - Migration from MCore FSDP to Megatron FSDP
  - Fusion of the weight-gradient copy to the reduce-scatter communication buffer into the WGRAD GEMM
  - Removed redundant optimizer operations
  - Use of ZeRO-1 (optimizer and master-parameter sharding) in the replica domain of hybrid FSDP to further lower memory usage
  - IB SHARP support for the IB all-reduce of hybrid FSDP, available as a patch with NCCL 2.28
- MXFP8
  - Improved activation-gradient all-gather overlap performance via userbuffers
  - Parameter all-gather overlapped with computation while sharing the communication buffer with reduce-scatter
  - Fusion of MXFP8 scaling-factor swizzling kernels
  - Use of PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead
- Others
  - Full-iteration CUDA graph for dense models without pipelining
  - Fusion of activation and cast (currently tensor-wise scaling only)
  - SwiGLU input stored in FP8 to save activation memory (see the SwiGLU sketch below this list)
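For reference on the SwiGLU item above: SwiGLU splits the up-projection output into a gate half and a linear half and multiplies silu(gate) by the linear half. The plain-PyTorch reference below shows only that math; the FP8 caching of the SwiGLU input for the backward pass happens inside the fused kernels and is not shown.

```python
import torch
import torch.nn.functional as F

def swiglu(x: torch.Tensor) -> torch.Tensor:
    """x holds the concatenated [gate, linear] projection; split on the last dim.

    This input tensor is what gets saved for the backward pass, so storing it
    in FP8 roughly halves that part of the MLP's activation memory versus BF16.
    """
    gate, linear = x.chunk(2, dim=-1)
    return F.silu(gate) * linear

# Example: projection produces 2 * 11008 features, output keeps 11008.
x = torch.randn(4, 2 * 11008)
y = swiglu(x)   # shape [4, 11008]
```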
NVIDIA Megatron-Bridge 0.1.0a0#
- Llama and Qwen
  - Pretraining/SFT
  - PEFT
  - Recipe structure with examples for plain Python and NeMo Run usage