Changelog#

NVIDIA Megatron-Bridge 0.2.0#

  • Model Collection Support

    • LLM

      • HuggingFace Conversion + training recipes:

        • GPT-oss

        • Qwen3 Next

        • Nemotron-H

        • Nemotron Nano v2

        • Moonlight

        • OLMoE

        • GLM 4.5

        • Gemma 3

      • HuggingFace conversion support:

        • Llama Nemotron

        • Mistral

        • Gemma

        • Gemma 2

    • VLM

      • Nemotron Nano v2 VL

      • Qwen 3 VL

      • Qwen2.5 VL

      • Gemma3 VL

  • Performance

    • Megatron-Bridge support for new benchmarks

      • Benchmarks (same workloads as GB200 system) for GB300 system

      • GPT-OSS 120B

      • Qwen3-Next 80B-A3B

      • Support for linear attention on Blackwell - Gated Delta Networks

      • Pre-training with NVFP4 precision: Llama3 8B, Llama3 70B, Llama3.1 405B

    • Megatron-Bridge support for benchmarks previously available only in NeMo 2.0

      • Nemotron-H 56B

      • Fine-tuning (SFT and LoRA): Llama3 8B and Llama3 70B

    • HybridEP: DeepSeek V3 benchmarks on GB200 and GB300 systems now use HybridEP

    • CUDA Graphs

      • Full-model iteration CUDA graph used for dense models: Llama3 8B, Llama3 70B, Llama3.1 405B

      • Fine-grained, Transformer-component-specific CUDA graphs used for MoE models

  • NVIDIA Model Optimization Integration

    • Knowledge Distillation

    • Post-training quantization export

    • Quantization-aware training

  • Enhanced LoRA support

    • Support for expert layers

    • Support for merging adapters for export to HuggingFace (see the LoRA merge sketch after this list)

  • Fine-tuning dataset improvements: OpenAI messages format conversion and chat template support (see the chat-template sketch after this list)

  • Integration with NVIDIA-DLFW-Inspect for tensor statistics collection and monitoring

  • Support for sample-based training
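
For the merged-adapter export item above, the snippet below is a minimal sketch of the equivalent flow using Hugging Face PEFT rather than Megatron-Bridge's own export path; the model name and paths are purely illustrative.

    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Load the base model and attach a trained LoRA adapter (paths are hypothetical).
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype="auto")
    model = PeftModel.from_pretrained(base, "./checkpoints/my-lora-adapter")

    # Fold the low-rank updates into the base weights and save a plain HF checkpoint.
    merged = model.merge_and_unload()
    merged.save_pretrained("./exports/llama3.1-8b-merged")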

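The fine-tuning dataset item above (OpenAI messages format and chat templates) can be illustrated with the Hugging Face tokenizer API; a minimal sketch, assuming a chat-tuned tokenizer that ships a chat template (the model name is only an example).

    from transformers import AutoTokenizer

    # One conversation in the OpenAI "messages" format: a list of {"role", "content"} dicts.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does Megatron-Bridge do?"},
        {"role": "assistant", "content": "It converts checkpoints between Hugging Face and Megatron."},
    ]

    # The tokenizer's chat template renders the conversation into the model's prompt format,
    # either as a string or directly as token IDs.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
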
NVIDIA Megatron-Bridge 0.1.0rc4#

  • Fix docs build

  • Update performance scripts

NVIDIA Megatron-Bridge 0.1.0rc3#

  • Model Collection Support

    • Llama

    • Qwen 2, Qwen 3, Qwen 3 MoE

    • DeepSeek

    • Mamba

  • Migration guide from NeMo 2 to Megatron-Bridge

  • Contribution guide for adding a new model

  • Checkpoint conversion from Hugging Face to Megatron (see the conversion sketch after this list)

  • Performance

    • MoE LLM

      • Changed the model to dropless routing with balanced gating

      • Fusion of operators in router function

      • Global permutation fusion with A2A dispatcher

      • EP A2A communication overlap with computation in both 1F1B pipelining and non-pipelined training

      • Precision-aware optimizer update to support BF16 states

    • Megatron FSDP

      • Migration from MCore FSDP to Megatron FSDP

      • Fused the copy of weight gradients to the reduce-scatter communication buffer into the WGRAD GEMM

      • Removed redundant optimizer operations

      • Use ZeRO-1 (optimizer and master parameter sharding) in the replica domain of hybrid FSDP to further lower memory usage

      • IB SHARP support for the InfiniBand AllReduce of hybrid FSDP via a patch with NCCL 2.28

    • MXFP8

      • Improved activation-gradient all-gather overlap performance via userbuffers

      • Parameter all-gather overlapped with computation, with the communication buffer shared with reduce-scatter

      • Fusion of MXFP8 scaling factor swizzling kernels

      • Use PDL (Programmatic Dependent Launch) for quantization kernels to lower CPU overhead

    • Others

      • Full-iteration CUDA graph for dense models without pipelining (see the CUDA graph sketch after this list)

      • Fusion of activation and cast operations (currently tensor-wise scaling only)

      • Store SwiGLU input in FP8 to save activation memory
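
For the Hugging Face-to-Megatron checkpoint conversion item above, the sketch below shows the general shape of such a conversion; the AutoBridge class, its method names, and the model identifier are assumptions for illustration, not a confirmed API surface. Consult the Megatron-Bridge documentation for the exact calls.

    # Hypothetical usage sketch; names are assumed, not verified against the released API.
    from megatron.bridge import AutoBridge

    # Load a Hugging Face checkpoint and map its weights onto a Megatron model configuration.
    bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")
    provider = bridge.to_megatron_provider()
    # The provider can then build the Megatron model for training or further conversion.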

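The full-iteration CUDA graph item above follows the standard PyTorch whole-network capture pattern; a minimal sketch with a toy model and static input/target buffers, assuming no pipeline parallelism.

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
                                torch.nn.Linear(1024, 1024)).cuda()
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    static_input = torch.randn(8, 1024, device="cuda")
    static_target = torch.randn(8, 1024, device="cuda")

    # Warm up on a side stream so workspaces are allocated before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(static_input), static_target).backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(s)

    # Capture one full training iteration (forward, backward, optimizer step) into a graph.
    g = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(g):
        static_loss = loss_fn(model(static_input), static_target)
        static_loss.backward()
        optimizer.step()

    # Replay: copy fresh data into the static buffers, then launch the whole iteration at once.
    for _ in range(10):
        static_input.copy_(torch.randn(8, 1024, device="cuda"))
        static_target.copy_(torch.randn(8, 1024, device="cuda"))
        g.replay()
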
NVIDIA Megatron-Bridge 0.1.0a0#

  • Llama and Qwen

  • Pretrain/SFT

  • PEFT

  • Recipe structure with examples for plain Python and NeMo Run usage