Performance#

As part of the NVIDIA NeMo Framework, Megatron Bridge achieves high training throughput for advanced generative AI models by incorporating recent training techniques such as model parallelism and optimized attention mechanisms.

This page provides performance benchmarks for large language models trained with Megatron Bridge across different GPU systems and configurations.

Nomenclature#

  • GBS: Global Batch Size

  • MBS: Micro Batch Size

  • FSDP: Fully Sharded Data Parallel

    • FSDP = 1: use FSDP

    • FSDP = 0: use DDP (Distributed Data Parallel)

  • TP: Tensor Parallel Size

  • PP: Pipeline Parallel Size

  • CP: Context Parallel Size

  • VP: Virtual Pipeline Parallel Size

  • EP: Expert Parallel Size

  • GA: Number of Gradient Accumulation steps (see the relationship sketch after this list)
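
The batch-size and parallelism terms above are related by simple arithmetic: GPUs not assigned to tensor, pipeline, or context parallelism form the data-parallel groups, and GA is however many micro-batches each data-parallel rank must accumulate to reach the global batch size. The sketch below illustrates this with hypothetical values; the variable names are illustrative only and are not Megatron Bridge configuration fields.

```python
# Illustrative arithmetic only; values and names are hypothetical,
# not Megatron Bridge configuration fields.
num_gpus = 512          # total GPUs in the job
tp, pp, cp = 4, 8, 1    # tensor-, pipeline-, and context-parallel sizes
gbs, mbs = 512, 1       # global and micro batch sizes

# GPUs not consumed by model parallelism form the data-parallel (DP) groups.
dp = num_gpus // (tp * pp * cp)   # 512 // 32 = 16

# Each DP rank contributes MBS sequences per micro-batch, so reaching the
# global batch size requires this many gradient accumulation (GA) steps.
ga = gbs // (mbs * dp)            # 512 // 16 = 32
```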

Performance Metrics#

Performance is measured using the following metrics (see the sketch after this list for how they can be derived):

  • Tokens/sec/GPU: Number of training tokens processed per second per GPU

  • Model TFLOP/sec/GPU: Model teraFLOPs (trillions of floating-point operations) per second per GPU
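
As a rough sketch of how these metrics relate to raw training logs, the example below derives both from a measured step time. The values and the 6 × parameters × tokens FLOP approximation are illustrative assumptions, not the exact accounting used to produce the tables on this page.

```python
# Hypothetical log values; not taken from the benchmark tables.
num_gpus = 256
gbs = 256            # global batch size (sequences per step)
seq_len = 8192       # tokens per sequence
step_time_s = 7.0    # measured wall-clock time per training step, in seconds
num_params = 70e9    # dense model parameter count

tokens_per_step = gbs * seq_len
tokens_per_sec_per_gpu = tokens_per_step / (step_time_s * num_gpus)

# Rough dense-transformer estimate: ~6 FLOPs per parameter per token for the
# combined forward and backward pass (attention FLOPs ignored for brevity).
flops_per_step = 6 * num_params * tokens_per_step
model_tflops_per_sec_per_gpu = flops_per_step / (step_time_s * num_gpus * 1e12)

print(f"{tokens_per_sec_per_gpu:,.0f} tokens/sec/GPU, "
      f"{model_tflops_per_sec_per_gpu:,.0f} model TFLOP/sec/GPU")
```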

Performance Summary for Large Language Models#

Below are performance benchmarks for various large language models organized by release version. These results were obtained using performance recipes available here.

The performance data includes:

  • Pre-training Performance: Throughput metrics for various model sizes and architectures

  • System Configurations: Results across different GPU systems (DGX-GB200, DGX-B200, DGX-H100)

  • Precision Options: Performance comparisons between different precision modes (BF16, FP8, MXFP8)


25.09 NeMo Container#

Pre-Training Performance#

System: DGX-GB200#

Performance tables will be added here

System: DGX-B200#

Performance tables will be added here

System: DGX-H100#

Performance tables will be added here