Performance Summary
This document provides performance benchmarks for various large language models using NeMo AutoModel with the PyTorch backend.
Pre-Training Performance
The table below shows training performance for full sequences with no padding across different model architectures and scales.
System: DGX-H100, Precision: BF16
Fine-Tuning (LoRA) Performance
The table below shows fine-tuning (LoRA) performance for full sequences with no padding across different model architectures and scales.
System: DGX-H100, Precision: BF16
Glossary
- MFU: Model FLOPs Utilization - ratio of achieved compute to peak hardware capability
- TP: Tensor Parallelism - splits individual layers across GPUs
- PP: Pipeline Parallelism - splits model layers into stages
- EP: Expert Parallelism - distributes MoE experts across GPUs
- DP: Data Parallelism - replicates model and splits data
- VP: Virtual Pipeline - number of pipeline stages per GPU for interleaving
- MBS: Micro-Batch Size - number of samples in one forward pass through the pipeline
- LBS: Local Batch Size - number of samples processed per GPU in one step
- GBS: Global Batch Size - total batch size across all GPUs
- GA: Gradient Accumulation - number of local batches accumulated before each optimizer step
- TE: Transformer Engine kernel optimizations - RMSNorm, Linear and DotProductAttention
- DeepEP: Deep Expert Parallelism - advanced EP routing for MoE models
- FlexAttn: PyTorch’s Flex Attention
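The batch-size and parallelism terms in the glossary are related by simple arithmetic. The sketch below is illustrative only: the `ParallelLayout` class and its field names are hypothetical, and it assumes that DP is whatever remains after dividing the GPU count by TP × PP and that GBS = LBS × GA × DP. It ignores EP and VP, which in many setups reuse existing ranks and do not change the number of data-parallel replicas; the benchmark configurations may compute these values differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical container for the glossary's parallelism and batch terms."""

    num_gpus: int
    tp: int = 1   # tensor-parallel degree
    pp: int = 1   # pipeline-parallel degree
    lbs: int = 1  # local batch size (samples per GPU per local step)
    ga: int = 1   # gradient-accumulation steps per optimizer step

    @property
    def dp(self) -> int:
        # Assumption: GPUs not consumed by TP or PP form the data-parallel group.
        model_parallel = self.tp * self.pp
        if self.num_gpus % model_parallel != 0:
            raise ValueError("num_gpus must be divisible by tp * pp")
        return self.num_gpus // model_parallel

    @property
    def gbs(self) -> int:
        # Assumption: GBS = LBS * GA * DP.
        return self.lbs * self.ga * self.dp


# Illustrative values only, not taken from any benchmark configuration.
layout = ParallelLayout(num_gpus=64, tp=2, pp=4, lbs=2, ga=8)
print(layout.dp, layout.gbs)  # 8 128
```

With 64 GPUs, TP=2 and PP=4, eight model replicas remain for data parallelism, so a local batch of 2 accumulated over 8 steps gives a global batch of 128 under these assumptions.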
Configuration Files
Pre-training and fine-tuning (LoRA) benchmark configurations are available in examples/llm_benchmark/:
- deepseek_v3_te_deepep.yaml - DeepSeek V3 with TE + DeepEP
- kimi_k2_te_deepep.yaml - Kimi K2 optimized configuration
- qwen3_moe_30b_te_deepep.yaml - Qwen3 MoE with TE + DeepEP
- gptoss_20b_te_deepep.yaml - GPT-OSS 20B with optimizations
- gptoss_120b_te_deepep.yaml - GPT-OSS 120B optimized
- custom_llama3_1_70b_pretrain_benchmark_8nodes.yaml - Llama3-70B optimized
- llama3_1_8b_peft_benchmark.yaml - Llama-8B fine-tuning (LoRA) optimized
- qwen2_5_7b_peft_benchmark.yaml - Qwen2.5-7B fine-tuning (LoRA) optimized
- custom_llama3_3_70b_instruct_peft_benchmark.yaml - Llama-70B fine-tuning (LoRA) optimized
- custom_qwen2_5_32b_peft_benchmark.yaml - Qwen2.5-32B fine-tuning (LoRA) optimized
- custom_llama3_3_70b_instruct_peft_benchmark_2nodes.yaml - Llama-70B fine-tuning (LoRA) optimized on 2 nodes
- custom_qwen2_5_32b_peft_benchmark_2nodes.yaml - Qwen2.5-32B fine-tuning (LoRA) optimized on 2 nodes
Notes:
- All benchmarks use mock data for consistent performance measurement.
- A fake balanced gate is enabled to simulate ideal expert routing in MoE models.
- No gradient clipping is applied, so that the results reflect pure training performance.
- MFU is calculated using the peak TFLOPs for the system (989 for BF16 on H100).
- Step times include the forward and backward passes plus the optimizer step for the global batch.
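The MFU definition in the glossary and the notes above can be combined into a short calculation. The following is a minimal sketch, not the formula used by the benchmark scripts: the function names are hypothetical, the 989 TFLOPs peak is treated as a per-GPU figure, and the FLOPs per step are estimated with the common 6 × parameters × tokens rule of thumb for dense models, which ignores attention FLOPs. For MoE models the active parameter count would be the closer analogue. The numbers in the example are made up for illustration and are not benchmark results.

```python
H100_BF16_PEAK_TFLOPS = 989.0  # peak quoted in the notes; assumed here to be per GPU


def dense_train_flops(num_params: float, tokens_per_step: float) -> float:
    # Assumption: 6 * params * tokens approximates forward + backward FLOPs
    # for a dense transformer, ignoring attention-score FLOPs.
    return 6.0 * num_params * tokens_per_step


def achieved_tflops_per_gpu(flops_per_step: float, step_time_s: float, num_gpus: int) -> float:
    # Step time covers forward, backward and optimizer step for the global batch.
    return flops_per_step / step_time_s / num_gpus / 1e12


def mfu(flops_per_step: float, step_time_s: float, num_gpus: int,
        peak_tflops: float = H100_BF16_PEAK_TFLOPS) -> float:
    # MFU = achieved compute per GPU / peak hardware capability per GPU.
    return achieved_tflops_per_gpu(flops_per_step, step_time_s, num_gpus) / peak_tflops


# Illustrative inputs only: 8B parameters, 128 sequences of 4096 tokens,
# 8 GPUs, and a hypothetical 10-second step.
tokens_per_step = 128 * 4096
flops = dense_train_flops(8e9, tokens_per_step)
example_mfu = mfu(flops, step_time_s=10.0, num_gpus=8)
print(f"MFU: {example_mfu:.1%}")
```

Because the step time is measured for the whole global batch, the FLOPs estimate must also cover the whole global batch before dividing by the GPU count.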
Version Information
- Last Updated: 2025-10-02
- NeMo AutoModel Version: main branch