Performance Summary#

This document provides performance benchmarks for various large language models using NeMo AutoModel with the PyTorch backend.

Pre-Training Performance#

The table below shows training performance for full sequences with no padding across different model architectures and scales.

System: DGX-H100, Precision: BF16#

| Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron V3 Super 120B (26.02) | 64 | 512 | 2 | 2 | 4 | 4096 | 1 | 1 | 1 | 64 | - | 64 | TE + DeepEP + TorchSDPA | 7.286 | 334 | 4,497 |
| Nemotron V3 Nano 30B (26.02) | 8 | 512 | 4 | 4 | 16 | 4096 | 1 | 1 | 1 | 8 | - | 8 | TE + DeepEP + TorchSDPA | 15.614 | 328 | 16,789 |
| DeepSeek V3 671B | 1024 | 8192 | 1 | 8 | 4 | 4096 | 1 | 4 | 1 | 64 | 8 | 256 | TE + DeepEP | 37.87 | 216 | 865 |
| DeepSeek V3 671B | 256 | 512 | 1 | 8 | 1 | 4096 | 1 | 4 | 1 | 64 | 8 | 64 | TE + DeepEP | 8.18 | 250 | 1,002 |
| Kimi K2 | 256 | 512 | 1 | 8 | 2 | 4096 | 1 | 8 | 1 | 32 | 4 | 32 | TE + DeepEP | 8.86 | 189 | 924 |
| Qwen3 MoE 30B | 8 | 512 | 4 | 4 | 16 | 4096 | 1 | 1 | 1 | 8 | - | 8 | TE + DeepEP | 21.773 | 277 | 12,040 |
| GPT-OSS 20B | 8 | 256 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | - | 8 | TE + DeepEP + FlexAttn | 10.04 | 279 | 13,058 |
| GPT-OSS 120B | 64 | 512 | 2 | 2 | 4 | 4096 | 1 | 1 | 1 | - | - | 64 | TE + DeepEP + FlexAttn | 4.30 | 231 | 7,626 |
| Llama3 70B | 64 | 128 | 1 | 1 | 4 | 8192 | 1 | 1 | 2 | - | - | 32 | TE + fsdp2_prefetch | 18.90 | 389 | 866.77 |

Fine-Tuning (LoRA) Performance#

The table below shows fine-tuning (LoRA) performance for full sequences with no padding across different model architectures and scales.

System: DGX-H100, Precision: BF16#

| Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3 8B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | TE + triton | 10.51 | 402 | 12472.87 |
| Qwen2.5 7B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | TE + triton | 9.29 | 423 | 14110.05 |
| Llama3 70B | 8 | 32 | 2 | 2 | 4 | 4096 | 2 | 1 | 1 | - | 1 | 4 | TE + triton + fsdp2_prefetch | 15.00 | 316 | 1091.85 |
| Qwen2.5 32B | 8 | 32 | 2 | 2 | 4 | 4096 | 2 | 1 | 1 | - | 1 | 4 | TE + triton + fsdp2_prefetch | 7.28 | 301 | 2250.31 |
| Llama3 70B 2-node | 16 | 32 | 2 | 2 | 2 | 4096 | 2 | 1 | 1 | - | 1 | 8 | TE + triton + fsdp2_prefetch | 8.32 | 285 | 984.85 |
| Qwen2.5 32B 2-node | 16 | 32 | 2 | 2 | 2 | 4096 | 2 | 1 | 1 | - | 1 | 8 | TE + triton + fsdp2_prefetch | 3.95 | 277 | 2072.89 |

Glossary#

  • MFU: Model FLOPs Utilization - ratio of achieved compute to peak hardware capability

  • TP: Tensor Parallelism - splits individual layers across GPUs

  • PP: Pipeline Parallelism - splits model layers into stages

  • EP: Expert Parallelism - distributes MoE experts across GPUs

  • DP: Data Parallelism - replicates model and splits data

  • VP: Virtual Pipeline - number of pipeline stages per GPU for interleaving

  • MBS: Micro-Batch Size - number of samples in one forward/backward pass (one pipeline micro-batch)

  • LBS: Local Batch Size - number of samples processed per GPU in one gradient-accumulation step

  • GBS: Global Batch Size - total batch size across all GPUs

  • GA: Gradient Accumulation - number of local-batches before optimizer step

  • TE: Transformer Engine kernel optimizations - RMSNorm, Linear and DotProductAttention

  • DeepEP: Deep Expert Parallelism - advanced EP routing for MoE models

  • FlexAttn: PyTorch’s Flex Attention
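The batch-size terms above relate multiplicatively. A minimal sketch of that relation, checked against two table rows (this assumes the data-parallel size equals the FSDP shard count shown in the tables):

```python
def global_batch_size(dp: int, lbs: int, ga: int) -> int:
    """GBS = data-parallel replicas x local batch per GPU x gradient-accumulation steps."""
    return dp * lbs * ga

# DeepSeek V3 671B at 1024 GPUs: FSDP(=DP assumed) 256, LBS 8, GA 4
print(global_batch_size(256, 8, 4))   # 8192, matching the GBS column

# Nemotron V3 Super 120B: FSDP(=DP assumed) 64, LBS 2, GA 4
print(global_batch_size(64, 2, 4))    # 512
```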

Configuration Files#

Pre-training benchmark configurations are available in examples/benchmark/configs/, and fine-tuning (LoRA) configurations are in examples/llm_finetune/.

Note

  • All benchmarks use mock data for consistent performance measurement.

  • A fake balanced gate is enabled to simulate ideal expert routing.

  • No gradient clipping is applied, for pure performance measurement.

  • MFU is calculated using the system's peak TFLOPs (989 TFLOPs for BF16 on H100).

  • Step times include the forward and backward passes plus the optimizer step for the global batch.
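Given the 989 TFLOPs BF16 peak noted above, MFU for any row is just the achieved model TFLOPs divided by that peak; a minimal sketch:

```python
H100_BF16_PEAK_TFLOPS = 989  # peak per-GPU BF16 throughput used in these benchmarks

def mfu(achieved_tflops: float, peak_tflops: float = H100_BF16_PEAK_TFLOPS) -> float:
    """Model FLOPs Utilization: ratio of achieved compute to peak hardware capability."""
    return achieved_tflops / peak_tflops

# Llama3 70B pre-training row reports 389 model TFLOPs/sec/GPU
print(f"{mfu(389):.1%}")  # 39.3%
```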

Version Information#

  • Last Updated: 2025-10-02

  • NeMo AutoModel Version: main branch