Performance#

The NVIDIA NeMo Framework accelerates the AI workflow end to end, from data preparation through model training to inference. It achieves high training throughput for advanced generative AI models by incorporating recent training techniques such as model parallelism, optimized attention mechanisms, and low-precision (FP8) training. For inference, the NeMo Framework provides a deployment path that leverages TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
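
As an illustration of that inference path, the sketch below exports a trained .nemo checkpoint to TensorRT-LLM engines using the nemo.export module. The checkpoint path is hypothetical, and the exact argument names vary across NeMo releases, so treat this as a sketch rather than a definitive recipe.

```python
# Sketch of the NeMo -> TensorRT-LLM export path. The checkpoint path is
# hypothetical, and argument names may differ between NeMo releases.
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/llama3_trtllm_engine")  # where built engines are written
exporter.export(
    nemo_checkpoint_path="/checkpoints/llama3-8b.nemo",  # hypothetical .nemo checkpoint
    model_type="llama",
)

# The same object can then generate against the built engines.
print(exporter.forward(["Write a haiku about GPUs."], max_output_len=32))
```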

Performance Summary for Large Language Models#

Below are performance benchmarks for various large language models. These results were obtained using a version of the performance recipes available here.

  • Abbreviations:

    • GBS: Global Batch Size

    • MBS: Micro Batch Size

    • FSDP: Fully Sharded Data Parallel (0 = disabled, 1 = enabled)

    • TP: Tensor Parallel Size

    • PP: Pipeline Parallel Size

    • CP: Context Parallel Size

    • VP: Virtual Pipeline Parallel Size

    • EP: Expert Parallel Size

    • GA: Number of Gradient Accumulation steps (see the sketch after this list)
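
These quantities are related. Assuming a Megatron-style layout in which VP and EP do not change the data-parallel size (and FSDP is disabled), the data-parallel size is the GPU count divided by the product of TP, PP, and CP, and GA follows from the global and micro batch sizes:

```python
# Relationship between the batch and parallelism sizes listed in the tables below.
# Assumes VP and EP do not change the data-parallel size and FSDP is disabled.

def derived_sizes(num_gpus: int, gbs: int, mbs: int, tp: int, pp: int, cp: int):
    dp = num_gpus // (tp * pp * cp)  # data-parallel size
    ga = gbs // (dp * mbs)           # gradient accumulation steps per optimizer step
    return dp, ga

# Example: the GPT3-175B row of the DGX-H100 pre-training table below.
dp, ga = derived_sizes(num_gpus=512, gbs=2048, mbs=2, tp=4, pp=8, cp=1)
assert (dp, ga) == (16, 64)
```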

Pre-training#

The tables below show the pre-training performance of various models at FP8 precision. Specifically, we use per-tensor FP8 quantization with scaling factors calculated in the current step (for both pre-training and fine-tuning).
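
As a conceptual illustration of per-tensor current scaling (the actual quantization happens inside fused Transformer Engine kernels, not in Python), each tensor's scale is derived from its absolute maximum in the current step rather than from a history of previous steps:

```python
import torch

E4M3_MAX = 448.0  # largest representable magnitude in the FP8 E4M3 format


def fp8_current_scale_quant(x: torch.Tensor):
    """Per-tensor FP8 quantization with a scale computed from the current tensor.

    Conceptual sketch only; NeMo/Transformer Engine perform the equivalent
    operation inside fused GPU kernels.
    """
    amax = x.abs().max().clamp(min=1e-12)  # amax observed in the *current* step
    scale = E4M3_MAX / amax                # map the observed range onto the FP8 range
    x_fp8 = (x * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale                    # the scale is kept to rescale GEMM outputs
```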

  • System: DGX-GB200

| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 128 | 256 | 2 | 2048 | 0 | 4 | 4 | 1 | 12 | 1 | 16 | 1804 | 1942 |
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 0 | 1 | 1 | 1 | 1 | 1 | 8 | 33099 | 1917 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 4035 | 1943 |
| LLAMA3-405B | 128 | 64 | 1 | 8192 | 0 | 4 | 8 | 2 | 8 | 1 | 32 | 665 | 1763 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | 17809 | 1674 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 0 | 8 | 4 | 1 | 12 | 1 | 8 | 737 | 1551 |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 0 | 1 | 1 | 1 | 1 | 8 | 2 | 19275 | 1598 |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 0 | 2 | 4 | 8 | 14 | 8 | 16 | 2731 | 1008 |

  • System: DGX-B200

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 4 | 1 | 6 | 1 | 32 | 1523 | 1639 |
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 1 | 1 | 1 | 1 | 1 | 8 | 30411 | 1761 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 2 | 4 | 2 | 5 | 1 | 32 | 3562 | 1715 |
| LLAMA3-405B | 128 | 64 | 1 | 8192 | 4 | 8 | 2 | 8 | 1 | 32 | 651 | 1726 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 1 | 2 | 16222 | 1525 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 4 | 1 | 12 | 1 | 8 | 632 | 1330 |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 8 | 2 | 17617 | 1461 |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 2 | 4 | 8 | 14 | 8 | 16 | 2399 | 885 |

  • System: DGX-H100

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 1 | 64 | 824 | 887 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 32 | 13812 | 800 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 64 | 1621 | 780 |
| LLAMA3-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 64 | 315 | 834 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 2 | 1 | 1 | 1 | 1 | 4 | 7915 | 744 |
| Nemotron-340B | 256 | 64 | 1 | 4096 | 8 | 8 | 1 | 12 | 1 | 16 | 335 | 704 |
| Mixtral-8x7B | 64 | 256 | 1 | 4096 | 1 | 4 | 1 | 8 | 8 | 16 | 7992 | 663 |
| Mixtral-8x22B | 256 | 256 | 1 | 65536 | 4 | 4 | 8 | 14 | 8 | 32 | 1277 | 471 |

Fine-Tuning#

The tables below present the fine-tuning performance of Llama3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision (using NeMo 2.0).

For fine-tuning, we use the SQuAD-v1.1 dataset with sequence packing; the packed sequence length used for each run is listed in the Packed Sequence Length column of the tables below.
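
Sequence packing concatenates several short tokenized examples into one fixed-length training sample so that padding is minimized. A minimal first-fit sketch of the idea is shown below; NeMo's packed-sequence data preparation also emits position IDs and attention boundaries, which this sketch omits.

```python
from typing import List


def pack_sequences(example_lengths: List[int], packed_len: int) -> List[List[int]]:
    """Greedy first-fit packing of tokenized examples into fixed-length bins.

    Returns, for each packed sequence, the indices of the examples it contains.
    Conceptual sketch only; examples longer than packed_len are skipped here.
    """
    bins: List[List[int]] = []
    remaining: List[int] = []  # free space left in each bin
    for idx, length in enumerate(example_lengths):
        if length > packed_len:
            continue
        for b, space in enumerate(remaining):
            if length <= space:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:  # no existing bin fits this example; open a new one
            bins.append([idx])
            remaining.append(packed_len - length)
    return bins


# Example: pack four short examples into 4096-token sequences.
print(pack_sequences([1500, 3000, 900, 2500], packed_len=4096))
```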

  • System: DGX-GB200

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 33437 | 1515 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 8 | 3977 | 1658 |
| LLAMA3-8B | LoRA | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 44281 | 1342 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 1 | 4 | 20 | 16 | 3238 | 1804 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 4 | 4 | 16 | 966 | 1568 |

  • System: DGX-B200

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 30913 | 1401 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 8 | 3657 | 1525 |
| LLAMA3-8B | LoRA | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 42010 | 1274 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 1 | 4 | 20 | 16 | 5611 | 1563 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 4 | 4 | 16 | 678 | 1101 |

  • System: DGX-H100

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 4 | 16222 | 733 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 16 | 1619 | 675 |
| LLAMA3-8B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 4 | 22141 | 669 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 32 | 2621 | 730 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 8 | 8 | 32 | 496 | 805 |

Performance Numbers for Previous NeMo Container Releases#

Performance Summary Archive