Performance#

The NVIDIA NeMo Framework accelerates the entire AI workflow end to end, from data preparation to model training to inference. It delivers high training throughput for advanced generative AI models by incorporating the latest training techniques, such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework provides a path that leverages TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.

Performance Summary for Large Language Models#

Below are performance benchmarks for various large language models. These results were obtained using a version of the performance recipes available here.
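
The sketch below shows roughly how such a pretraining recipe is launched with the NeMo 2.0 recipe API and NeMo-Run. It is a minimal illustration only: the recipe arguments, the batch-size overrides, the output directory, and the local `torchrun` executor are assumptions chosen to mirror the LLAMA3-8B row further down, not the exact performance scripts used to produce these numbers.

```python
# Minimal sketch of launching a NeMo 2.0 pretraining recipe with NeMo-Run.
# Recipe arguments, config paths, and the executor are assumptions for illustration;
# the actual performance recipes behind the tables below may differ.
import nemo_run as run
from nemo.collections import llm

if __name__ == "__main__":
    # Pre-built LLAMA3-8B pretraining recipe (one 8-GPU node assumed).
    recipe = llm.llama3_8b.pretrain_recipe(
        dir="/checkpoints/llama3_8b",   # hypothetical output directory
        name="llama3_8b_pretrain",
        num_nodes=1,
        num_gpus_per_node=8,
    )

    # Table knobs map onto the recipe config, e.g. GBS/MBS on the data module
    # and TP/PP/CP/VP on the trainer's Megatron strategy.
    recipe.data.global_batch_size = 128
    recipe.data.micro_batch_size = 2
    recipe.trainer.strategy.tensor_model_parallel_size = 1

    # Run locally with torchrun; Slurm and other executors can be swapped in.
    run.run(recipe, executor=run.LocalExecutor(launcher="torchrun", ntasks_per_node=8))
```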

  • Abbreviations:

    • MBS: Micro Batch Size

    • GBS: Global Batch Size

    • TP: Tensor Parallel Size

    • PP: Pipeline Parallel Size

    • CP: Context Parallel Size

    • VP: Virtual Pipeline Parallel Size

    • EP: Expert Parallel Size

    • GA: Number of Gradient Accumulations (the sketch after this list shows how GA relates to GBS, MBS, and the parallel sizes)
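
These quantities are tied together by simple bookkeeping: the data-parallel size is the GPU count divided by TP × PP × CP (EP is assumed here to reshard only the expert layers, leaving the data-parallel size unchanged), and GBS = MBS × GA × data-parallel size. The helper below is a small sketch of that relationship, checked against the GPT3-175B pretraining row; the function name is made up for illustration.

```python
# Sanity-check the batch/parallelism bookkeeping used in the tables below:
#   data_parallel = num_gpus / (TP * PP * CP)   (EP assumed not to reduce it)
#   GBS           = MBS * GA * data_parallel
def gradient_accumulations(num_gpus, gbs, mbs, tp, pp, cp):
    dp = num_gpus // (tp * pp * cp)              # data-parallel replicas
    assert gbs % (mbs * dp) == 0, "GBS must be divisible by MBS * DP"
    return gbs // (mbs * dp)

# GPT3-175B pretraining row below: 512 GPUs, GBS 2048, MBS 2, TP 4, PP 4, CP 1
print(gradient_accumulations(num_gpus=512, gbs=2048, mbs=2, tp=4, pp=4, cp=1))  # -> 32 (GA)
```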

Pretraining#

The tables below show the pre-training performance of various models at FP8 precision (using NeMo 2.0).

  • System: DGX-B200

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 4 | 1 | 6 | 1 | 32 | 1600 | 1722 |
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 1 | 1 | 1 | 1 | 1 | 8 | 26006 | 1506 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 2 | 4 | 2 | 5 | 1 | 32 | 3062 | 1474 |
| LLAMA3-405B | 128 | 64 | 1 | 8192 | 4 | 8 | 2 | 8 | 1 | 32 | 625 | 1658 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 1 | 2 | 14760 | 1387 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 4 | 1 | 12 | 1 | 8 | 602 | 1268 |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 8 | 2 | 15457 | 1282 |
| Mixtral-8x22B | 256 | 1 | 64 | 65536 | 2 | 4 | 8 | 14 | 8 | 16 | 2232 | 824 |

  • System: DGX-H100

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 1 | 64 | 866 | 932 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 32 | 14201 | 822 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 64 | 1662 | 800 |
| LLAMA3-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 64 | 327 | 866 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 2 | 1 | 1 | 1 | 1 | 4 | 8233 | 774 |
| Nemotron-340B | 256 | 64 | 1 | 4096 | 8 | 8 | 1 | 12 | 1 | 16 | 346 | 728 |
| Mixtral-8x7B | 64 | 256 | 1 | 4096 | 1 | 4 | 1 | 8 | 8 | 16 | 8233 | 683 |
| Mixtral-8x22B | 256 | 1 | 256 | 65536 | 4 | 4 | 8 | 14 | 8 | 32 | 1278 | 471 |
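
The throughput columns translate directly into rough time-to-train estimates: multiply Tokens / sec / GPU by the GPU count to get cluster throughput, then divide the token budget by it. The snippet below reuses the LLAMA3-70B DGX-B200 row above; the 1-trillion-token budget is only an illustrative assumption.

```python
# Rough time-to-train estimate from the per-GPU token throughput in the tables above.
# The 1T-token budget is an illustrative assumption, not part of the benchmarks.
def days_to_train(tokens_per_sec_per_gpu, num_gpus, token_budget):
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return token_budget / cluster_tokens_per_sec / 86_400   # seconds per day

# LLAMA3-70B pretraining on 64 GPUs (DGX-B200 table): 3062 tokens/sec/GPU
print(f"{days_to_train(3062, 64, 1e12):.0f} days")   # ~59 days for 1T tokens
```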

Fine-Tuning#

The tables below present the fine-tuning performance of LLaMA3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision (using NeMo 2.0).

For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to 4096 tokens.
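
As with pretraining, these runs correspond to NeMo 2.0 fine-tuning recipes. The sketch below shows roughly how a comparable LoRA run is launched; the `peft_scheme` argument, the batch-size overrides, and the output directory are assumptions based on the public recipe interface and may not match the exact performance scripts.

```python
# Minimal sketch of a LoRA fine-tuning run with the NeMo 2.0 recipe API.
# Argument names and defaults are assumptions for illustration; the exact
# performance scripts and packed-sequence settings may differ.
import nemo_run as run
from nemo.collections import llm

if __name__ == "__main__":
    recipe = llm.llama3_8b.finetune_recipe(
        dir="/checkpoints/llama3_8b_lora",   # hypothetical output directory
        name="llama3_8b_lora",
        num_nodes=1,
        num_gpus_per_node=8,
        peft_scheme="lora",                  # "none" selects full-parameter SFT instead
    )

    # Batch settings mirroring the LLAMA3-8B LoRA row below (GBS 8, MBS 1).
    recipe.data.global_batch_size = 8
    recipe.data.micro_batch_size = 1

    run.run(recipe, executor=run.LocalExecutor(launcher="torchrun", ntasks_per_node=8))
```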

  • System: DGX-B200

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 31508 | 1428 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 8 | 3357 | 1400 |
| LLAMA3-8B | LoRA | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 43116 | 1307 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 1 | 4 | 20 | 16 | 5669 | 1579 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 4 | 4 | 16 | 759 | 1231 |

  • System: DGX-H100

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 4 | 17246 | 779 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 16 | 1789 | 746 |
| LLAMA3-8B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 4 | 23406 | 707 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 32 | 2768 | 771 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 8 | 8 | 32 | 521 | 846 |
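
The Model TFLOP / sec / GPU column can also be read as an achieved model-FLOP throughput: dividing it by the GPU's peak dense FP8 throughput gives a rough model FLOP utilization (MFU). The peak figure below (about 1979 TFLOP/s dense FP8 for H100 SXM) is taken from the public datasheet and is an assumption of this example, not part of the benchmark tables.

```python
# Rough model-FLOP utilization (MFU) from the Model TFLOP / sec / GPU column.
# Assumes an H100 SXM peak of ~1979 dense FP8 TFLOP/s (datasheet figure).
H100_FP8_DENSE_PEAK_TFLOPS = 1979

def mfu(model_tflops_per_gpu, peak_tflops=H100_FP8_DENSE_PEAK_TFLOPS):
    return model_tflops_per_gpu / peak_tflops

# LLAMA3-70B SFT on DGX-H100 sustains 746 model TFLOP/s per GPU (table above).
print(f"{mfu(746):.0%}")   # ~38%
```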

Performance Numbers for Previous NeMo Container Releases#

  • 24.12 NeMo container