Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Performance

The NVIDIA NeMo Framework accelerates the entire AI workflow end to end, from data preparation to model training to inference. It achieves high training throughput for advanced generative AI models by incorporating the latest training techniques, such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework leverages TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
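
The parallelism dimensions referenced throughout the benchmark tables below (TP, PP, CP, VP) are configured in NeMo 2.0 through its Megatron strategy. As a minimal sketch, assuming the keyword names below (which follow Megatron-LM conventions) match your installed NeMo release:

```python
# Hedged sketch: configuring the model-parallel layout reported in the
# tables below. Argument names follow Megatron-LM conventions and should
# be checked against your installed NeMo version.
from nemo import lightning as nl

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=4,            # TP: shard each layer across 4 GPUs
    pipeline_model_parallel_size=8,          # PP: split the layer stack into 8 stages
    context_parallel_size=1,                 # CP: shard the sequence dimension
    virtual_pipeline_model_parallel_size=6,  # VP: interleaved pipeline schedule
)
```

This particular layout corresponds to the GPT3-175B rows in the pretraining table below (TP=4, PP=8, CP=1, VP=6).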

The following sections present performance benchmarks for various large language models.

Performance Summary for Large Language Models

Pretraining

The table below shows the pretraining performance of various models at FP8 precision. GBS and MBS are the global and micro batch sizes; TP, PP, CP, and VP are the tensor, pipeline, context, and virtual pipeline parallel sizes, respectively.

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-5B | 64 | 2048 | 4 | 2048 | 1 | 1 | 1 | 1 | 23406 | 765 | 5 |
| GPT3-20B | 64 | 256 | 2 | 2048 | 2 | 1 | 1 | 1 | 5851 | 750 | 19 |
| GPT3-175B | 128 | 256 | 1 | 2048 | 4 | 8 | 1 | 6 | 716 | 771 | 158 |
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 825 | 888 | 137 |
| LLAMA2-7B | 8 | 128 | 1 | 4096 | 1 | 1 | 1 | 1 | 16934 | 780 | 7 |
| LLAMA2-13B | 16 | 128 | 1 | 4096 | 1 | 4 | 1 | 10 | 8715 | 760 | 13 |
| LLAMA2-70B | 64 | 128 | 1 | 4096 | 4 | 4 | 1 | 20 | 1728 | 768 | 65 |
| Nemotron-8B | 64 | 256 | 4 | 4096 | 2 | 1 | 1 | 1 | 12507 | 643 | 9 |
| Nemotron-22B | 64 | 256 | 2 | 4096 | 2 | 4 | 1 | 10 | 4312 | 562 | 26 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 8 | 1 | 12 | 326 | 686 | 347 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 12273 | 711 | 9 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 4 | 2 | 5 | 1524 | 734 | 74 |
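
The rightmost column follows from per-GPU throughput: total tokens divided by aggregate tokens per second, converted to days. Below is a minimal sketch of that arithmetic in plain Python (an illustration, not a NeMo API); it reproduces the GPT3-5B entry exactly and lands within a few percent of the other rows, which presumably were derived from unrounded throughput figures.

```python
# Rough estimate of days to train on 10T tokens with 1,000 GPUs, given the
# "Tokens / sec / GPU" value from the table above. The formula is an
# assumption inferred from the column heading, not taken from NeMo itself.
def days_to_train(tokens_per_sec_per_gpu: float,
                  num_gpus: int = 1000,
                  total_tokens: float = 10e12) -> float:
    seconds = total_tokens / (tokens_per_sec_per_gpu * num_gpus)
    return seconds / 86400  # 86,400 seconds per day

print(round(days_to_train(23406), 1))  # ~4.9 days, matching the GPT3-5B row (5)
```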

Fine-Tuning

The table below presents the fine-tuning performance of LLaMA2 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision.

For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to 4096 tokens.

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA2-7B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 16891 | 673 | 1.2 |
| LLAMA2-13B | SFT | 8 | 32 | 1 | 4096 | 1 | 4 | 10176 | 787 | 2.0 |
| LLAMA2-70B | SFT | 16 | 32 | 1 | 4096 | 4 | 4 | 1816 | 749 | 5.7 |
| LLAMA2-7B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 24824 | 663 | 0.8 |
| LLAMA2-13B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 14629 | 757 | 1.4 |
| LLAMA2-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 2621 | 722 | 7.9 |
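
The completion-time column can be reproduced the same way, scaled by each row's GPU count rather than a fixed 1K GPUs. A minimal sketch in plain Python (an illustration of the arithmetic, not a NeMo API):

```python
# Estimate minutes to fine-tune on 10M tokens, given per-GPU throughput
# and the GPU count from a row of the table above. The formula is an
# assumption inferred from the column heading; it matches the published
# values to within rounding.
def minutes_to_finetune(tokens_per_sec_per_gpu: float,
                        num_gpus: int,
                        total_tokens: float = 10e6) -> float:
    return total_tokens / (tokens_per_sec_per_gpu * num_gpus * 60)

print(round(minutes_to_finetune(16891, 8), 1))  # 1.2 -- LLAMA2-7B SFT row
print(round(minutes_to_finetune(2621, 8), 1))   # 7.9 -- LLAMA2-70B LoRA row
```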