Performance#

Training Performance Results#

We measured the throughput of training VideoNeVA models on different numbers of DGX H100 nodes and achieved near-linear scaling.

The following table and chart show the pretraining performance results on an NVIDIA DGX SuperPOD (up to 16 nodes × 8 H100 80GB GPUs) for VideoNeVA Llama2 Chat 13B model pretraining.

| VideoNeVA Llama2 Chat 13B         | 1 node | 2 nodes | 4 nodes | 8 nodes | 16 nodes |
|-----------------------------------|--------|---------|---------|---------|----------|
| Samples per Second                | 53     | 106     | 211     | 424     | 822      |
| Perfect Linear Scaling (Samples)  | 54     | 107     | 214     | 428     | 857      |
| Speedup                           | 1x     | 1.99x   | 3.94x   | 7.93x   | 15.36x   |

Figure: VideoNeVA Llama2 Chat 13B NeMo Throughput (H100).
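
For reference, the speedup and perfect-linear-scaling rows above follow directly from the measured throughput. The minimal sketch below recomputes them from the rounded table values; it is illustrative arithmetic only, not a NeMo API. Note that the published figures were derived from unrounded measurements, so values recomputed from the rounded entries can differ slightly in the last digit (e.g., 822/53 ≈ 15.51x vs. the published 15.36x).

```python
# Recompute scaling metrics from the measured throughput in the table above.
# Inputs are the rounded published values; this is a sketch, not a NeMo API.

node_counts = [1, 2, 4, 8, 16]
samples_per_sec = [53, 106, 211, 424, 822]  # measured, from the table

baseline = samples_per_sec[0]  # 1-node throughput
for nodes, throughput in zip(node_counts, samples_per_sec):
    perfect = baseline * nodes        # perfect linear scaling target
    speedup = throughput / baseline   # speedup over a single node
    efficiency = throughput / perfect # fraction of perfect scaling achieved
    print(f"{nodes:>2} nodes: {throughput} samples/s, "
          f"speedup {speedup:.2f}x, efficiency {efficiency:.1%}")
```

At 16 nodes this yields roughly 97% scaling efficiency (822 of a perfect 848 samples/s from the rounded baseline), which is the "near-linear scaling" claim quantified.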