BERT Results

Training Accuracy Results

Training accuracy: NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for 4B BERT model)

Training the 4B BERT model on 95 billion tokens takes 1.5 days. The figure below shows the loss curve.

Figure (../../_images/4b_bert_loss_final.png): 4B BERT Training Loss (220B Tokens)

The table below shows the converged training loss, the throughput, and the total time to train for the 4B BERT model, using a given number of GPUs and a given Global Batch Size (GBS).
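As a back-of-the-envelope check on the quoted 1.5-day training time, the sketch below divides the token budget by the 16-node throughput reported in the performance table in the next section. This assumes throughput stays constant over the full run, which is a simplification:

```python
# Sanity check of the quoted 1.5-day time to train.
# Assumes constant throughput at the 16-node rate from the
# performance table below (728,178 tokens/s).
TOKENS = 95e9              # 95 billion training tokens
TOKENS_PER_SEC = 728_178   # 16-node measured throughput

seconds = TOKENS / TOKENS_PER_SEC
days = seconds / 86_400
print(f"{days:.2f} days")  # ~1.51 days, consistent with the quoted 1.5 days
```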

Training Performance Results

Training performance: NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for 4B BERT model)

NVIDIA measured the throughput of training a 4B parameter BERT model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 16 nodes yielded a 12.71× speed-up. The table and chart below show the performance results.

| Model | Metric | 1 node | 2 nodes | 4 nodes | 8 nodes | 16 nodes |
|---|---|---|---|---|---|---|
| 4B | Tokens per Second | 57287 | 108695 | 215358 | 393167 | 728178 |
| | Perfect Linear Scaling (Tokens) | 57287 | 114574 | 229148 | 458296 | 916592 |
| | Speed-up | 1x | 1.89x | 3.75x | 6.86x | 12.71x |
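For reference, a minimal Python sketch of how the "Perfect Linear Scaling" and "Speed-up" rows are derived from the measured throughput. The throughput values are copied from the table above; the efficiency column is an added derived quantity, not part of the original table:

```python
# Derive the scaling rows from the measured throughputs above.
nodes = [1, 2, 4, 8, 16]
tokens_per_sec = [57_287, 108_695, 215_358, 393_167, 728_178]

base = tokens_per_sec[0]  # single-node throughput
for n, tps in zip(nodes, tokens_per_sec):
    perfect = base * n          # perfect linear scaling: base rate x node count
    speedup = tps / base        # measured speed-up relative to 1 node
    efficiency = tps / perfect  # fraction of ideal scaling achieved
    print(f"{n:>2} nodes: {tps:>7} tok/s, ideal {perfect:>7}, "
          f"{speedup:.2f}x speed-up, {efficiency:.0%} efficiency")
```

At 16 nodes this gives a 12.71x speed-up against an ideal 16x, i.e. roughly 79% scaling efficiency.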

Figure (../../_images/4B_bert_throughput_2211.png): 4B BERT NeMo Framework Throughput