BERT Results
Training Accuracy Results
Training accuracy: NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for 4B BERT model)
Training the 4B BERT model on 95 billion tokens takes 1.5 days. The figure below shows the loss curve.
![../../_images/4b_bert_loss_final.png](../../_images/4b_bert_loss_final.png)
4B BERT Training Loss (220B Tokens)
The table below shows the converged training loss, the throughput, and the total time to train for the 4B BERT model, using a given number of GPUs and a given Global Batch Size (GBS).
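As a sanity check on the quoted time to train, dividing the token budget by the measured 16-node throughput (728,178 tokens/s, from the performance table below) reproduces the ~1.5 days stated above. A minimal sketch:

```python
# Rough time-to-train estimate: token budget / measured throughput.
tokens = 95e9                # training budget quoted in the text
tokens_per_sec = 728_178     # measured 16-node throughput (performance table)

days = tokens / tokens_per_sec / 86_400
print(f"~{days:.2f} days")   # close to the ~1.5 days quoted above
```

The same arithmetic lets you estimate time to train for a different token budget or node count by substituting the corresponding throughput.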
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 4B BERT model)
NVIDIA measured the throughput of training a 4B parameter BERT model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 16 nodes yielded a 12.71× speed-up. The table and chart below show the performance results.
| | Nodes | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|---|
| 4B | Tokens per Second | 57287 | 108695 | 215358 | 393167 | 728178 |
| | Perfect Linear Scaling (Tokens) | 57287 | 114574 | 229148 | 458296 | 916592 |
| | Speed-up | 1x | 1.89x | 3.75x | 6.86x | 12.71x |
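The speed-up and perfect-linear-scaling rows can be derived from the measured throughput row alone; a short sketch that also reports scaling efficiency (measured speed-up divided by node count):

```python
# Derive speed-up and scaling efficiency from the measured throughput row.
nodes = [1, 2, 4, 8, 16]
tokens_per_sec = [57287, 108695, 215358, 393167, 728178]

base = tokens_per_sec[0]
for n, tps in zip(nodes, tokens_per_sec):
    speedup = tps / base          # relative to the 1-node baseline
    efficiency = speedup / n      # 1.0 would be perfect linear scaling
    linear = base * n             # "Perfect Linear Scaling (Tokens)" row
    print(f"{n:>2} nodes: {speedup:5.2f}x speed-up, "
          f"{efficiency:.0%} efficiency, linear target {linear}")
```

At 16 nodes this yields the 12.71× speed-up reported above, corresponding to roughly 79% scaling efficiency against the perfect-linear target.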
![../../_images/4B_bert_throughput_2211.png](../../_images/4B_bert_throughput_2211.png)
4B BERT NeMo Framework Throughput