BERT Results

Training accuracy: NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for 4b BERT model)

Training the 4B BERT model for 95 Billion takes 1.5 days. The figure below shows the loss curve.

4b_bert_loss_final.png

4B BERT Training Loss (220B Tokens)

The table below shows the converged training loss, the throughput, and the total time to train for the 4B BERT model, using a given number of GPUs and a given Global Batch Size (GBS).

Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 4B BERT model)

NVIDIA measured the throughput of training a 4B parameter BERT model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 16 nodes yielded a 12.71× speed-up. The table and chart below show the performance results.

Nodes

1 2 4 8 16

Tokens per Second 57287 108695 215358 393167 728178
4B Perfect Linear Scaling (Tokens) 57287 114574 229148 458296 916592

Speed-up 1x 1.89x 3.75x 6.86x 12.71x
4B_bert_throughput_2211.png

4B BERT NeMo Framework Throughput

Previous Training with Predefined Configurations
Next Multimodal Models
© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.