T5 Results

Training Accuracy Results

You can also prompt-learn on top of any trained .nemo checkpoint file on the SQuAD task, as described in the T5 and mT5 Prompt Learning section. The results are shown in the table below.

Task     Metric        220M     3B
SQuAD    Exact Match   74.20    78.52
SQuAD    F1            84.54    87.17
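
The Exact Match and F1 values above follow the standard SQuAD evaluation convention: predictions and references are normalized (lowercased, punctuation and English articles removed), Exact Match counts identical normalized strings, and F1 measures token overlap between them. The snippet below is a minimal, self-contained sketch of that computation for illustration; it is not the evaluation script used to produce the numbers above.

    import re
    import string
    from collections import Counter

    def normalize(text: str) -> str:
        """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD convention)."""
        text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction: str, reference: str) -> float:
        """1.0 if the normalized strings are identical, else 0.0."""
        return float(normalize(prediction) == normalize(reference))

    def f1(prediction: str, reference: str) -> float:
        """Token-level F1 overlap between the normalized prediction and reference."""
        pred_tokens = normalize(prediction).split()
        ref_tokens = normalize(reference).split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    # Averaging the per-example scores over the dev set gives the percentages above.
    print(exact_match("The Eiffel Tower", "eiffel tower"))   # 1.0
    print(f1("the tall Eiffel Tower", "eiffel tower"))       # 0.8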

Training the 220M T5 model to convergence takes 4 days, and the loss curve is shown in the figure below:

Figure: 220M T5 Training Loss

The table below shows the converged training loss, the throughput, and the total time to train for the 220M T5 model, using a given number of GPUs and a given Global Batch Size (GBS).

Number of GPUs   GBS    Seq Length   Number of Tokens   Loss    Throughput (Tokens/sec)   Time to Train (days)
32               2048   512          1T                 1.501   3,273,728                 4
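
As a rough cross-check, the time to train follows from the token budget and the sustained throughput: days ≈ tokens / (throughput × 86,400). The sketch below applies this to the 220M row above; it is a back-of-the-envelope illustration, while the table reports the measured figure.

    # Back-of-the-envelope training-time estimate from the 220M T5 row above.
    tokens = 1e12                  # 1T tokens
    throughput = 3_273_728         # aggregate tokens/sec on 32 GPUs
    seconds_per_day = 86_400

    days = tokens / (throughput * seconds_per_day)
    print(f"{days:.2f} days")      # ~3.5 days, in line with the ~4 days reported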

Training the 3B T5 model to convergence takes 11 days, and the loss curve of a fully trained model can be seen in the figure below:

Figure: 3B T5 Training Loss

The table below shows the converged training loss, the throughput, and the total time to train for the 3B T5 model, using a given number of GPUs and a given Global Batch Size (GBS).

Number of GPUs   GBS    Seq Length   Number of Tokens   Loss    Throughput (Tokens/sec)   Time to Train (days)
160              2160   512          1T                 1.147   1,395,131                 11

Training Performance Results

Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B T5 Model)

NVIDIA measured the throughput of training a 3B parameter T5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.68× speed-up.

NVIDIA is actively working on improving the scaling performance for T5 models. The table and chart below show the performance results.

Nodes                             1         2         4         5         10        20
Tokens per Second (3B T5)         110769    215579    417644    515100    957506    1626353
Perfect Linear Scaling (Tokens)   110769    221538    443077    553846    1107692   2215385
Speed-up                          1x        1.95x     3.77x     4.65x     8.64x     14.68x
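
The Perfect Linear Scaling and Speed-up rows follow from the single-node measurement, as in the sketch below. It is a minimal illustration using the numbers from the table above (small differences from the perfect-scaling values in the table come from rounding of the single-node baseline), not the benchmarking code itself.

    # Derive the scaling rows above from the measured tokens/sec at each node count.
    nodes = [1, 2, 4, 5, 10, 20]
    measured = [110769, 215579, 417644, 515100, 957506, 1626353]  # tokens/sec

    baseline = measured[0]              # single-node throughput
    for n, tps in zip(nodes, measured):
        perfect = baseline * n          # perfect linear scaling
        speedup = tps / baseline        # measured speed-up vs. 1 node
        efficiency = tps / perfect      # scaling efficiency
        print(f"{n:>2} nodes: perfect={perfect:>9,}  speed-up={speedup:5.2f}x  efficiency={efficiency:5.1%}")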

Figure: 3B T5 NeMo Framework Throughput

Inference Performance

Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB). The results are shown in the table below.

Inference configurations:

  • Batch size: 1

  • Input tokens length: 60

  • Output tokens length: 20

Figure: Average Latency vs T5 Model Size

T5 Model Size   Average Latency [ms]   TP   PP   GPUs
3B              94                     2    1    2
11B             123                    4    1    4
23B             213                    4    1    4
41B             332                    8    1    8
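
For interpretation, each configuration uses GPUs = TP × PP, and with batch size 1 and 20 output tokens the average latency translates into a rough per-request generation rate. The sketch below is an illustrative calculation from the table values under those assumptions, not a benchmark script.

    # Rough per-request generation rate derived from the inference table above
    # (batch size 1, 60 input tokens, 20 output tokens).
    configs = {
        # model: (average latency in ms, tensor parallel, pipeline parallel)
        "3B":  (94, 2, 1),
        "11B": (123, 4, 1),
        "23B": (213, 4, 1),
        "41B": (332, 8, 1),
    }

    output_tokens = 20
    for model, (latency_ms, tp, pp) in configs.items():
        gpus = tp * pp                                        # GPUs per model replica
        tokens_per_sec = output_tokens / (latency_ms / 1000)  # generated tokens/sec
        print(f"{model}: {gpus} GPUs, ~{tokens_per_sec:.0f} tokens/sec per request")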