Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
T5 Results
Training Accuracy Results
You can also run prompt learning on the SQuAD task on top of any trained .nemo checkpoint file, as described in the section T5 and mT5 Prompt Learning. The results are shown in the table below.
| Task | Metric | 220M | 3B | 
|---|---|---|---|
| SQuAD | Exact Match | 74.20 | 78.52 | 
| SQuAD | F1 | 84.54 | 87.17 | 
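The Exact Match and F1 numbers above follow the standard SQuAD evaluation convention. As an illustration only (this is a minimal sketch, not the exact NeMo evaluation script), the two metrics are typically computed like this:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Standard SQuAD normalization: lowercase, drop punctuation,
    remove articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1(prediction, reference):
    """Token-level F1 between prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: one prediction scored against one reference answer.
print(exact_match("the Eiffel Tower", "Eiffel Tower"))               # 1.0 after normalization
print(round(f1("the Eiffel Tower in Paris", "Eiffel Tower"), 2))     # 0.67
```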
Training the 220M T5 model to convergence takes 4 days, and the loss curve is shown in the figure below:
220M T5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 220M T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) | 
|---|---|---|---|---|---|---|
| 32 | 2048 | 512 | 1T | 1.501 | 3,273,728 | 4 | 
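The token budget and sustained throughput determine the expected training time directly. A minimal sketch of that arithmetic, using the 220M values from the table above (the result is roughly consistent with the 4 days reported):

```python
def days_to_train(num_tokens, tokens_per_sec):
    """Estimate wall-clock training time from the total token budget
    and the sustained training throughput."""
    return num_tokens / tokens_per_sec / 86_400  # seconds per day

# 220M T5: 1T tokens at ~3.27M tokens/sec (values from the table above).
print(round(days_to_train(1e12, 3_273_728), 2))  # ~3.54 days, reported as 4 days
```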
Training the 3B T5 model to convergence takes 11 days, and the loss curve of a fully trained model can be seen in the figure below:
3B T5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 3B T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) | 
|---|---|---|---|---|---|---|
| 160 | 2160 | 512 | 1T | 1.147 | 1,395,131 | 11 | 
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B T5 Model)
NVIDIA measured the throughput of training a 3B parameter T5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.68× speed-up.
NVIDIA is actively working on improving the scaling performance for T5 models. The table and chart below show the performance results.
| 3B T5 | 1 node | 2 nodes | 4 nodes | 5 nodes | 10 nodes | 20 nodes |
|---|---|---|---|---|---|---|
| Tokens per Second | 110,769 | 215,579 | 417,644 | 515,100 | 957,506 | 1,626,353 |
| Perfect Linear Scaling (Tokens) | 110,769 | 221,538 | 443,077 | 553,846 | 1,107,692 | 2,215,385 |
| Speed-up | 1x | 1.95x | 3.77x | 4.65x | 8.64x | 14.68x |
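The speed-up and perfect-linear-scaling rows above can be derived from the measured throughput alone. A short sketch of that calculation, which also reports scaling efficiency (measured throughput as a fraction of the ideal linear value):

```python
# Measured training throughput in tokens/sec at each node count (from the table above).
measured = {1: 110_769, 2: 215_579, 4: 417_644, 5: 515_100, 10: 957_506, 20: 1_626_353}
baseline = measured[1]

for nodes, tps in measured.items():
    speedup = tps / baseline          # relative to a single node
    ideal = nodes * baseline          # perfect linear scaling
    efficiency = tps / ideal * 100    # percent of ideal
    print(f"{nodes:>2} nodes: {speedup:5.2f}x speed-up, {efficiency:5.1f}% scaling efficiency")
```

At 20 nodes this reproduces the 14.68× speed-up shown in the table, i.e. about 73% of perfect linear scaling.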
3B T5 NeMo Framework Throughput
Inference Performance
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB). The results are shown in the table below.
Inference configurations:
- Batch size: 1 
- Input tokens length: 60 
- Output tokens length: 20 
Average Latency vs T5 Model Size
| T5 Model size | Average latency [ms] | TP | PP | GPUs | 
|---|---|---|---|---|
| 3B | 94 | 2 | 1 | 2 | 
| 11B | 123 | 4 | 1 | 4 | 
| 23B | 213 | 4 | 1 | 4 | 
| 41B | 332 | 8 | 1 | 8 |
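For a rough per-token view of these latencies, the end-to-end number can be divided by the 20 generated tokens. This is only an approximation, since it folds the one-time encoder pass over the 60 input tokens into the per-token cost:

```python
# Average end-to-end latency in ms from the inference table above
# (batch size 1, 60 input tokens, 20 output tokens).
latency_ms = {"3B": 94, "11B": 123, "23B": 213, "41B": 332}
output_tokens = 20

for model, ms in latency_ms.items():
    per_token_ms = ms / output_tokens            # approximate time per generated token
    tokens_per_sec = output_tokens / (ms / 1000) # approximate generation rate
    print(f"{model}: {per_token_ms:.1f} ms/token, ~{tokens_per_sec:.0f} generated tokens/sec")
```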