Training accuracy: NVIDIA DGX SuperPOD (4 × 8 × A100 80GB for 170M mT5 model; 8 × 8 × A100 80GB for 390M mT5 model; 20 × 8 × A100 80GB for 3B mT5 model)
NVIDIA evaluated the mT5 models on the XQuAD task. The results are shown in the table below. You can fine-tune on top of any .nemo-trained checkpoint file on the XQuAD task as described in the mT5 Fine-Tuning section.
| Task-Language | Metric | 170M | 390M |
|---|---|---|---|
| XQuAD-de | Exact Match | 43.0 | 54.7 |
| XQuAD-en | Exact Match | 63.8 | 68.8 |
| XQuAD-es | Exact Match | 47.0 | 55.3 |
| XQuAD-hi | Exact Match | 34.5 | 47.1 |
| XQuAD-zh | Exact Match | 46.8 | 56.1 |
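Exact Match here is the standard SQuAD-style metric: a prediction scores only if it equals a reference answer after light normalization. A minimal sketch in Python; the normalization shown is the English-centric SQuAD recipe, and multilingual XQuAD evaluation adapts details such as article stripping per language:

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, strip punctuation,
    drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # English-only article stripping
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """True if prediction and reference agree after normalization."""
    return normalize(prediction) == normalize(reference)
```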
You can also prompt-learn on top of any .nemo-trained checkpoint file on the SQuAD task as described in the T5 and mT5 Prompt Learning section. The results are shown in the table below.
| Task | Metric | 390M | 3B |
|---|---|---|---|
| SQuAD | Exact Match | 76.86 | 81.55 |
| SQuAD | F1 | 84.67 | 89.34 |
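The F1 row is token-overlap F1 between prediction and reference, as in standard SQuAD evaluation. A minimal sketch; for brevity it only lowercases, where the full SQuAD recipe also strips punctuation and articles as in the Exact Match sketch above:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, as used in SQuAD evaluation."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```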
Training the 170M mT5 model to convergence takes 4 days. The figure below shows the loss curve.
170M mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 170M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| GPUs | GBS | Seq Length | Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 32 | 2048 | 512 | 1T | 1.980 | 4,112,062 | 4 |
Training the 390M mT5 model to convergence takes 4 days. The figure below shows the loss curve.
390M mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 390M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| GPUs | GBS | Seq Length | Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 64 | 2048 | 512 | 1T | 1.584 | 3,744,914 | 4 |
Training the 3B mT5 model to convergence takes 14 days. The figure below shows the loss curve of a fully trained model:
3B mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 3B mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| GPUs | GBS | Seq Length | Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 160 | 1920 | 512 | 1T | 1.134 | 911,065 | 14 |
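A quick sanity check relates the throughput and time-to-train columns: training time is bounded below by total tokens divided by tokens per second. A small Python sketch over the three tables above; the reported wall-clock times are somewhat higher, presumably covering overhead such as evaluation, checkpointing, and restarts:

```python
SECONDS_PER_DAY = 86_400
TOKENS = 1e12  # 1T training tokens, as in the tables above

# model size -> measured throughput (tokens/sec) from the tables above
runs = {"170M": 4_112_062, "390M": 3_744_914, "3B": 911_065}

for model, tokens_per_sec in runs.items():
    days = TOKENS / tokens_per_sec / SECONDS_PER_DAY
    print(f"{model}: ~{days:.1f} days of pure training compute")
```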
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B mT5 model)
NVIDIA measured the throughput of training a 3B parameter mT5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.3× speed-up.
NVIDIA is actively working on improving the scaling performance for mT5 models. The table and chart below show the performance results.
| Nodes | 1 | 2 | 4 | 5 | 10 | 20 |
|---|---|---|---|---|---|---|
| Tokens per Second | 91,166 | 179,583 | 346,263 | 429,088 | 798,570 | 1,303,767 |
| Perfect Linear Scaling (Tokens) | 91,166 | 182,331 | 364,663 | 455,829 | 911,657 | 1,823,314 |
| Speed-up | 1x | 1.97x | 3.8x | 4.71x | 8.76x | 14.3x |
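The Speed-up and Perfect Linear Scaling rows follow directly from the measured tokens per second; a short Python sketch reproduces them and adds scaling efficiency (measured speed-up divided by node count):

```python
nodes = [1, 2, 4, 5, 10, 20]
tokens_per_sec = [91_166, 179_583, 346_263, 429_088, 798_570, 1_303_767]

base = tokens_per_sec[0]  # single-node throughput
for n, tps in zip(nodes, tokens_per_sec):
    speedup = tps / base          # matches the Speed-up row
    linear = base * n             # matches the Perfect Linear Scaling row
    efficiency = speedup / n      # fraction of perfect linear scaling achieved
    print(f"{n:>2} nodes: {speedup:5.2f}x speed-up "
          f"(linear would be {linear:,} tokens/sec, {efficiency:5.1%} efficiency)")
```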
3B mT5 NeMo Framework Throughput
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).
Inference configurations:
Batch size: 1
Input tokens length: 60
Output tokens length: 20
Average Latency vs mT5 Model Size
| mT5 Model Size | Average Latency [ms] | TP | PP | GPUs |
|---|---|---|---|---|
| 390M | 35 | 1 | 1 | 1 |
| 3B | 102 | 2 | 1 | 2 |
| 11B | 134 | 4 | 1 | 4 |
| 23B | 230 | 4 | 1 | 4 |
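Because the batch size is 1 and the output length is fixed at 20 tokens, single-request generation throughput follows directly from the latencies above (TP and PP are the tensor- and pipeline-parallel degrees). A small Python sketch:

```python
OUTPUT_TOKENS = 20  # output token length from the inference configuration above

# (model size, average latency in ms, GPUs) from the table above
rows = [("390M", 35, 1), ("3B", 102, 2), ("11B", 134, 4), ("23B", 230, 4)]

for size, latency_ms, gpus in rows:
    tokens_per_sec = OUTPUT_TOKENS / (latency_ms / 1000)  # single-request throughput
    print(f"{size}: {tokens_per_sec:7.1f} generated tokens/sec on {gpus} GPU(s)")
```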