Released in 2020, Google’s mT5 is a multilingual language model based on the transformer encoder-decoder framework. Pretrained on a massive multilingual corpus covering over 100 languages, mT5 achieved state-of-the-art benchmark scores across multiple languages. Available in “small”, “base”, “large”, “xl”, and “xxl” variants, mT5 offers a range of model sizes to suit different needs. More information is available in the companion paper “mT5: A massively multilingual pre-trained text-to-text transformer”.
- Data Preparation
- Training
- Training with Predefined Configurations
- Checkpoint Conversion
- Model Evaluation
- PEFT Training and Inference
- Model Fine-Tuning
- MT5 Results
| Feature | T5/mT5 |
|---|---|
| Data parallelism | ✓ |
| Tensor parallelism | ✓ |
| Pipeline parallelism | ✓ |
| Interleaved Pipeline Parallelism Schedule | N/A |
| Sequence parallelism | ✗ |
| Selective activation checkpointing | ✗ |
| Gradient checkpointing | ✓ |
| Partial gradient checkpointing | ✓ |
| FP32/TF32 | ✓ |
| AMP/FP16 | ✗ |
| BF16 | ✓ |
| TransformerEngine/FP8 | ✗ |
| Multi-GPU | ✓ |
| Multi-Node | ✓ |
| Inference | N/A |
| Slurm | ✓ |
| Base Command Manager | ✓ |
| Base Command Platform | ✓ |
| Distributed data preprocessing | ✓ |
| NVfuser | ✗ |
| P-Tuning and Prompt Tuning | ✓ |
| IA3 and Adapter learning | ✓ |
| Distributed Optimizer | ✓ |
| Distributed Checkpoint | N/A |
| Fully Sharded Data Parallel | N/A |
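The first feature in the table, data parallelism, replicates the model on every device, shards each batch across the replicas, and averages the per-shard gradients (an all-reduce) so every replica applies an identical update. The pure-Python sketch below illustrates that idea on a toy linear model; it is a conceptual illustration only, not the framework's actual implementation (which uses distributed GPU communication), and all function names in it are hypothetical.

```python
# Conceptual sketch of data parallelism: each "device" computes the
# mean gradient on its shard of the batch, the gradients are averaged
# (standing in for an all-reduce), and every replica applies the same
# update. With equal-sized shards this matches the full-batch update.

def grad_linear(w, xs, ys):
    """Mean gradient of 0.5 * (w*x - y)^2 over one shard of data."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_step(w, xs, ys, n_devices, lr=0.1):
    shard = len(xs) // n_devices
    # One gradient per simulated device, each on its own shard.
    grads = [
        grad_linear(w, xs[i * shard:(i + 1) * shard],
                       ys[i * shard:(i + 1) * shard])
        for i in range(n_devices)
    ]
    g = sum(grads) / n_devices  # "all-reduce": average the gradients
    return w - lr * g           # identical update on every replica

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]       # data generated by w = 2

w_parallel = data_parallel_step(1.0, xs, ys, n_devices=2)
w_single = 1.0 - 0.1 * grad_linear(1.0, xs, ys)
print(w_parallel, w_single)     # the two updates coincide: 1.75 1.75
```

Tensor and pipeline parallelism, also listed above, instead split the model itself (within a layer, or across groups of layers) rather than the data.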