BERT
Released in 2018, Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only transformer network in which, as the name suggests, every self-attention layer attends over the full left and right context of each token. BERT has become a staple of the Natural Language Processing domain, and the ideas in the original paper have been adapted to many other domains and applications.
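As a minimal sketch of what "encoder-only and bidirectional" means in practice, the snippet below pulls contextual embeddings from a pretrained BERT checkpoint. It uses the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, which are assumptions for illustration; the framework described here is not tied to them.

```python
# Minimal sketch: bidirectional contextual embeddings from a pretrained
# BERT encoder (Hugging Face `transformers` assumed for illustration).
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes tokens bidirectionally.", return_tensors="pt")
outputs = model(**inputs)

# Each token's embedding is conditioned on both left AND right context,
# unlike a decoder-only (causal) transformer.
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```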
Feature | Status |
---|---|
Data parallelism | ✓ |
Tensor parallelism | ✓ |
Pipeline parallelism | ✓ |
Interleaved Pipeline Parallelism Schedule | N/A |
Sequence parallelism | ✓ |
Selective activation checkpointing | ✓ |
Gradient checkpointing | ✓ |
Partial gradient checkpointing | ✓ |
FP32/TF32 | ✓ |
AMP/FP16 | ✗ |
BF16 | ✓ |
TransformerEngine/FP8 | ✗ |
Multi-GPU | ✓ |
Multi-Node | ✓ |
Inference | N/A |
Slurm | ✓ |
Base Command Manager | ✓ |
Base Command Platform | ✓ |
Distributed data preprocessing | ✓ |
NVfuser | ✗ |
P-Tuning and Prompt Tuning | N/A |
IA3 and Adapter learning | N/A |
Distributed Optimizer | ✓ |
Distributed Checkpoint | ✓ |
Fully Sharded Data Parallel | N/A |
Torch Distributed Checkpoint | ✓ |
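Note that the precision rows mark BF16 as supported while AMP/FP16 is not. As a general illustration of BF16 mixed precision (not the framework's own configuration path), a plain PyTorch training step might look like the sketch below; the two-layer model and random data are hypothetical placeholders.

```python
# Minimal sketch of a BF16 mixed-precision training step in plain PyTorch.
# Assumes a CUDA device is available; model and data are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 2)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128, device="cuda")
y = torch.randint(0, 2, (32,), device="cuda")

# BF16 autocast runs matmuls in bfloat16 while keeping FP32 master weights;
# unlike FP16, no gradient scaler is needed because BF16 shares FP32's range.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```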