BERT

Released in 2018, Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only transformer network whose self-attention layers attend to context on both sides of each token, which is the bidirectionality its name refers to. BERT has become a staple of Natural Language Processing, and the ideas in the original paper have been adapted to many other domains and applications.
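To make the term "bidirectional" concrete, the following is a minimal sketch in plain Python, not taken from any BERT implementation. It contrasts encoder-style attention, where every token can attend to every other token, with the causal masking used by decoder-style models. The function name `attention_weights` and the toy score matrix are assumptions made purely for illustration.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over a square score matrix, optionally causally masked."""
    n = len(scores)
    weights = []
    for i in range(n):
        # A causal (decoder-style) mask hides positions j > i;
        # an encoder-only model leaves every position visible.
        visible = [j for j in range(n) if not causal or j <= i]
        row_max = max(scores[i][j] for j in visible)
        exps = {j: math.exp(scores[i][j] - row_max) for j in visible}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

# Hypothetical toy scores for a 3-token sequence; all zeros gives uniform attention.
scores = [[0.0] * 3 for _ in range(3)]
bidirectional = attention_weights(scores)        # encoder-style: sees left and right context
causal = attention_weights(scores, causal=True)  # decoder-style, shown only for contrast
```

In the bidirectional case the first token spreads its attention over all three positions, while under the causal mask it can attend only to itself.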

Feature                                      Status
-------------------------------------------  ------
Data parallelism
Tensor parallelism
Pipeline parallelism
Interleaved Pipeline Parallelism Schedule    N/A
Sequence parallelism
Selective activation checkpointing
Gradient checkpointing
Partial gradient checkpointing
FP32/TF32
AMP/FP16
BF16
TransformerEngine/FP8
Multi-GPU
Multi-Node
Inference                                    N/A
Slurm
Base Command Manager
Base Command Platform
Distributed data preprocessing
NVfuser
P-Tuning and Prompt Tuning                   N/A
IA3 and Adapter learning                     N/A
Distributed Optimizer
Distributed Checkpoint
Fully Sharded Data Parallel                  N/A
Torch Distributed Checkpoint