NeMo Megatron supports four types of parallelism (which can be mixed together arbitrarily):
Distributed Data Parallelism (DDP) creates identical copies of the model across multiple GPUs; each copy processes a different slice of the data, and gradients are synchronized across copies.
With Tensor Parallelism (TP), a tensor is split into non-overlapping shards, and each shard is processed on a separate GPU.
With Pipeline Parallelism (PP), consecutive chunks of the model's layers are assigned to different GPUs.
Expert Parallelism (EP) distributes the experts of a Mixture-of-Experts model across GPUs.
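To make the tensor-parallel idea concrete, here is a minimal NumPy sketch (not NeMo code; the two-shard "GPU" setup is hypothetical) of column-wise tensor parallelism: a weight matrix is split along its output dimension into non-overlapping shards, each shard computes a partial output independently, and concatenating the partial outputs reproduces the full matmul.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations: batch x hidden
w = rng.standard_normal((8, 16))   # full weight: hidden x output

# Split the weight into two non-overlapping column shards,
# one per (hypothetical) GPU.
w_shards = np.split(w, 2, axis=1)

# Each device computes its local partial output independently,
# with no communication needed during the matmul itself.
partial_outputs = [x @ shard for shard in w_shards]

# Gathering the partial outputs along the output dimension
# recovers the result of the unsharded matmul.
y_tp = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_tp, x @ w)
```

In a real TP implementation the gather step is a collective communication (e.g. all-gather) between GPUs; the sketch only illustrates the sharding arithmetic.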
When reading and modifying NeMo Megatron code, you will encounter the following terms.