Model Overview

You can use the NVIDIA NeMo™ Framework to train language models efficiently and scale them to billions of parameters. With the NeMo Framework, you can train variants of GPT-, BERT-, and T5-style models and scale them across multiple nodes on NVIDIA DGX SuperPOD™ deployments. This deep learning (DL) software stack is optimized for DGX SuperPOD configurations that use NVIDIA InfiniBand technology, providing efficient on-premises compute for training and inference on complex workloads.

The NeMo Framework supports multiple parallelism techniques to enable efficient training of very large models that do not fit in the memory of a single GPU. These techniques include tensor (intra-layer), pipeline (inter-layer), expert (across experts), context, and sequence parallelism. Tensor model parallelism partitions individual transformer layers over multiple devices. Pipeline model parallelism stripes the layers of a model over multiple devices. Expert parallelism (EP), specific to Mixture-of-Experts layers, distributes experts (usually feed-forward subnetworks) across multiple devices. Context parallelism partitions the input sequence into non-overlapping sub-sequences and distributes them across multiple devices for parallel processing. For more details, refer to the paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.
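As a rough illustration of how these degrees of parallelism interact, the sketch below divides a pool of GPUs among tensor, pipeline, and context parallel groups and assigns contiguous blocks of layers to pipeline stages. The function names are hypothetical and are not part of the NeMo API. The sketch assumes that the GPUs left over after model parallelism form the data-parallel replicas and that layers divide evenly across stages.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel, context_parallel=1):
    """Number of model replicas left after model parallelism (hypothetical helper)."""
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by the product of parallel sizes")
    return world_size // model_parallel


def pipeline_stage_layers(num_layers, pipeline_parallel):
    """Assign a contiguous block of layer indices to each pipeline stage.

    Assumption: non-interleaved schedule with an equal number of layers per stage.
    """
    if num_layers % pipeline_parallel != 0:
        raise ValueError("num_layers must be divisible by pipeline_parallel")
    per_stage = num_layers // pipeline_parallel
    return [list(range(stage * per_stage, (stage + 1) * per_stage))
            for stage in range(pipeline_parallel)]


if __name__ == "__main__":
    # 64 GPUs with tensor parallelism 2 and pipeline parallelism 4
    print(data_parallel_size(64, tensor_parallel=2, pipeline_parallel=4))
    # 24 layers striped over 4 pipeline stages
    for stage, layers in enumerate(pipeline_stage_layers(24, 4)):
        print(stage, layers)
```

In this sketch, raising the tensor or pipeline parallel size lowers the number of data-parallel replicas for a fixed number of GPUs, which is the basic trade-off when fitting a large model into memory.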

Two newer techniques, sequence parallelism and selective activation recomputation, make training up to ~30% faster for GPT models ranging from 20B to 1T parameters.

Sequence parallelism extends tensor-level model parallelism. It relies on the observation that the regions of a transformer layer that were not previously parallelized are independent along the sequence dimension. Splitting these regions along the sequence dimension distributes their compute and, most importantly, their activation memory across the tensor-parallel devices.
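The independence along the sequence dimension can be checked on a small example. The sketch below uses layer normalization as a representative operation, applied per token over the hidden dimension. It is a pure-Python illustration with hypothetical function names, not NeMo code, and it simulates the devices with plain lists.

```python
import math


def layer_norm(tokens, eps=1e-5):
    """Normalize each token vector over its hidden dimension (no learned scale or bias)."""
    out = []
    for vec in tokens:
        mean = sum(vec) / len(vec)
        var = sum((x - mean) ** 2 for x in vec) / len(vec)
        out.append([(x - mean) / math.sqrt(var + eps) for x in vec])
    return out


def split_sequence(tokens, tp_size):
    """Split the sequence dimension into tp_size equal, non-overlapping shards."""
    if len(tokens) % tp_size != 0:
        raise ValueError("sequence length must be divisible by tp_size")
    chunk = len(tokens) // tp_size
    return [tokens[i * chunk:(i + 1) * chunk] for i in range(tp_size)]


def sequence_parallel_layer_norm(tokens, tp_size):
    """Each simulated device normalizes only its own shard of the sequence."""
    shards = split_sequence(tokens, tp_size)
    per_device = [layer_norm(shard) for shard in shards]
    # Concatenating the shards stands in for an all-gather along the sequence dimension.
    return [row for shard in per_device for row in shard]


if __name__ == "__main__":
    seq = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0],
           [3.0, 3.5, -2.0, 1.0], [0.0, 1.0, 0.0, -1.0]]
    print(sequence_parallel_layer_norm(seq, tp_size=2) == layer_norm(seq))
```

Because each token is normalized independently, every simulated device stores activations for only its share of the sequence, which is where the activation memory saving comes from.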

Selective activation recomputation improves performance in cases where memory constraints force the model to recompute some but not all of the activations. For more details, refer to the paper Reducing Activation Recomputation in Large Transformer Models.
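To get a feel for the memory effect, the sketch below evaluates the per-layer activation memory estimates as recalled from the paper cited above, which assume 16-bit activations. Treat it as an approximation to verify against the paper rather than as a description of the NeMo implementation. The function name is hypothetical, and the micro-batch size of 1 used in the example is an assumption.

```python
def activation_bytes_per_layer(s, b, h, a, t=1,
                               sequence_parallel=False,
                               selective_recompute=False):
    """Approximate activation memory (bytes) for one transformer layer.

    s: sequence length, b: micro-batch size, h: hidden size,
    a: attention heads, t: tensor-parallel size.
    """
    sbh = s * b * h
    # Attention-score term (softmax, dropout); selective recomputation
    # recomputes this part instead of storing it.
    attention_term = 0 if selective_recompute else 5 * a * s / h
    if t == 1 or sequence_parallel:
        # Everything is divided across the t tensor-parallel devices.
        return sbh * (34 + attention_term) / t
    # Tensor parallelism alone leaves a 10*sbh portion replicated on every device.
    return sbh * (10 + 24 / t + attention_term / t)


if __name__ == "__main__":
    cfg = dict(s=2048, b=1, h=4096, a=32)  # b=1 is an assumed micro-batch size
    for label, kwargs in [
        ("no parallelism", {}),
        ("tensor parallel (t=2)", {"t": 2}),
        ("tensor + sequence parallel", {"t": 2, "sequence_parallel": True}),
        ("tensor + sequence + selective", {"t": 2, "sequence_parallel": True,
                                           "selective_recompute": True}),
    ]:
        mib = activation_bytes_per_layer(**cfg, **kwargs) / 2**20
        print(f"{label}: {mib:.0f} MiB per layer")
```

Under these estimates, the attention-score term grows with the square of the sequence length, so skipping its storage removes the largest part of the activation memory while recomputing only a small fraction of the layer's operations.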

Figure: GPT Architecture (ModelOverview/model_overview.png)

In the figure above, the 5B variant has 24 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Adam. This variant uses a tensor parallelism size of 2.
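These figures can be sanity-checked with the common approximation of about 12 × layers × hidden² parameters for a GPT-style transformer stack. The sketch below makes several assumptions that are not stated in this document: a feed-forward width of four times the hidden size, no biases or layer-norm parameters counted, and embeddings counted only if a vocabulary size is supplied. The function names are hypothetical.

```python
def approx_transformer_params(num_layers, hidden_size, vocab_size=0):
    """Rough parameter count for a GPT-style model.

    Assumptions: attention contributes 4*h^2 weights per layer (Q, K, V, output),
    the MLP contributes 8*h^2 (feed-forward width of 4*h), and biases and
    layer-norm parameters are ignored.
    """
    per_layer = 4 * hidden_size ** 2 + 8 * hidden_size ** 2
    return num_layers * per_layer + vocab_size * hidden_size


def heads_per_tensor_parallel_rank(num_heads, tensor_parallel):
    """Attention heads handled by each device when heads are split evenly."""
    if num_heads % tensor_parallel != 0:
        raise ValueError("num_heads must be divisible by tensor_parallel")
    return num_heads // tensor_parallel


if __name__ == "__main__":
    params = approx_transformer_params(num_layers=24, hidden_size=4096)
    print(f"transformer stack: {params / 1e9:.2f}B parameters")
    print("head dimension:", 4096 // 32)
    print("heads per device:", heads_per_tensor_parallel_rank(32, 2))
```

The transformer stack alone comes to roughly 4.8 billion parameters under these assumptions, and the embedding table accounts for most of the remainder of the nominal 5B.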