Mixed Precision Training#

Mixed precision training significantly enhances computational efficiency by performing operations in low-precision formats, while selectively keeping a minimal amount of data in single precision to preserve critical information in key areas of the network. NeMo Framework now supports FP16, BF16, and FP8 via Transformer Engine (TE) across most models.

Half-precision Training#

NeMo Framework supports half-precision (FP16 and BF16) training via Megatron Core and the distributed optimizer. This training recipe uses half precision for all layer computations while keeping the model states (optimizer states and master parameters) in single precision. To avoid repeated data-type casting at each layer computation, Megatron Core keeps a separate copy of the half-precision parameters that is updated after each optimizer step.

Half-precision training is enabled by setting precision to either fp16-mixed or bf16-mixed along with megatron_amp_O2=true. Parameter gradients are computed in the same half precision, and the precision of the gradient reduce-scatter across data-parallel GPUs can be set with optim.grad_sync_dtype.
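
As a minimal sketch, the fragment below shows how these settings might appear in a NeMo-style YAML config. The keys precision, megatron_amp_O2, and optim.grad_sync_dtype come from this section; the exact nesting (for example, whether precision is set on the trainer or the model) and the bf16 choice are assumptions that depend on your model config and NeMo version.

```yaml
# Minimal sketch; nesting and values are illustrative, not a recommended recipe.
trainer:
  precision: bf16-mixed      # or fp16-mixed for FP16 compute
model:
  megatron_amp_O2: true      # half-precision layer compute with single-precision model states
  optim:
    grad_sync_dtype: bf16    # precision of gradient reduce-scatter across data-parallel GPUs
```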

FP8 Training#

The NVIDIA H100 GPU introduced support for a new datatype, FP8 (8-bit floating point), enabling higher throughput for matrix multiplications and convolutions. NeMo Framework uses NVIDIA Transformer Engine (TE) to leverage FP8 speedups. The following table summarizes the FP8-related arguments that can be configured in NeMo (see the example config setting); a combined config sketch follows the table. For a more detailed overview, refer to the TE documentation, specifically the FP8 format and recipe.

FP8 arguments#

| Argument | Description |
|---|---|
| transformer_engine | Enables TE and related functionality when this boolean argument is set to True. If it is not set to True, all subsequent arguments are ignored. |
| fp8 | Enables FP8 training. For transformer networks, the QKV, projection, FC1, and FC2 matrix multiplications are executed using the fourth-generation NVIDIA H100 Tensor Cores with FP8 support. |
| fp8_e4m3 | Training recipe format for FP8. Activation, weight, and gradient tensors use the E4M3 format. |
| fp8_hybrid | Training recipe format for FP8. Activation and weight tensors use the E4M3 format, whereas gradient tensors use the E5M2 format to satisfy the additional dynamic range requirements of backward tensors. This is the default setting. |
| fp8_margin | Shifts the scaling factor for FP8 tensors by a factor of $2^{\text{margin}}$. |
| fp8_amax_history_len | Window size for the amax history. The window size determines how many of the most recent absolute maximum (amax) values are stored per tensor. |
| fp8_amax_compute_algo | The choice between "max" and "most_recent" specifies how an amax value is selected from the given history. |
| reduce_amax | Indicates whether to perform an all-reduce on the amax (absolute maximum) values of the FP8 tensors. Since the amax is used directly to compute the scaling factor for FP8 tensors, setting this argument ensures that the scaling factors for a tensor remain synchronized across devices in multi-GPU training configurations. |
| fp8_params | Indicates whether to store module-level parameters in FP8. Enabling this option can reduce memory consumption by eliminating the need to store a higher-precision copy of the weights when those weights are maintained externally, such as the master parameters in the optimizer. For more information, refer to the fp8_model_init API in TE. |
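
As an illustration, the fragment below collects the arguments from the table into a single NeMo-style YAML sketch. The argument names are taken from the table above; the nesting under model and the specific values (margin, history length, and so on) are assumptions rather than recommended settings.

```yaml
# Illustrative sketch; argument names from the table above, values are placeholders.
model:
  transformer_engine: true       # required; the FP8 arguments below are ignored otherwise
  fp8: true                      # run QKV, projection, FC1, and FC2 GEMMs in FP8
  fp8_e4m3: false                # E4M3 for activations, weights, and gradients
  fp8_hybrid: true               # default recipe: E4M3 forward, E5M2 gradients
  fp8_margin: 0                  # shift scaling factors by 2^margin
  fp8_amax_history_len: 1024     # number of recent amax values kept per tensor
  fp8_amax_compute_algo: max     # or most_recent
  reduce_amax: true              # all-reduce amaxes so scaling factors stay synchronized
  fp8_params: false              # store module parameters in FP8 (see TE fp8_model_init)
```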

Resources#