Mixed Precision Training

Mixed precision training significantly improves computational efficiency by performing operations in a low-precision format while selectively keeping a small amount of data in single precision to preserve numerically critical information in key parts of the network. NeMo supports FP16, BF16, and FP8 (via Transformer Engine) across most models. The sections below describe the half-precision and FP8 training recipes.

Half-precision Training

NeMo supports half-precision (FP16 and BF16) training via Megatron Core and the distributed optimizer. This recipe performs all layer computation in half precision while keeping the model states (optimizer states and master parameters) in single precision. To avoid repeatedly casting data types at every layer, Megatron Core keeps a separate copy of the half-precision parameters that is updated after each optimizer.step.
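
The pattern is sketched below in plain PyTorch for a single weight matrix. This is only an illustration of the recipe, not NeMo or Megatron Core code, and it omits details such as FP16 loss scaling and the distributed gradient reduce-scatter:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Single-precision "model state": the FP32 master weight plus the FP32
    # optimizer states that AdamW keeps for it.
    master_w = torch.randn(1024, 1024, device=device, requires_grad=True)
    opt = torch.optim.AdamW([master_w], lr=1e-4)

    # Persistent half-precision working copy used for layer computation,
    # refreshed once per step instead of casting at every layer.
    work_w = master_w.detach().to(torch.bfloat16).requires_grad_()

    for step in range(3):
        x = torch.randn(32, 1024, device=device, dtype=torch.bfloat16)
        y = x @ work_w                        # BF16 forward
        loss = y.float().pow(2).mean()        # accumulate the loss in FP32
        loss.backward()                       # BF16 backward; grad lands on work_w
        master_w.grad = work_w.grad.float()   # hand the gradient to the FP32 master
        opt.step()                            # FP32 update with FP32 optimizer states
        opt.zero_grad(set_to_none=True)
        # Refresh the BF16 copy from the updated master (the "updated after
        # each optimizer.step" behavior described above).
        work_w = master_w.detach().to(torch.bfloat16).requires_grad_()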

Half-precision training is enabled by setting precision to either fp16-mixed or bf16-mixed together with megatron_amp_O2=true. Parameter gradients are computed in the same half precision, and the precision of the gradient reduce-scatter across data-parallel GPUs can be set with optim.grad_sync_dtype.
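
For illustration, the relevant fields can be sketched with OmegaConf. The option names (precision, megatron_amp_O2, optim.grad_sync_dtype) come from the description above, but the nesting under trainer and model is an assumption and may differ between NeMo versions and model configs:

    from omegaconf import OmegaConf

    cfg = OmegaConf.create({
        "trainer": {"precision": "bf16-mixed"},   # or "fp16-mixed"
        "model": {
            "megatron_amp_O2": True,              # enable the half-precision recipe
            "optim": {
                "grad_sync_dtype": "bf16",        # dtype of the gradient reduce-scatter
            },
        },
    })
    print(OmegaConf.to_yaml(cfg))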

FP8 Training

Overview

The NVIDIA H100 GPU introduced support for a new datatype, FP8 (8-bit floating point), which enables higher throughput for matrix multiplies and convolutions. NeMo uses NVIDIA Transformer Engine (TE) to leverage these FP8 speedups. The following table summarizes the FP8-related arguments that can be configured in NeMo (example config setting). For a more detailed overview, refer to the TE documentation, in particular the FP8 format and recipe.
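
As an illustration, a minimal sketch of how FP8-related fields typically appear in a megatron-style NeMo model config is shown below. The key names (transformer_engine, fp8, fp8_hybrid, fp8_margin, fp8_amax_history_len, fp8_amax_compute_algo) follow public NeMo example configs but are assumptions here; their availability and defaults depend on the installed NeMo and TE versions:

    from omegaconf import OmegaConf

    fp8_cfg = OmegaConf.create({
        "model": {
            "transformer_engine": True,        # run supported layers through TE
            "fp8": True,                       # enable FP8 for supported GEMMs
            "fp8_hybrid": True,                # E4M3 forward, E5M2 backward
            "fp8_margin": 0,                   # margin applied to the scaling factor
            "fp8_amax_history_len": 1024,      # length of the amax history window
            "fp8_amax_compute_algo": "max",    # or "most_recent"
        },
    })
    print(OmegaConf.to_yaml(fp8_cfg))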

Resources