Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Mixed Precision Training
Mixed precision training improves computational efficiency by performing most operations in a low-precision format while selectively keeping a small amount of data in single precision to preserve critical information in key areas of the network. NeMo now supports FP16, BF16, and FP8 (via Transformer Engine) across most models. Further details will be provided shortly.
Half-precision Training
NeMo supports half-precision (FP16 and BF16) training via Megatron Core and the distributed optimizer. This training recipe performs all layer computations in half precision while keeping the model states (optimizer states and master parameters) in single precision. To avoid repeated data-type casting at each layer computation, Megatron Core keeps a separate copy of the half-precision parameters that is updated after each optimizer.step.
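The pattern can be illustrated in plain PyTorch. The following is a minimal sketch, not Megatron Core's actual implementation: BF16 compute parameters, FP32 master parameters and optimizer states, and the half-precision parameter copy refreshed after each optimizer.step.

    import torch

    # BF16 parameters used for all layer computation.
    model = torch.nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)
    # FP32 master parameters; the optimizer states are also kept in FP32.
    master_params = [p.detach().clone().float() for p in model.parameters()]
    optimizer = torch.optim.AdamW(master_params, lr=1e-4)

    # Forward and backward run in BF16; gradients land on the BF16 parameters.
    x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean()
    loss.backward()

    # Cast gradients once and update the FP32 master parameters.
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.float()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    model.zero_grad(set_to_none=True)

    # Refresh the persistent half-precision copy after the optimizer step,
    # so no per-layer casting is needed in the next forward pass.
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master)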
Half-precision training is enabled by setting precision to either fp16-mixed or bf16-mixed together with megatron_amp_O2=true. The parameter gradients are computed in the same half precision, and the precision of the gradient reduce-scatter across data-parallel GPUs can be set with optim.grad_sync_dtype.
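A minimal sketch of these overrides using OmegaConf is shown below. It assumes the usual NeMo megatron_gpt_config.yaml layout (trainer.precision, model.megatron_amp_O2, model.optim.*); the exact nesting and file path may differ for your model config.

    from omegaconf import OmegaConf

    # Hypothetical overrides for the settings described above; key placement is an
    # assumption based on the standard NeMo Megatron config layout.
    overrides = OmegaConf.create(
        {
            "trainer": {"precision": "bf16-mixed"},      # or "fp16-mixed"
            "model": {
                "megatron_amp_O2": True,                 # FP32 master weights and optimizer states
                "optim": {"grad_sync_dtype": "bf16"},    # dtype of the data-parallel reduce-scatter
            },
        }
    )

    cfg = OmegaConf.load("megatron_gpt_config.yaml")     # path is an assumption
    cfg = OmegaConf.merge(cfg, overrides)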
FP8 Training
Overview
The NVIDIA H100 GPU introduced support for a new datatype, FP8 (8-bit floating point), enabling higher throughput for matrix multiplies and convolutions. NeMo uses NVIDIA Transformer Engine (TE) to leverage FP8 speedups. The following table summarizes the FP8-related arguments that can be configured in NeMo (example config setting). For a more detailed overview, refer to the TE documentation, specifically the FP8 format and recipe.
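To show the mechanism NeMo configures under the hood, the sketch below uses Transformer Engine's Python API directly with a delayed-scaling FP8 recipe; the recipe values are illustrative, not NeMo's defaults, and NeMo normally drives these through its config rather than through user code.

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    # Delayed-scaling recipe: HYBRID uses E4M3 for the forward pass and E5M2 for
    # gradients; history length and amax algorithm are illustrative values.
    fp8_recipe = DelayedScaling(
        fp8_format=Format.HYBRID,
        amax_history_len=16,
        amax_compute_algo="max",
    )

    # A TE module; FP8 GEMMs require dimensions that are multiples of 16.
    model = te.Linear(1024, 1024).cuda()
    inp = torch.randn(32, 1024, device="cuda")

    # Run the forward pass with FP8 enabled for the TE modules inside the context.
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(inp)

    out.sum().backward()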
Resources
Intro to FP8, floating point formats, and mixed precision training
Performance optimizations that are natively supported in NeMo by enabling FP8 training with TE