Release Notes – Release 2.3¶
Key Features and Enhancements¶
[PyTorch] Sped up the import of transformer-engine by switching to lazy compilation of functions with torch.compile.
[PyTorch] Enabled FP8 weights when using FSDP.
[C][PyTorch] Added support for the Float8 block scaling recipe, as used in the DeepSeek v3 paper, for Hopper GPUs.
[PyTorch] Made miscellaneous fixes to reduce CPU overhead.
[PyTorch] Added support for CPU offloading for activation tensors when using FP8 attention.
[PyTorch] Enabled the MXFP8 recipe for the GroupedLinear module (see the sketch after this list).
[PyTorch] Added support for decoupling the weight gradient computation from the backward function of Transformer Engine modules. This lets users invoke the wgrad computation separately and gives them finer-grained control over when weight gradients are computed, to support certain advanced parallelism/overlap schemes.
Added support for RTX 5090.
Added support for staggered application of RoPE embeddings to the input sequences in a batch, depending on their starting positions.
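The new recipes plug into the existing fp8_autocast entry point. Below is a minimal sketch of running GroupedLinear under the MXFP8 recipe; the recipe class name MXFP8BlockScaling and the GroupedLinear(num_gemms, in_features, out_features) / forward(inp, m_splits) interface are assumptions based on the public API and may differ in your installed version.

```python
# Minimal sketch: GroupedLinear under the MXFP8 recipe via fp8_autocast.
# Assumptions (verify against your installed version): the recipe class is
# transformer_engine.common.recipe.MXFP8BlockScaling, and GroupedLinear takes
# (num_gemms, in_features, out_features) with forward(inp, m_splits).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

num_gemms, in_features, out_features = 4, 1024, 4096

# One module fusing several independent GEMMs (e.g., per-expert projections).
grouped = te.GroupedLinear(num_gemms, in_features, out_features, bias=False).cuda()

# Rows of the input are split among the GEMMs according to m_splits.
m_splits = [128] * num_gemms
x = torch.randn(sum(m_splits), in_features, device="cuda", requires_grad=True)

# MXFP8 requires hardware support; swapping in the Float8 block scaling recipe
# object instead would exercise the DeepSeek-style recipe on Hopper.
fp8_recipe = recipe.MXFP8BlockScaling()

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = grouped(x, m_splits)

y.sum().backward()
```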
Fixed Issues¶
[PyTorch] Fixed a numerical bug when using the custom DDP implementation from megatron-core.
[PyTorch] Fixed a crash when using the checkpoint method for activation recompute on non-Transformer Engine modules (see the sketch after this list).
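For reference, a minimal sketch of the scenario covered by the checkpoint fix: activation recompute applied to a module that is not a Transformer Engine module. It assumes transformer_engine.pytorch.checkpoint follows the same (function, *args) calling convention as torch.utils.checkpoint.checkpoint; verify against your installed version.

```python
# Minimal sketch: activation recompute on a non-Transformer Engine module.
# Assumes te.checkpoint follows the (function, *args) calling convention of
# torch.utils.checkpoint.checkpoint.
import torch
import transformer_engine.pytorch as te

# A plain PyTorch module, i.e. not a Transformer Engine module.
mlp = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Activations inside `mlp` are recomputed in the backward pass instead of stored.
y = te.checkpoint(mlp, x)
y.sum().backward()
```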
Known Issues in This Release¶
There are no known issues in this release.
Breaking Changes in This Release¶
[Jax] Praxis layers have been removed, as PAXML is no longer supported.
Deprecated Features¶
Installing Transformer Engine now requires the --no-build-isolation flag, whether installing the PyPI package or building from source. Support for installation with build isolation will be removed in a future release.
[PyTorch] CPU offloading of weight tensors is deprecated.
Miscellaneous¶
There are no miscellaneous issues in this release.