Release Notes – Release 0.13.0 (BETA)
Key Features and Enhancements
[pyTorch] Support for switching training precision between iterations.
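The sketch below shows what switching precision between training iterations might look like using the fp8_autocast context manager from transformer_engine.pytorch. It is a minimal illustration only: the module, dimensions, and the even/odd alternation schedule are assumptions for the example, not part of the release notes.

```python
# Minimal sketch: toggling FP8 on and off between training iterations
# with transformer_engine.pytorch.fp8_autocast. Model dimensions, data,
# and the alternation schedule below are illustrative assumptions.
import torch
import transformer_engine.pytorch as te

model = te.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(10):
    inp = torch.randn(16, 1024, device="cuda")
    # Alternate precision between iterations: FP8 on even steps,
    # the module's native higher precision on odd steps.
    use_fp8 = step % 2 == 0
    with te.fp8_autocast(enabled=use_fp8):
        out = model(inp)
    loss = out.float().sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```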
Fixed Issues
[pyTorch] Fixed the misaligned address issue in the unfused softmax path (https://github.com/NVIDIA/TransformerEngine/issues/295).
[pyTorch] Fixed an issue where, in some cases, using the cuDNN backend for fused attention would corrupt the random number generator state.
Enabled rigorous error checking in the FusedAttention backend to catch unsupported use cases.
[pyTorch] Fixed a bug in ONNX export that prevented users from specifying the type of attention mask.
[pyTorch] Fixed a bug in LayerNorm when using grouped query attention.
[JAX] Fixed a bug in the LayerNorm backward pass that resulted in incorrect sharding when using FSDP.
Known Issues in This Release
FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by running pip install flash-attn==1.0.9) before attempting to install Transformer Engine.
There is a known crash when using the TransformerLayer and MultiheadAttention APIs with the rotary_pos_emb option.
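As an illustration of the FlashAttention workaround, the commands below assume Transformer Engine is installed from its GitHub repository; the exact install source may differ in your environment, so adjust accordingly.

```bash
# Option 1: limit parallel compile jobs while FlashAttention v2 builds
# during the Transformer Engine install (install source is an assumption).
MAX_JOBS=1 pip install git+https://github.com/NVIDIA/TransformerEngine.git@main

# Option 2: install FlashAttention v1 first, then Transformer Engine.
pip install flash-attn==1.0.9
pip install git+https://github.com/NVIDIA/TransformerEngine.git@main
```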
Breaking Changes in This Release
There are no breaking changes in this release.
Deprecated Features
[pyTorch] The TransformerLayer arguments attention_softmax_in_fp32 and apply_query_key_layer_scaling are deprecated and will be removed in a future release. The default behavior is as if those arguments were set to True.
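A minimal migration sketch follows, using illustrative layer sizes: since the defaults already match the True behavior, the deprecated arguments can simply be dropped.

```python
# Migration sketch (sizes are illustrative assumptions).
import transformer_engine.pytorch as te

# Before (deprecated; will stop working in a future release):
# layer = te.TransformerLayer(1024, 4096, 16,
#                             attention_softmax_in_fp32=True,
#                             apply_query_key_layer_scaling=True)

# After: omit the arguments and rely on the default behavior.
layer = te.TransformerLayer(1024, 4096, 16)
```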