Release Notes: Release 0.13.0 (BETA)

Key Features and Enhancements

  • [pyTorch] Support for switching training precision between iterations.
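
    A minimal sketch of how this might be used is shown below. It assumes the
    precision switch is driven through the fp8_autocast context manager; the
    even/odd toggle and the layer sizes are illustrative only, not part of the
    release.

        # Sketch only: assumes precision switching is done via te.fp8_autocast;
        # the even/odd toggle below is purely illustrative.
        import torch
        import transformer_engine.pytorch as te

        model = te.Linear(1024, 1024).cuda()
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

        for step in range(10):
            inp = torch.randn(32, 1024, device="cuda")
            use_fp8 = step % 2 == 0  # switch training precision between iterations
            with te.fp8_autocast(enabled=use_fp8):
                out = model(inp)
            loss = out.float().sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()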

Fixed Issues

  • [pyTorch] Fixed the misaligned address issue in the unfused softmax path (https://github.com/NVIDIA/TransformerEngine/issues/295).

  • [pyTorch] Fixed an issue where, in some cases, using the cuDNN backend for fused attention would corrupt the random number generator state.

  • Enabled rigorous error checking in the FusedAttention backend to catch unsupported use cases.

  • [pyTorch] Fixed a bug in ONNX export that prevented users from specifying the type of attention mask.

  • [pyTorch] Fixed a bug in LayerNorm when using grouped query attention.

  • [JAX] Fixed a bug in the LayerNorm backward pass that resulted in incorrect sharding when using FSDP.

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation, or by installing FlashAttention v1 (e.g. running pip install flash-attn==1.0.9) before attempting to install Transformer Engine.

  • There is a known crash when using the TransformerLayer and MultiheadAttention APIs with the rotary_pos_emb option.

Breaking Changes in This Release

  • There are no breaking changes in this release.

Deprecated Features

  • [pyTorch] The TransformerLayer arguments attention_softmax_in_fp32 and apply_query_key_layer_scaling are deprecated, and will be removed in a future release. The default behavior is as if those arguments were set to True.
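
    As a hedged illustration (the layer sizes below are placeholders), code that
    currently passes these arguments can simply drop them, since the defaults
    already behave as if they were set to True:

        import transformer_engine.pytorch as te

        # Deprecated: passing attention_softmax_in_fp32 or
        # apply_query_key_layer_scaling explicitly. Omitting them gives the
        # same behavior as setting them to True.
        layer = te.TransformerLayer(
            hidden_size=1024,
            ffn_hidden_size=4096,
            num_attention_heads=16,
        )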