Release Notes: Release 0.13.0 (BETA)

Key Features and Enhancements

  • [pyTorch] Support for switching training precision between iterations.
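
    A minimal sketch of how this might be used is shown below. It assumes the
    precision switch is driven through the fp8_autocast context manager; the
    even/odd toggle and the layer sizes are illustrative only, not part of the
    release.

        # Sketch only: assumes precision switching is done via te.fp8_autocast;
        # the even/odd toggle below is purely illustrative.
        import torch
        import transformer_engine.pytorch as te

        model = te.Linear(1024, 1024).cuda()
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

        for step in range(10):
            inp = torch.randn(32, 1024, device="cuda")
            use_fp8 = step % 2 == 0  # switch training precision between iterations
            with te.fp8_autocast(enabled=use_fp8):
                out = model(inp)
            loss = out.float().sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()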

Fixed Issues

  • [pyTorch] Fixed the misaligned address issue in the unfused softmax path (https://github.com/NVIDIA/TransformerEngine/issues/295).

  • [pyTorch] Fixed an issue where, in some cases, using the cuDNN backend for fused attention would corrupt the random number generator state.

  • Enabled rigorous error checking in the FusedAttention backend to catch unsupported use cases.

  • [pyTorch] Fixed a bug in ONNX export that prevented users from specifying the type of attention mask.

  • [pyTorch] Fixed a bug in LayerNorm when using grouped query attention.

  • [JAX] Fixed a bug in the LayerNorm backward pass that resulted in incorrect sharding when using FSDP.

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation, or by installing FlashAttention v1 (e.g. running pip install flash-attn==1.0.9) before attempting to install Transformer Engine.

  • There is a known crash when using the TransformerLayer and MultiheadAttention APIs with the rotary_pos_emb option.

Breaking Changes in This Release

  • There are no breaking changes in this release.

Deprecated Features

  • [pyTorch] The TransformerLayer arguments attention_softmax_in_fp32 and apply_query_key_layer_scaling are deprecated, and will be removed in a future release. The default behavior is as if those arguments were set to True.
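
    As a hedged illustration (the layer sizes below are placeholders), code that
    currently passes these arguments can simply drop them, since the defaults
    already behave as if they were set to True:

        import transformer_engine.pytorch as te

        # Deprecated: passing attention_softmax_in_fp32 or
        # apply_query_key_layer_scaling explicitly. Omitting them gives the
        # same behavior as setting them to True.
        layer = te.TransformerLayer(
            hidden_size=1024,
            ffn_hidden_size=4096,
            num_attention_heads=16,
        )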