Release Notes

Release 0.11.0 (BETA)

Key Features and Enhancements

  • [pyTorch] Added the RMSNorm module.

  • [pyTorch] Added a normalization option to the LayerNormLinear, LayerNormMLP, and TransformerLayer modules that lets the user choose between LayerNorm and RMSNorm; see the usage sketch after this list.

  • [pyTorch] Added FlashAttention v2 support.

  • [pyTorch] Added support for Multi-Query and Grouped-Query Attention.

  • [pyTorch] Added cuDNN attention for long sequence lengths as a backend for DotProductAttention; see the attention sketch after this list.
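
The following is a minimal usage sketch of the new RMSNorm module and the normalization option. It assumes a CUDA device and Transformer Engine installed as transformer_engine.pytorch; sizes and keyword defaults are illustrative, not a definitive API reference.

    # Sketch: standalone RMSNorm and the normalization argument on fused modules.
    import torch
    import transformer_engine.pytorch as te

    hidden_size = 1024

    # Standalone RMSNorm module, analogous to te.LayerNorm.
    rmsnorm = te.RMSNorm(hidden_size, eps=1e-5).cuda()

    # Fused modules accept a normalization argument; "LayerNorm" remains the
    # default and "RMSNorm" selects the new behavior.
    mlp = te.LayerNormMLP(hidden_size, 4 * hidden_size,
                          normalization="RMSNorm").cuda()

    x = torch.randn(8, 16, hidden_size, device="cuda")
    y = rmsnorm(x)   # RMSNorm applied on its own
    z = mlp(x)       # RMSNorm fused into the MLP block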

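The second sketch illustrates the attention-related features (FlashAttention v2, Multi-Query/Grouped-Query Attention, and the cuDNN attention backend for DotProductAttention). The num_gqa_groups argument and the NVTE_FUSED_ATTN / NVTE_FLASH_ATTN environment variables are assumptions based on this release's feature list; consult the API documentation for the exact names and defaults.

    # Sketch: Grouped-Query Attention with backend selection (assumed names).
    import os

    # Opt into the cuDNN fused-attention backend; FlashAttention v2 is
    # similarly gated by NVTE_FLASH_ATTN (both names are assumptions).
    os.environ["NVTE_FUSED_ATTN"] = "1"

    import torch
    import transformer_engine.pytorch as te

    heads, kv_heads, head_dim = 16, 4, 64   # kv_heads=1 would be Multi-Query Attention
    attn = te.DotProductAttention(
        num_attention_heads=heads,
        kv_channels=head_dim,
        num_gqa_groups=kv_heads,
        attn_mask_type="causal",
    )

    seq_len, batch = 2048, 2
    q = torch.randn(seq_len, batch, heads, head_dim, device="cuda", dtype=torch.bfloat16)
    k = torch.randn(seq_len, batch, kv_heads, head_dim, device="cuda", dtype=torch.bfloat16)
    v = torch.randn(seq_len, batch, kv_heads, head_dim, device="cuda", dtype=torch.bfloat16)
    out = attn(q, k, v)   # [seq_len, batch, heads * head_dim]
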
Fixed Issues

  • Fixed issues with the ONNX export of the LayerNorm module.

  • Fixed a problem with discovery of the Transformer Engine library in Python virtual environments.

  • Fixed a crash that occurred when combining torch.compile with Transformer Engine modules.

Known Issues in This Release

  • FlashAttention v2, a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). To work around this issue, either set the environment variable MAX_JOBS=1 while installing Transformer Engine, or install FlashAttention v1 (e.g. pip install flash-attn==1.0.9) before installing Transformer Engine.

Breaking Changes in This Release

  • [JAX] The TransformerLayer argument attn_type has been removed; it is superseded by the argument attn_mask_type.

Deprecated Features

  • [pyTorch] The TransformerLayer arguments attention_softmax_in_fp32 and apply_query_key_layer_scaling are deprecated and will be removed in a future release. The default behavior is as if both arguments were set to True.