Release Notes

Release 0.11.0 (BETA)

Key Features and Enhancements

  • [pyTorch] Added the RMSNorm module.

  • [pyTorch] Added a normalization option to the LayerNormLinear, LayerNormMLP, and TransformerLayer modules that lets the user choose between LayerNorm and RMSNorm; see the usage sketch after this list.

  • [pyTorch] Added FlashAttention v2 support.

  • [pyTorch] Added support for Multi-Query and Grouped-Query Attention.

  • [pyTorch] Added cuDNN attention for long sequence lengths as a backend for DotProductAttention; see the attention sketch after this list.
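
The following is a minimal usage sketch of the new RMSNorm module and the normalization option. It assumes a CUDA device and Transformer Engine installed as transformer_engine.pytorch; sizes and keyword defaults are illustrative, not a definitive API reference.

    # Sketch: standalone RMSNorm and the normalization argument on fused modules.
    import torch
    import transformer_engine.pytorch as te

    hidden_size = 1024

    # Standalone RMSNorm module, analogous to te.LayerNorm.
    rmsnorm = te.RMSNorm(hidden_size, eps=1e-5).cuda()

    # Fused modules accept a normalization argument; "LayerNorm" remains the
    # default and "RMSNorm" selects the new behavior.
    mlp = te.LayerNormMLP(hidden_size, 4 * hidden_size,
                          normalization="RMSNorm").cuda()

    x = torch.randn(8, 16, hidden_size, device="cuda")
    y = rmsnorm(x)   # RMSNorm applied on its own
    z = mlp(x)       # RMSNorm fused into the MLP block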

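The second sketch illustrates the attention-related features (FlashAttention v2, Multi-Query/Grouped-Query Attention, and the cuDNN attention backend for DotProductAttention). The num_gqa_groups argument and the NVTE_FUSED_ATTN / NVTE_FLASH_ATTN environment variables are assumptions based on this release's feature list; consult the API documentation for the exact names and defaults.

    # Sketch: Grouped-Query Attention with backend selection (assumed names).
    import os

    # Opt into the cuDNN fused-attention backend; FlashAttention v2 is
    # similarly gated by NVTE_FLASH_ATTN (both names are assumptions).
    os.environ["NVTE_FUSED_ATTN"] = "1"

    import torch
    import transformer_engine.pytorch as te

    heads, kv_heads, head_dim = 16, 4, 64   # kv_heads=1 would be Multi-Query Attention
    attn = te.DotProductAttention(
        num_attention_heads=heads,
        kv_channels=head_dim,
        num_gqa_groups=kv_heads,
        attn_mask_type="causal",
    )

    seq_len, batch = 2048, 2
    q = torch.randn(seq_len, batch, heads, head_dim, device="cuda", dtype=torch.bfloat16)
    k = torch.randn(seq_len, batch, kv_heads, head_dim, device="cuda", dtype=torch.bfloat16)
    v = torch.randn(seq_len, batch, kv_heads, head_dim, device="cuda", dtype=torch.bfloat16)
    out = attn(q, k, v)   # [seq_len, batch, heads * head_dim]
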
Fixed Issues

  • Fixed issues with the ONNX export of the LayerNorm module.

  • Fixed a problem with discovery of the Transformer Engine library in Python virtual environments.

  • Fixed a crash that occurred when combining torch.compile with Transformer Engine modules.

Known Issues in This Release

  • FlashAttention v2, a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). To work around this issue, either set the environment variable MAX_JOBS=1 while installing Transformer Engine, or install FlashAttention v1 (e.g. pip install flash-attn==1.0.9) before installing Transformer Engine.

Breaking Changes in This Release

  • [JAX] The TransformerLayer argument attn_type has been removed; it is superseded by the argument attn_mask_type.

Deprecated Features

  • [pyTorch] The TransformerLayer arguments attention_softmax_in_fp32 and apply_query_key_layer_scaling are deprecated and will be removed in a future release. The default behavior is as if both arguments were set to True.