.. include:: /content/common.rsts

Release Notes |ndash| Release 1.4
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [C/pyTorch] Added support for the QuickGELU activation.
- [C/pyTorch] Added a fused RoPE implementation for improved performance.
- [C/pyTorch] Added support for zero-centered gamma in RMSNorm.
- [C/pyTorch] Added support for ALiBi slopes in all attention backends.
- [docs/pyTorch] Added a tutorial on accelerating Hugging Face Llama models with Transformer Engine.
- [JAX] Added support for sequence parallelism.
- [JAX] Added support for RoPE.
- [JAX] Added support for GELU.
- [JAX] Improved execution speed of grouped query attention (GQA).
- [Paddle] Added support for grouped query attention (GQA).

Fixed Issues
@@@@@@@@@@@@

- [pyTorch] Fixed an issue where uninitialized/unused module buffers resulted in increased memory usage with the ``fp8_model_init`` API call.
- [pyTorch] Fixed an issue in MultiheadAttention where the attention type was not properly passed down into granular API calls.
- [pyTorch] Fixed an issue that caused Transformer Engine to crash when used with pyTorch version >=\ |nbsp|\ 2.0 and <\ |nbsp|\ 2.1.
- [pyTorch] Fixed a convergence issue when using FP8 with activation recompute.
- [pyTorch] Fixed a numerical bug associated with use of pipeline parallelism.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable ``MAX_JOBS=1`` during Transformer Engine installation or by installing FlashAttention v1 (e.g. with the command ``pip install flash-attn==1.0.9``) before attempting to install Transformer Engine (see the sketch at the end of these notes).
- [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep Transformer Engine's behavior consistent across versions and backends, FlashAttention is disabled for the "cross attention with causal masking" use case when FlashAttention v2.1 or later is installed.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no breaking changes in this release.

Deprecated Features
@@@@@@@@@@@@@@@@@@@

There are no deprecated features in this release.

Miscellaneous Changes
@@@@@@@@@@@@@@@@@@@@@

FlashAttention v1 is no longer supported in Transformer Engine. Support for it was dropped in version 1.3. The minimum required FlashAttention version is v2.0.6.
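
For convenience, the FlashAttention v2 installation workaround described in the Known Issues section is sketched below. The ``pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable`` command is only an assumption standing in for however you normally install Transformer Engine; substitute your usual installation command.

.. code-block:: bash

   # Option 1: cap the number of parallel compilation jobs so that building
   # FlashAttention v2 does not exhaust host memory.
   MAX_JOBS=1 pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

   # Option 2 (as described in the Known Issues section): install
   # FlashAttention v1 before installing Transformer Engine.
   pip install flash-attn==1.0.9
   pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable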