Release Notes – Release 1.4

Key Features and Enhancements

  • [C/pyTorch] Added support for the QuickGELU activation (see the usage sketch after this list).

  • [C/pyTorch] Added a fused RoPE implementation for improved performance.

  • [C/pyTorch] Added support for zero-centered gamma in RMSNorm (also covered in the sketch after this list).

  • [C/pyTorch] Added support for ALiBi slopes in all attention backends.

  • [docs/pyTorch] Added a tutorial on accelerating HF Llama models with Transformer Engine.

  • [JAX] Added support for sequence parallelism.

  • [JAX] Added support for RoPE.

  • [JAX] Added support for GELU.

  • [JAX] Improved execution speed of grouped query attention (GQA).

  • [paddle] Added support for grouped query attention (GQA).
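
The sketch below illustrates two of the pyTorch-side additions above: selecting the QuickGELU activation in a fused LayerNorm + MLP block and enabling zero-centered gamma in RMSNorm. It is a minimal illustration rather than code from the release itself; the keyword value "qgelu" and the exact module signatures are assumptions based on this feature list and may differ in your installation.

    # Minimal sketch of the pyTorch API additions listed above.
    # Assumptions: the activation keyword accepts "qgelu" for QuickGELU, and
    # RMSNorm exposes a zero_centered_gamma flag; verify against your version.
    import torch
    import transformer_engine.pytorch as te

    hidden_size, ffn_hidden_size = 1024, 4096

    # Fused LayerNorm + MLP block using the QuickGELU activation.
    mlp = te.LayerNormMLP(hidden_size, ffn_hidden_size, activation="qgelu")

    # RMSNorm with a zero-centered gamma parameterization.
    norm = te.RMSNorm(hidden_size, zero_centered_gamma=True)

    x = torch.randn(16, 128, hidden_size, device="cuda")
    y = mlp(norm(x))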

Fixed Issues

  • [pyTorch] Fixed an issue where uninitialized/unused module buffers resulted in increased memory usage with the fp8_model_init API call (see the sketch after this list).

  • [pyTorch] Fixed an issue in MultiheadAttention where the attention type was not properly passed down into granular API calls.

  • [pyTorch] Fixed an issue that caused Transformer Engine to crash when used with pyTorch version >= 2.0 and < 2.1.

  • [pyTorch] Fixed a convergence issue when using FP8 with activation recompute.

  • [pyTorch] Fixed a numerical bug associated with use of pipeline parallelism.
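
The fp8_model_init fix above concerns the API sketched below. This is a minimal illustration of the call pattern only; the module sizes and the surrounding fp8_autocast usage are placeholder choices, and FP8 execution requires supported hardware.

    # Minimal sketch of fp8_model_init usage (pyTorch API); placeholder sizes.
    import torch
    import transformer_engine.pytorch as te

    # Modules created inside this context keep their parameters in FP8,
    # which is where the buffer-initialization memory issue was observed.
    with te.fp8_model_init(enabled=True):
        model = te.Linear(1024, 1024)

    # Forward passes still run under fp8_autocast (needs FP8-capable hardware).
    with te.fp8_autocast(enabled=True):
        out = model(torch.randn(16, 1024, device="cuda"))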

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation or by installing FlashAttention v1 (e.g. with the command pip install flash-attn==1.0.9) before attempting to install Transformer Engine. A sketch of the MAX_JOBS workaround appears after this list.

  • [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep behavior consistent between versions and backends, Transformer Engine disables FlashAttention for the use case “cross attention with causal masking” when FlashAttention v2.1 or later is installed.
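
The first workaround above can also be driven from Python rather than the shell, as in the sketch below. The package specifier is an assumption used for illustration; substitute whatever install command you normally use for Transformer Engine.

    # Hedged sketch of the MAX_JOBS=1 workaround for the FlashAttention v2
    # build-memory issue; the package spec below is a placeholder.
    import os
    import subprocess
    import sys

    env = dict(os.environ, MAX_JOBS="1")  # limit parallel build jobs
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "transformer_engine[pytorch]"],
        env=env,
    )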

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

Miscellaneous Changes

FlashAttention v1 is no longer supported in Transformer Engine. Support for it was dropped in version 1.3. The minimum required FlashAttention version is v2.0.6.