Release Notes – Release 1.1.0
Key Features and Enhancements
- [pyTorch] Memory usage is reduced when using the fp8_model_init API during inference (see the sketch after this list).
- [pyTorch] Memory usage is reduced when using the LayerNormLinear, LayerNormMLP, and TransformerLayer APIs.
- [JAX] Transformer Engine is migrated to JAX's new custom partitioning mechanism for parallelism of custom ops.
- [JAX] The attention operation's performance is improved when using cuDNN version 8.9.6 or greater.
- [C/C++] Transformer Engine can now be built as a subproject.
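Below is a minimal sketch of the fp8_model_init inference pattern from the first item; the module choice, sizes, and dtype are illustrative assumptions rather than part of this release note.

```python
import torch
import transformer_engine.pytorch as te

# fp8_model_init keeps module parameters in FP8 only, reducing memory
# use during inference, where high-precision master weights are not
# needed. Hidden/FFN sizes and dtype below are illustrative.
with te.fp8_model_init(enabled=True):
    mlp = te.LayerNormMLP(1024, 4096)  # hidden size, FFN hidden size

x = torch.randn(512, 8, 1024, device="cuda", dtype=torch.bfloat16)
with torch.no_grad(), te.fp8_autocast(enabled=True):
    y = mlp(x)
```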
Fixed Issues
- In some cases, passing non-contiguous tensors as Q, K, or V to DotProductAttention would result in the error "Exception: The provided qkv memory layout is not supported!" A sketch of the affected pattern follows.
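As illustration, a hedged sketch of the previously failing pattern (head count, shapes, and dtype are assumptions):

```python
import torch
import transformer_engine.pytorch as te

attn = te.DotProductAttention(num_attention_heads=8, kv_channels=64)

# A transposed view is non-contiguous; in earlier releases, passing it
# as Q, K, or V could raise the memory-layout exception quoted above.
x = torch.randn(2, 512, 8, 64, device="cuda", dtype=torch.bfloat16)
q = k = v = x.transpose(0, 1)  # (seq, batch, heads, dim), non-contiguous
out = attn(q, k, v)
```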
Known Issues in This Release
FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue by either of these means:

- Set the MAX_JOBS environment variable to 1 during Transformer Engine installation.
- Install FlashAttention v1 (e.g. by pip install flash-attn==1.0.9) before attempting to install Transformer Engine.
[pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). So that Transformer Engine preserves consistent behavior between versions and backends, FlashAttention is disabled for this use case (i.e. cross-attention with causal masking) when FlashAttention version 2.1+ is installed.
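For reference, the affected use case looks roughly like the following sketch (head count, sequence lengths, and dtype are assumptions); with FlashAttention 2.1+ installed, Transformer Engine serves this call through a non-FlashAttention backend:

```python
import torch
import transformer_engine.pytorch as te

# Cross-attention (KV sequence differs from Q) combined with a causal
# mask: the case where FlashAttention is disabled under v2.1+.
attn = te.DotProductAttention(
    num_attention_heads=16, kv_channels=64, attn_mask_type="causal"
)
q = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(256, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(256, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
out = attn(q, k, v)  # routed to a fused/unfused attention backend
```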
Breaking Changes in This Release
There are no breaking changes in this release.
Deprecated Features
There are no deprecated features in this release.