Release Notes – Release 0.11.0 (BETA)

Key Features and Enhancements
- [pyTorch] Added the RMSNorm module.
- [pyTorch] Added a normalization option to the LayerNormLinear, LayerNormMLP, and TransformerLayer modules to let the user choose between LayerNorm and RMSNorm normalization (see the first sketch after this list).
- [pyTorch] Added FlashAttention v2 support.
- [pyTorch] Added support for Multi-Query and Grouped-Query Attention (see the second sketch after this list).
- [pyTorch] Added cuDNN attention for long sequence lengths as a backend for DotProductAttention (see the third sketch after this list).
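A minimal sketch of the new normalization options, assuming this release's transformer_engine.pytorch API; the tensor shapes and sizes are illustrative:

```python
import torch
import transformer_engine.pytorch as te

hidden_size, ffn_size = 1024, 4096
x = torch.randn(128, 2, hidden_size, device="cuda")  # [seq, batch, hidden]

# Standalone RMSNorm module, used like te.LayerNorm.
norm = te.RMSNorm(hidden_size, eps=1e-5)

# The fused modules take a `normalization` keyword;
# "LayerNorm" remains the default.
mlp = te.LayerNormMLP(hidden_size, ffn_size, normalization="RMSNorm")

y = mlp(norm(x))
```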
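A hedged sketch of Grouped-Query Attention through DotProductAttention; the num_gqa_groups keyword and the [seq, batch, heads, head_dim] input layout are assumptions about this release's API, not taken from the notes above:

```python
import torch
import transformer_engine.pytorch as te

seq, batch, heads, kv_groups, head_dim = 128, 2, 16, 4, 64
attn = te.DotProductAttention(heads, head_dim, num_gqa_groups=kv_groups)

# K and V carry only kv_groups heads, each shared across a group of query
# heads; kv_groups == 1 corresponds to Multi-Query Attention.
q = torch.randn(seq, batch, heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn(seq, batch, kv_groups, head_dim, device="cuda", dtype=torch.bfloat16)
v = torch.randn(seq, batch, kv_groups, head_dim, device="cuda", dtype=torch.bfloat16)

out = attn(q, k, v)  # [seq, batch, heads * head_dim]
```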
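The attention backend is picked automatically, but it can be steered explicitly. The NVTE_FLASH_ATTN / NVTE_FUSED_ATTN environment variables in this sketch are assumptions about Transformer Engine's backend toggles; set them before constructing the attention module:

```python
import os

# Assumed backend toggles, read when attention modules are created.
os.environ["NVTE_FLASH_ATTN"] = "0"  # opt out of the FlashAttention backend
os.environ["NVTE_FUSED_ATTN"] = "1"  # prefer the cuDNN fused-attention backend

import transformer_engine.pytorch as te

attn = te.DotProductAttention(16, 64)  # backend resolved from the flags above
```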
Fixed Issues
- Fixed issues with the ONNX export of the LayerNorm module.
- Fixed a problem with discovery of the Transformer Engine library in the Python virtual environment.
- Fixed a crash occurring when trying to combine torch.compile with the Transformer Engine modules.
Known Issues in This Release
- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by running pip install flash-attn==1.0.9) before attempting to install Transformer Engine (see the sketch after this note).
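A concrete sketch of the two workarounds; `pip install .` here is a placeholder for however you normally install Transformer Engine (e.g. from a source checkout):

```bash
# Option 1: cap build parallelism while installing Transformer Engine.
MAX_JOBS=1 pip install .

# Option 2: install FlashAttention v1 first, then Transformer Engine.
pip install flash-attn==1.0.9
pip install .
```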
Breaking Changes in This Release
- [JAX] The TransformerLayer argument attn_type has been removed and is superseded by the argument attn_mask_type.
Deprecated Features
- [pyTorch] The TransformerLayer arguments attention_softmax_in_fp32 and apply_query_key_layer_scaling are deprecated, and will be removed in a future release. The default behavior is as if those arguments were set to True.