Transformer Engine Release Notes: Release 1.2.1

Key Features and Enhancements

  • [PyTorch] Added sliding window attention support to DotProductAttention.
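
    Below is a minimal sketch of how the new sliding window support might be used. The window_size=(left, right) argument and the tensor shapes are illustrative assumptions, not taken from these notes.

        # Hypothetical sketch; window_size is an assumed argument name for the
        # new sliding window support and may differ from the actual API.
        import torch
        import transformer_engine.pytorch as te

        attn = te.DotProductAttention(
            num_attention_heads=16,
            kv_channels=64,                # per-head dimension
            window_size=(128, 0),          # assumed: attend to at most 128 tokens to the left
        )

        # Default layout is [seq, batch, heads, head_dim].
        q = torch.randn(512, 2, 16, 64, dtype=torch.bfloat16, device="cuda")
        k = torch.randn(512, 2, 16, 64, dtype=torch.bfloat16, device="cuda")
        v = torch.randn(512, 2, 16, 64, dtype=torch.bfloat16, device="cuda")

        out = attn(q, k, v)                # [512, 2, 1024]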

  • [PyTorch] Improved the performance of DotProductAttention on Hopper GPUs when using the cuDNN backend.

  • [PyTorch] Added support for the Falcon architecture to TransformerLayer via the new parallel_attention_mlp option.
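
    A minimal sketch of enabling the Falcon-style parallel attention/MLP path. Only the parallel_attention_mlp option comes from these notes; the remaining arguments and sizes are illustrative.

        import torch
        import transformer_engine.pytorch as te

        layer = te.TransformerLayer(
            hidden_size=4096,
            ffn_hidden_size=16384,
            num_attention_heads=32,
            parallel_attention_mlp=True,   # attention and MLP consume the same input, as in Falcon
        )

        x = torch.randn(128, 2, 4096, dtype=torch.bfloat16, device="cuda")
        y = layer(x)                       # same shape as x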

  • [PyTorch] Improved the checkpointing logic when using fp8_model_init.
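
    A sketch of the save/restore pattern this change affects, assuming the fp8_model_init context manager and te.Linear from the Transformer Engine PyTorch API; the sizes and file name are placeholders.

        import torch
        import transformer_engine.pytorch as te

        with te.fp8_model_init(enabled=True):
            model = te.Linear(1024, 1024)          # parameters created directly in FP8

        torch.save(model.state_dict(), "fp8_linear.pt")   # checkpoint

        with te.fp8_model_init(enabled=True):
            restored = te.Linear(1024, 1024)
        restored.load_state_dict(torch.load("fp8_linear.pt"))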

  • [JAX] Added support for controlling the SM margin of the LayerNorm and RMSNorm kernels via the environment variables NVTE_FWD_LAYERNORM_SM_MARGIN and NVTE_BWD_LAYERNORM_SM_MARGIN.
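
    For example, the variables can be set before Transformer Engine is imported; the margin value of 8 is arbitrary and reserves that many SMs (for example, to let communication kernels overlap with the normalization kernels).

        import os

        os.environ["NVTE_FWD_LAYERNORM_SM_MARGIN"] = "8"   # forward-pass margin
        os.environ["NVTE_BWD_LAYERNORM_SM_MARGIN"] = "8"   # backward-pass margin

        import transformer_engine.jax  # import after the variables are set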

Fixed Issues

  • The weight gradient could be computed incorrectly in some cases when FP8 execution and sequence parallelism were used together.

  • Statistics were computed incorrectly during FP8 calibration.

  • Using torch.compile on the DotProductAttention module caused a crash.

  • Rotary position embeddings did not operate correctly during pipeline-parallel inference.

  • The decoder used an incorrect mask type in encoder-decoder architectures.

  • Exporting Transformer Engine modules to ONNX in recent versions of PyTorch did not work correctly.

  • Statistics could be computed incorrectly when training with FP8 in recent versions of PyTorch. For details, see https://github.com/NVIDIA/TransformerEngine/issues/600.

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358).

    You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by running pip install flash-attn==1.0.9) before attempting to install Transformer Engine.

  • [PyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention. (See https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference.) To keep Transformer Engine behavior consistent between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention 2.1 or later is installed.

Breaking Changes in This Release

  • There are no breaking changes in this release.

Deprecated Features

  • There are no deprecated features in this release.