Transformer Engine Release Notes: Release 1.2.1

Key Features and Enhancements

  • [PyTorch] Added sliding window attention support to DotProductAttention.
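
    Below is a minimal sketch of how the new sliding window support might be used. The window_size=(left, right) argument and the tensor shapes are illustrative assumptions, not taken from these notes.

        # Hypothetical sketch; window_size is an assumed argument name for the
        # new sliding window support and may differ from the actual API.
        import torch
        import transformer_engine.pytorch as te

        attn = te.DotProductAttention(
            num_attention_heads=16,
            kv_channels=64,                # per-head dimension
            window_size=(128, 0),          # assumed: attend to at most 128 tokens to the left
        )

        # Default layout is [seq, batch, heads, head_dim].
        q = torch.randn(512, 2, 16, 64, dtype=torch.bfloat16, device="cuda")
        k = torch.randn(512, 2, 16, 64, dtype=torch.bfloat16, device="cuda")
        v = torch.randn(512, 2, 16, 64, dtype=torch.bfloat16, device="cuda")

        out = attn(q, k, v)                # [512, 2, 1024]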

  • [PyTorch] Improved the performance of DotProductAttention on Hopper GPUs when using the cuDNN backend.

  • [PyTorch] Added support for the Falcon architecture to TransformerLayer via the new parallel_attention_mlp option.
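
    A minimal sketch of enabling the Falcon-style parallel attention/MLP path. Only the parallel_attention_mlp option comes from these notes; the remaining arguments and sizes are illustrative.

        import torch
        import transformer_engine.pytorch as te

        layer = te.TransformerLayer(
            hidden_size=4096,
            ffn_hidden_size=16384,
            num_attention_heads=32,
            parallel_attention_mlp=True,   # attention and MLP consume the same input, as in Falcon
        )

        x = torch.randn(128, 2, 4096, dtype=torch.bfloat16, device="cuda")
        y = layer(x)                       # same shape as x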

  • [PyTorch] Improved the checkpointing logic when using fp8_model_init.
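
    A sketch of the save/restore pattern this change affects, assuming the fp8_model_init context manager and te.Linear from the Transformer Engine PyTorch API; the sizes and file name are placeholders.

        import torch
        import transformer_engine.pytorch as te

        with te.fp8_model_init(enabled=True):
            model = te.Linear(1024, 1024)          # parameters created directly in FP8

        torch.save(model.state_dict(), "fp8_linear.pt")   # checkpoint

        with te.fp8_model_init(enabled=True):
            restored = te.Linear(1024, 1024)
        restored.load_state_dict(torch.load("fp8_linear.pt"))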

  • [JAX] Added support for controlling the SM margin of the LayerNorm and RMSNorm kernels via the environment variables NVTE_FWD_LAYERNORM_SM_MARGIN and NVTE_BWD_LAYERNORM_SM_MARGIN.
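
    For example, the variables can be set before Transformer Engine is imported; the margin value of 8 is arbitrary and reserves that many SMs (for example, to let communication kernels overlap with the normalization kernels).

        import os

        os.environ["NVTE_FWD_LAYERNORM_SM_MARGIN"] = "8"   # forward-pass margin
        os.environ["NVTE_BWD_LAYERNORM_SM_MARGIN"] = "8"   # backward-pass margin

        import transformer_engine.jax  # import after the variables are set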

Fixed Issues

  • The weight gradient could be computed incorrectly in some cases when FP8 execution and sequence parallelism were used together.

  • Statistics were computed incorrectly during FP8 calibration.

  • Using torch.compile on the DotProductAttention module caused a crash.

  • Rotary position embeddings did not operate correctly during pipeline-parallel inference.

  • The decoder used an incorrect mask type in encoder-decoder architectures.

  • Exporting Transformer Engine modules to ONNX in recent versions of PyTorch did not work correctly.

  • Statistics could be computed incorrectly when training with FP8 in recent versions of PyTorch. For details, see https://github.com/NVIDIA/TransformerEngine/issues/600.

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358).

    You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by running pip install flash-attn==1.0.9) before attempting to install Transformer Engine.

  • [PyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention. (See https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference.) To keep Transformer Engine behavior consistent between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention 2.1 or later is installed.

Breaking Changes in This Release

  • There are no breaking changes in this release.

Deprecated Features

  • There are no deprecated features in this release.