Release Notes – Release 1.2.0

Key Features and Enhancements

[pyTorch] Sliding window support is added for DotProductAttention.
[pyTorch] Performance of DotProductAttention is increased on Hopper GPUs by utilizing cuDNN.
[pyTorch] Support for the Falcon architecture is added in TransformerLayer via the new option parallel_attention_mlp.
[pyTorch] Checkpointing logic when using fp8_model_init is improved.
[JAX] Support is added for controlling the SM margin in the LayerNorm and RMSNorm kernels via the environment variables NVTE_FWD_LAYERNORM_SM_MARGIN and NVTE_BWD_LAYERNORM_SM_MARGIN.
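As a minimal sketch of the SM margin setting, the two environment variables named above can be set from Python before the framework is imported. The helper name `set_layernorm_sm_margins` and the value 8 are illustrative assumptions; these notes do not recommend specific values or state when the variables are read, so setting them before any Transformer Engine import is simply the conservative choice.

```python
import os


def set_layernorm_sm_margins(forward_margin, backward_margin):
    """Set the SM margin variables for the LayerNorm/RMSNorm kernels.

    Hypothetical helper: only the two variable names come from the
    release notes. Environment values must be strings.
    """
    os.environ["NVTE_FWD_LAYERNORM_SM_MARGIN"] = str(forward_margin)
    os.environ["NVTE_BWD_LAYERNORM_SM_MARGIN"] = str(backward_margin)


# Assumed example value: leave 8 SMs free in both passes.
set_layernorm_sm_margins(8, 8)

# Import JAX and Transformer Engine only after this point, so the
# settings are already in place whenever the library reads them.
```

Setting the variables in the launching shell instead has the same effect and avoids any dependence on import order.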

Fixed Issues

Weight gradients could be computed incorrectly in some cases when FP8 execution and sequence parallelism were used together.
Statistics were computed incorrectly during FP8 calibration.
Using torch.compile on DotProductAttention module caused a crash.
Rotary embeddings during pipeline-parallel inference did not operate correctly.
An incorrect mask type was used by the decoder in encoder-decoder architectures.
Exporting Transformer Engine modules to ONNX did not work correctly with recent versions of PyTorch.

Known Issues

FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358).

You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by running pip install flash-attn==1.0.9) before attempting to install Transformer Engine.
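The first workaround can be sketched as a small Python wrapper that prepares a pip invocation with MAX_JOBS=1 in its environment. The function name `build_install` and the `TE_REQUIREMENT` placeholder are assumptions for illustration; substitute whatever requirement or source path you normally use to install Transformer Engine.

```python
import os
import subprocess
import sys

# Placeholder (assumption): replace with your actual Transformer Engine
# requirement specifier or local source path.
TE_REQUIREMENT = "<transformer-engine-requirement-or-path>"


def build_install(requirement, max_jobs=1):
    """Return (command, environment) for a pip install with limited build jobs."""
    env = dict(os.environ)            # copy, so the current process is untouched
    env["MAX_JOBS"] = str(max_jobs)   # limits parallel compile jobs during the build
    cmd = [sys.executable, "-m", "pip", "install", requirement]
    return cmd, env


cmd, env = build_install(TE_REQUIREMENT)

# Uncomment to actually run the installation:
# subprocess.run(cmd, env=env, check=True)
```

Invoking pip through `sys.executable -m pip` keeps the installation in the same interpreter environment that runs the script.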
[pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention. (See https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference.) To keep Transformer Engine behavior consistent between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention version 2.1 or later is installed.
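To illustrate why the two behaviors differ, the following pure-Python sketch builds both mask alignments for a cross-attention case where the query and key lengths differ. Per the linked FlashAttention README, versions before 2.1 align the causal mask to the top-left corner of the attention matrix, and 2.1+ aligns it to the bottom-right corner. The function `causal_mask` and its `align` parameter are hypothetical names used only for this illustration; they are not part of FlashAttention or Transformer Engine.

```python
def causal_mask(seqlen_q, seqlen_k, align="bottom_right"):
    """Return a seqlen_q x seqlen_k mask; 1 = may attend, 0 = masked out."""
    if align not in ("top_left", "bottom_right"):
        raise ValueError("align must be 'top_left' or 'bottom_right'")
    # Bottom-right alignment shifts the diagonal so that the last query
    # can attend to the last key; top-left alignment applies no shift.
    offset = seqlen_k - seqlen_q if align == "bottom_right" else 0
    return [
        [1 if j <= i + offset else 0 for j in range(seqlen_k)]
        for i in range(seqlen_q)
    ]


# Cross-attention example: 2 queries attending over 5 keys.
top_left = causal_mask(2, 5, align="top_left")
bottom_right = causal_mask(2, 5, align="bottom_right")

for name, mask in (("top-left", top_left), ("bottom-right", bottom_right)):
    print(name)
    for row in mask:
        print(" ", row)
```

When the query and key lengths are equal, as in self-attention, the offset is zero and both alignments produce the same mask, which is why only cross-attention with causal masking is affected.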