.. include:: /content/common.rsts

Release Notes |ndash| Release 1.2.1
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [pyTorch] Added sliding window support for ``DotProductAttention``.
- [pyTorch] Increased the performance of ``DotProductAttention`` on Hopper GPUs when using cuDNN.
- [pyTorch] Added support for the Falcon architecture in ``TransformerLayer`` via the new option ``parallel_attention_mlp`` (a usage sketch appears at the end of these notes).
- [pyTorch] Improved checkpointing logic when using ``fp8_model_init``.
- [JAX] Added support for controlling the SM margin of the ``LayerNorm`` and ``RMSNorm`` kernels via the environment variables ``NVTE_FWD_LAYERNORM_SM_MARGIN`` and ``NVTE_BWD_LAYERNORM_SM_MARGIN`` (see the example at the end of these notes).

Fixed Issues
@@@@@@@@@@@@

- Weight gradients could be computed incorrectly in some cases when FP8 execution and sequence parallelism were used together.
- Statistics were computed incorrectly during FP8 calibration.
- Using ``torch.compile`` on the ``DotProductAttention`` module caused a crash.
- Rotary embeddings did not operate correctly during pipeline-parallel inference.
- An incorrect mask type was used by the decoder in encoder-decoder architectures.
- Exporting Transformer Engine modules to ONNX did not work correctly in recent versions of pyTorch.
- Statistics could be computed incorrectly when training with FP8 in recent versions of pyTorch. For details, see https://github.com/NVIDIA/TransformerEngine/issues/600.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable ``MAX_JOBS=1`` during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by running ``pip install flash-attn==1.0.9``) before attempting to install Transformer Engine.
- [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep Transformer Engine behavior consistent across versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention v2.1 or later is installed.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- There are no breaking changes in this release.

Deprecated Features
@@@@@@@@@@@@@@@@@@@

- There are no deprecated features in this release.
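
Usage Sketches
@@@@@@@@@@@@@@

The following is a minimal sketch of the new ``parallel_attention_mlp`` option of ``TransformerLayer``. The layer dimensions, head count, and input shape are illustrative assumptions rather than values taken from these release notes, and a CUDA-capable GPU is assumed.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te

    # Falcon-style layer: the attention and MLP branches are computed in
    # parallel instead of sequentially.
    layer = te.TransformerLayer(
        hidden_size=1024,          # illustrative value
        ffn_hidden_size=4096,      # illustrative value
        num_attention_heads=16,    # illustrative value
        parallel_attention_mlp=True,
    )

    # Inputs use the default [sequence, batch, hidden] layout.
    x = torch.randn(128, 2, 1024, device="cuda")
    y = layer(x)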
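The SM margin environment variables added for JAX can be set before Transformer Engine is imported, as in the sketch below. The value ``16`` is an arbitrary example; the variable names are those listed above.

.. code-block:: python

    import os

    # Reserve a margin of streaming multiprocessors that the LayerNorm/RMSNorm
    # kernels will not occupy (e.g. to leave room for overlapped communication).
    os.environ["NVTE_FWD_LAYERNORM_SM_MARGIN"] = "16"
    os.environ["NVTE_BWD_LAYERNORM_SM_MARGIN"] = "16"

    import transformer_engine.jax  # import after setting the variables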