Release Notes – Release 1.2.1
Key Features and Enhancements
- [pyTorch] Added sliding window support for DotProductAttention (sketch below).
- [pyTorch] Increased performance of DotProductAttention on Hopper GPUs using cuDNN.
- [pyTorch] Added support for the Falcon architecture in TransformerLayer via the new option parallel_attention_mlp (sketch below).
- [pyTorch] Improved checkpointing logic when using fp8_model_init (sketch below).
- [JAX] Added support for controlling the SM margin of the LayerNorm and RMSNorm kernels via the environment variables NVTE_FWD_LAYERNORM_SM_MARGIN and NVTE_BWD_LAYERNORM_SM_MARGIN (sketch below).
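A minimal sketch of the sliding-window feature, assuming it is exposed through a window_size=(left, right) argument; the argument name and placement are assumptions here, so check the DotProductAttention API reference for your installed version:

```python
# Minimal sketch, not a definitive API reference: sliding-window attention
# with DotProductAttention. window_size is an assumed argument name.
import torch
import transformer_engine.pytorch as te

attn = te.DotProductAttention(num_attention_heads=16, kv_channels=64)

# Default "sbhd" layout: [sequence, batch, heads, head_dim].
q = torch.randn(128, 2, 16, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn(128, 2, 16, 64, dtype=torch.bfloat16, device="cuda")
v = torch.randn(128, 2, 16, 64, dtype=torch.bfloat16, device="cuda")

# Each query attends only to the 64 previous positions and none ahead.
out = attn(q, k, v, window_size=(64, 0))
```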
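A minimal sketch of the Falcon-style layout enabled by parallel_attention_mlp; the layer sizes here are illustrative:

```python
# Minimal sketch: TransformerLayer with the Falcon-style parallel
# attention/MLP layout. Sizes are illustrative, not prescriptive.
import torch
import transformer_engine.pytorch as te

layer = te.TransformerLayer(
    hidden_size=4096,
    ffn_hidden_size=16384,
    num_attention_heads=32,
    params_dtype=torch.bfloat16,
    parallel_attention_mlp=True,  # attention and MLP branches share the same input
)

# Default layout: [sequence, batch, hidden].
x = torch.randn(128, 2, 4096, dtype=torch.bfloat16, device="cuda")
y = layer(x)
```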
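A minimal sketch of fp8_model_init, whose checkpointing logic this release improves; it assumes the context-manager form with an enabled flag:

```python
# Minimal sketch: creating a module whose parameters are initialized
# directly in FP8, skipping the higher-precision master copies.
import transformer_engine.pytorch as te

with te.fp8_model_init(enabled=True):
    linear = te.Linear(4096, 4096)

# Checkpointing such modules goes through the usual state dict path.
state = linear.state_dict()
```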
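A minimal sketch of the new environment variables, assuming they take an integer count of SMs to withhold from the kernels; the values are illustrative and must be set before the kernels are first launched:

```python
# Minimal sketch: reserving SMs around the LayerNorm/RMSNorm kernels,
# e.g. to leave room for overlapped communication kernels.
import os

os.environ["NVTE_FWD_LAYERNORM_SM_MARGIN"] = "8"  # forward-pass margin
os.environ["NVTE_BWD_LAYERNORM_SM_MARGIN"] = "8"  # backward-pass margin

# Import Transformer Engine only after the variables are set.
import transformer_engine.jax
```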
Fixed Issues
- Weight gradients could be computed incorrectly in some cases when FP8 execution and sequence parallelism were used together.
- Statistics were computed incorrectly during FP8 calibration.
- Using torch.compile on the DotProductAttention module caused a crash.
- Rotary position embeddings did not work correctly during pipeline-parallel inference.
- The decoder in encoder-decoder architectures used an incorrect mask type.
- Exporting Transformer Engine modules to ONNX did not work correctly in recent versions of PyTorch.
- Statistics could be computed incorrectly when training with FP8 in recent versions of PyTorch. For details, see https://github.com/NVIDIA/TransformerEngine/issues/600.
Known Issues in This Release
- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable MAX_JOBS=1 during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by running pip install flash-attn==1.0.9) before attempting to install Transformer Engine (sketch after this list).
- [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep Transformer Engine behavior consistent between versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention v2.1 or later is installed.
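A minimal sketch of the MAX_JOBS workaround driven from Python; the package specifier is illustrative, and running MAX_JOBS=1 pip install ... directly in a shell is equivalent:

```python
# Minimal sketch: cap the number of parallel compile jobs used when
# FlashAttention v2 is built during Transformer Engine installation.
import os
import subprocess
import sys

os.environ["MAX_JOBS"] = "1"
subprocess.check_call([sys.executable, "-m", "pip", "install", "transformer_engine"])
```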
Breaking Changes in This Release
There are no breaking changes in this release.
Deprecated Features
There are no deprecated features in this release.