.. include:: /content/common.rsts

Release Notes |ndash| Release 1.6
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [pyTorch] Added a new ``make_graphed_callables`` API call for |NVIDIA(r)| CUDA\ |reg| graph capture, including FP8 support. A usage sketch appears at the end of these notes.
- [pyTorch] Added experimental support for two boolean arguments in the ``DelayedScaling`` FP8 recipe (``fp8_dpa`` and ``fp8_mha``) to support FP8 attention. A usage sketch appears at the end of these notes.

Fixed Issues
@@@@@@@@@@@@

- [pyTorch] Fixed a numerical issue with storing weights in FP8 via the ``fp8_model_init`` API call.
- [pyTorch] Fixed a bug that caused PyTorch modules to store unnecessary activations for the backward pass, leading to excessive memory use when training with frozen weights.
- [JAX] Fixed an internal bug that caused an incorrect shape to be passed for the layernorm gradient.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

These issues are unchanged from 24.04.

- FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (https://github.com/Dao-AILab/flash-attention/issues/358). You can work around this issue either by setting the environment variable ``MAX_JOBS=1`` during Transformer Engine installation, or by installing FlashAttention v1 (e.g. by executing ``pip install flash-attn==1.0.9``) before attempting to install Transformer Engine.
- [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep Transformer Engine's behavior consistent across versions and backends, FlashAttention is disabled for this use case (cross-attention with causal masking) when FlashAttention 2.1 or later is installed.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no breaking changes in this release.

Deprecated Features
@@@@@@@@@@@@@@@@@@@

These deprecations are unchanged from 24.04.

- [JAX] The arguments ``num_heads``, ``dropout_rate``, ``output_layernorm``, ``apply_residual_connection_post_layernorm``, and ``fuse_qkv`` are deprecated in the ``MultiHeadAttention`` API. They are replaced respectively with ``num_attention_heads``, ``attention_dropout``, ``input_layernorm``, ``return_layernorm_output``, and ``fused_qkv_params``. A migration sketch appears at the end of these notes.
- FlashAttention v1 is no longer supported in Transformer Engine. The minimum required version is v2.0.6.
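
Usage Sketches
@@@@@@@@@@@@@@

The sketches below illustrate the items referenced above. They are minimal, unofficial examples; consult the Transformer Engine API reference for the exact signatures.

First, capturing a Transformer Engine module into a CUDA graph with the new ``make_graphed_callables`` call, including FP8 execution. The FP8-related keyword names (``fp8_enabled``, ``fp8_recipe``) and the module choice are assumptions for illustration.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling

    # A single TE layer and a sample input with the shapes used during training.
    layer = te.Linear(1024, 1024).cuda()
    sample_input = torch.randn(32, 1024, device="cuda", requires_grad=True)

    # Capture forward and backward into a CUDA graph. The FP8 keyword names
    # below are assumptions; see the API reference for the exact signature.
    graphed_layer = te.make_graphed_callables(
        layer,
        (sample_input,),
        fp8_enabled=True,
        fp8_recipe=DelayedScaling(),
    )

    # The graphed callable is used like the original module.
    out = graphed_layer(torch.randn(32, 1024, device="cuda", requires_grad=True))
    out.sum().backward()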
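
Next, enabling the experimental FP8 attention paths through the two new ``DelayedScaling`` booleans, ``fp8_dpa`` and ``fp8_mha``. The surrounding model and input shapes are illustrative only.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    # Enable the experimental FP8 attention options in the recipe.
    recipe = DelayedScaling(fp8_format=Format.HYBRID, fp8_dpa=True, fp8_mha=True)

    model = te.TransformerLayer(
        hidden_size=1024,
        ffn_hidden_size=4096,
        num_attention_heads=16,
    ).cuda()
    inp = torch.randn(128, 2, 1024, device="cuda")  # (seq, batch, hidden)

    # Run the forward pass under FP8 autocast with the recipe above.
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = model(inp)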
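
Finally, a migration sketch for the deprecated JAX ``MultiHeadAttention`` arguments. The module path and the ``head_dim`` argument are assumptions; the renamed arguments listed above are the point of the example.

.. code-block:: python

    from transformer_engine.jax.flax import MultiHeadAttention

    # Deprecated argument names:
    mha_old = MultiHeadAttention(
        head_dim=64,
        num_heads=16,
        dropout_rate=0.1,
    )

    # Replacement argument names:
    mha_new = MultiHeadAttention(
        head_dim=64,
        num_attention_heads=16,
        attention_dropout=0.1,
    )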