Release Notes: Release 1.7

Key Features and Enhancements

  • [JAX] Added support for SwiGLU, gated/non-gated ReLU, Quick GeLU, and squared ReLU activations (see the activation sketch after this list).

  • [pyTorch] Added support for attention bias and various QKV formats when using context parallelism.

  • [pyTorch] Expanded the Linear API to handle zero input tokens for MoE-like use cases (see the zero-token sketch after this list).

  • [pyTorch] Added support for upstream AMP (torch.amp.autocast) in the checkpoint API (see the checkpoint sketch after this list).

  • [pyTorch] Added the squared ReLU activation.

  • [pyTorch] Updated flash-attention support to version 2.5.8.

  • [paddle-paddle] Added support for gradient accumulation fusion.
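
The newly supported activations are standard functions. The sketch below is a minimal jax.numpy reference for what each one computes (it also covers the squared ReLU added on the pyTorch side); the function names are illustrative and are not part of the Transformer Engine API.

```python
# Reference sketch of the newly supported activations, in plain jax.numpy.
import jax
import jax.numpy as jnp


def squared_relu(x):
    # ReLU followed by an elementwise square.
    return jnp.square(jax.nn.relu(x))


def quick_gelu(x):
    # Sigmoid-based GeLU approximation: x * sigmoid(1.702 * x).
    return x * jax.nn.sigmoid(1.702 * x)


def swiglu(x):
    # Gated activation: split the last dimension in half and gate one half
    # with SiLU (swish) of the other.
    a, b = jnp.split(x, 2, axis=-1)
    return jax.nn.silu(a) * b


def gated_relu(x):
    # Same gating pattern with ReLU in place of SiLU.
    a, b = jnp.split(x, 2, axis=-1)
    return jax.nn.relu(a) * b


x = jnp.ones((4, 8))
print(swiglu(x).shape)  # (4, 4): gated variants halve the last dimension
```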
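
The zero-token case arises in MoE routing when an expert receives no tokens on a given step. Below is a minimal sketch, assuming transformer_engine.pytorch.Linear with its usual (in_features, out_features) signature and a CUDA device:

```python
# Zero-token input to a Transformer Engine Linear layer, as in an MoE expert
# that receives no tokens on a given step. Assumes a CUDA device is available.
import torch
import transformer_engine.pytorch as te

linear = te.Linear(1024, 4096, bias=True).cuda()

# An "empty" batch: zero tokens routed to this expert, hidden size 1024.
x = torch.empty(0, 1024, device="cuda", dtype=torch.float32)

y = linear(x)
print(y.shape)  # torch.Size([0, 4096])
```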
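
The checkpoint API can now run under upstream AMP. Below is a minimal sketch, assuming transformer_engine.pytorch.checkpoint follows the torch.utils.checkpoint call convention (callable first, then its inputs) and that te.LayerNormMLP and a CUDA device are available; the module and tensor shapes are illustrative.

```python
# Activation checkpointing under upstream AMP (torch.amp.autocast).
import torch
import transformer_engine.pytorch as te

layer = te.LayerNormMLP(hidden_size=1024, ffn_hidden_size=4096).cuda()
x = torch.randn(128, 2, 1024, device="cuda", requires_grad=True)

with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    # Assumes te.checkpoint(callable, *inputs), mirroring torch.utils.checkpoint;
    # activations are recomputed in the backward pass under the same autocast state.
    y = te.checkpoint(layer, x)

y.sum().backward()
```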

Fixed Issues

  • [pyTorch] Fixed an uninitialized TP group error that could occur when training with certain tensor parallel configs.

  • [pyTorch] Fixed a bug that occurred when loading a checkpoint with calibrated high-precision weights.

  • [pyTorch] Improved the documentation for the attention mask.

  • [JAX] Fixed a bug where activation shapes did not match their corresponding sharding constraints.

  • [JAX] Fixed an internal bug that caused an incorrect shape to be passed for the LayerNorm gradient.

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.