# Transformer Engine v2.16 Release Notes

## Key Features and Enhancements

- [Common] Improved the performance of the split-overlap reduce-scatter GEMMs. ([#2056](https://github.com/NVIDIA/TransformerEngine/pull/2056))
- [Common] Improved the fused MoE auxiliary loss kernel performance for models with a large number of experts. ([#2758](https://github.com/NVIDIA/TransformerEngine/pull/2758))
- [Common] Optimized MXFP8 and NVFP4 dequantize kernels for improved performance. ([#2865](https://github.com/NVIDIA/TransformerEngine/pull/2865))
- [Common] Improved performance of the MXFP8 quantization kernels. ([#2958](https://github.com/NVIDIA/TransformerEngine/pull/2958))
- [PyTorch] Added `pad_between_seqs` support for non-CP and CP (A2A and P2P) with FA3 + THD (varlen) attention. ([#2596](https://github.com/NVIDIA/TransformerEngine/pull/2596))
- [PyTorch] Added role-based custom quantization control, enabling recipes to target specific modules and tensor types. ([#2620](https://github.com/NVIDIA/TransformerEngine/pull/2620))
- [PyTorch] Added end-to-end Mixtral MoE examples showing TE GroupedLinear integration with HuggingFace models for BF16 and FP8 training. ([#2642](https://github.com/NVIDIA/TransformerEngine/pull/2642))
- [PyTorch] Increased performance of the CPU activation offloading path in some cases. ([#2793](https://github.com/NVIDIA/TransformerEngine/pull/2793))
- [PyTorch] Reduced the CPU overhead in the GroupedLinear module and operation. ([#2666](https://github.com/NVIDIA/TransformerEngine/pull/2666), [#2900](https://github.com/NVIDIA/TransformerEngine/pull/2900), [#2957](https://github.com/NVIDIA/TransformerEngine/pull/2957))
- [PyTorch] Added CUDA Graph capture support for GroupedLinear and grouped MoE operations on supported configurations. ([#2923](https://github.com/NVIDIA/TransformerEngine/pull/2923))
- [PyTorch] Added FlashAttention 4 support for attention head dimension 256. ([#2932](https://github.com/NVIDIA/TransformerEngine/pull/2932))
- [JAX] Improved MoE permutation kernel performance. ([#2975](https://github.com/NVIDIA/TransformerEngine/pull/2975))
- [JAX] Improved JAX tutorial documentation with updated examples and guidance. ([#2976](https://github.com/NVIDIA/TransformerEngine/pull/2976))
- [Common, PyTorch] Added bias and dbias support for GroupedLinear layers. ([#2885](https://github.com/NVIDIA/TransformerEngine/pull/2885))
- [Common, PyTorch] Added variable grouped swizzle support for flexible grouped tensor memory layouts. ([#2914](https://github.com/NVIDIA/TransformerEngine/pull/2914))
- [Common, PyTorch] Implemented a row-scaled NVFP4 forward propagation recipe. ([#2931](https://github.com/NVIDIA/TransformerEngine/pull/2931))
- [Common, PyTorch] Expanded grouped GEMM support with NVFP4 on Blackwell and FP8 block scaling on Hopper. ([#2971](https://github.com/NVIDIA/TransformerEngine/pull/2971))
- [Common, JAX] Added a top-k operation for faster MoE routing. ([#2890](https://github.com/NVIDIA/TransformerEngine/pull/2890))
- [Common, JAX] Enabled the cuDNN fused attention backend for no-mask bidirectional sliding-window attention. ([#2961](https://github.com/NVIDIA/TransformerEngine/pull/2961))
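
For readers unfamiliar with the top-k step in MoE routing mentioned in the list above, here is a minimal pure-Python sketch of what such an operation computes. It is not the Transformer Engine kernel or its API: the function name `topk_route` and the example logits are hypothetical, and normalizing the weights with a softmax over only the selected experts is one common convention assumed here.

```python
import heapq
import math


def topk_route(router_logits, k):
    """Pick the k highest-scoring experts for each token.

    Hypothetical helper for illustration only; not a Transformer Engine API.
    Returns, per token, a list of (expert_index, weight) pairs where the
    weights are a softmax over the k selected logits.
    """
    routes = []
    for logits in router_logits:
        # Indices of the k largest logits, highest first.
        top = heapq.nlargest(k, range(len(logits)), key=lambda e: logits[e])
        # Softmax restricted to the selected experts (shifted for stability).
        peak = logits[top[0]]
        exps = [math.exp(logits[e] - peak) for e in top]
        total = sum(exps)
        routes.append([(e, x / total) for e, x in zip(top, exps)])
    return routes


# Two tokens, four experts (made-up logits).
example_logits = [
    [0.1, 2.0, -1.0, 1.0],
    [3.0, 0.0, 3.5, -2.0],
]
example_routes = topk_route(example_logits, k=2)
```

Each token ends up with `k` expert indices and weights that sum to 1. A fused kernel would perform the same selection for all tokens in parallel rather than in a Python loop.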

## Fixed Issues

- [PyTorch] Fixed variable-length attention cache reuse across devices and inference/training modes. ([#2728](https://github.com/NVIDIA/TransformerEngine/pull/2728))
- [PyTorch] Fixed FSDP2 memory leaks for FP8 weight workspaces and transpose caches. ([#2805](https://github.com/NVIDIA/TransformerEngine/pull/2805))
- [PyTorch] Fixed TE fuser behavior in `torch.no_grad()` paths by avoiding invalid gradient-flag updates on non-leaf tensors. ([#2919](https://github.com/NVIDIA/TransformerEngine/pull/2919))
- [PyTorch] Fixed distributed checkpoint loading for FSDP2 for models initialized with `QuantizedModelInit`. ([#2974](https://github.com/NVIDIA/TransformerEngine/pull/2974))
- [Common, PyTorch] Fixed cuBLAS grouped GEMM when weight dimensions are not divisible by 128. ([#2954](https://github.com/NVIDIA/TransformerEngine/pull/2954))
- [Common, PyTorch] Fixed int32 overflow and `-1` sentinel value handling in `moe_permute`. ([#2907](https://github.com/NVIDIA/TransformerEngine/pull/2907))
- [Common, PyTorch] Fixed context-parallel FlashAttention output handling when FA3 is installed without FA2. ([#2825](https://github.com/NVIDIA/TransformerEngine/pull/2825))
- [Common, PyTorch] Disabled RHT quantization fusion on unsupported GPU architectures to avoid launch failures. ([#2968](https://github.com/NVIDIA/TransformerEngine/pull/2968))
- [PyTorch] Fixed a crash in GroupedLinear weight-gradient allocation. ([#3049](https://github.com/NVIDIA/TransformerEngine/pull/3049))
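
To illustrate the class of bug behind the `moe_permute` fix listed above, the sketch below shows how a flat element offset can exceed the signed 32-bit range, and how a -1 entry in a routing map can be skipped instead of being used as an index. It is a pure-Python model under assumed shapes, not the actual kernel code: `wrap_int32`, `flat_offset`, and `valid_assignments` are hypothetical names, and the meaning given to -1 is an assumption.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Model two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


def flat_offset(row, hidden_size):
    """Offset of the first element of `row` in a row-major [rows, hidden_size] buffer."""
    return row * hidden_size


def valid_assignments(routing_map):
    """Yield (token, expert) pairs, skipping -1 entries.

    Assumption for this sketch: -1 marks a token slot with no expert
    (for example, a dropped token), so it must never be used as an index.
    """
    for token, expert in enumerate(routing_map):
        if expert != -1:
            yield token, expert


# Assumed sizes: 600,000 permuted rows with hidden size 4096.
offset = flat_offset(599_999, 4096)
overflows = offset > INT32_MAX   # the true offset does not fit in int32
wrapped = wrap_int32(offset)     # what wrapping 32-bit arithmetic would give

kept = list(valid_assignments([2, -1, 0, -1, 1]))
```

With these assumed sizes the true offset is larger than `INT32_MAX`, so a 32-bit index would wrap to a negative value. Computing offsets in 64-bit integers and filtering sentinel entries before indexing avoids both problems.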

## Breaking Changes in This Release

- [Common, PyTorch] The original FP8 delayed-scaling fused attention path has been removed. FP8 attention now uses the current cuDNN-backed implementation. ([#2959](https://github.com/NVIDIA/TransformerEngine/pull/2959))
- [Common, PyTorch, JAX] Removed the legacy `f16_max512` fused-attention backend. BF16/FP16 attention is now routed through the maintained arbitrary-sequence backend; code that explicitly selects the old backend must be updated. ([#2949](https://github.com/NVIDIA/TransformerEngine/pull/2949))

## Deprecated Features

There are no deprecated features in this release.