# Transformer Engine v2.12 Release Notes


## Key Features and Enhancements


- Made miscellaneous improvements and fixes to the documentation.

- \[C] Improved performance of NVFP4 quantization kernels. ([#2412](https://github.com/NVIDIA/TransformerEngine/pull/2412))

- \[C] Documented environment variables. ([#2552](https://github.com/NVIDIA/TransformerEngine/pull/2552))

- \[PyTorch] Added fused permute+pad and unpermute+unpad operations for FP8 optimization. ([#1921](https://github.com/NVIDIA/TransformerEngine/pull/1921))

- \[PyTorch] Improved performance in CPU-limited scenarios.

- \[PyTorch] Added support for Sliding Window Attention (left, right) with fused attention. ([#2477](https://github.com/NVIDIA/TransformerEngine/pull/2477))

- \[PyTorch] Improved the performance of MXFP8 and NVFP4 by fusing swizzling into quantization. ([#2486](https://github.com/NVIDIA/TransformerEngine/pull/2486))

- \[PyTorch] Added CUDA Graphs support for activation recomputation. ([#2518](https://github.com/NVIDIA/TransformerEngine/pull/2518))

- \[JAX] Added a tutorial for integrating TE/JAX quantization into existing frameworks. ([#2423](https://github.com/NVIDIA/TransformerEngine/pull/2423))

- \[JAX] Added custom partitioning for permutation primitives. ([#2591](https://github.com/NVIDIA/TransformerEngine/pull/2591))
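
For readers unfamiliar with the (left, right) notation in the Sliding Window Attention entry above, the following is a minimal pure-Python sketch of the attention pattern such a window describes. It does not call Transformer Engine; the function name `sliding_window_mask`, the equal query/key length, and the treatment of a negative bound as unbounded are assumptions made for illustration only.

```python
def sliding_window_mask(seq_len, left, right):
    """Build a seq_len x seq_len boolean matrix; True means query i may attend to key j.

    Hypothetical helper for illustration, not a Transformer Engine API.
    Assumption: a negative bound means "no limit" on that side.
    """
    mask = []
    for i in range(seq_len):
        # Lowest and highest key positions visible to query i.
        lo = 0 if left < 0 else max(0, i - left)
        hi = seq_len - 1 if right < 0 else min(seq_len - 1, i + right)
        mask.append([lo <= j <= hi for j in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Each query sees up to 2 keys to its left and 1 key to its right.
    for row in sliding_window_mask(5, left=2, right=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, `left=-1, right=0` reduces to an ordinary causal mask, and `left=-1, right=-1` to full attention.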


## Fixed Issues

- \[C] Fixed SM120 compilation with CUDA 12. ([#2482](https://github.com/NVIDIA/TransformerEngine/pull/2482))

- \[C] Fixed overflow in padding and unpadding kernels. ([#2548](https://github.com/NVIDIA/TransformerEngine/pull/2548))

- \[C] Fixed a numerical issue in ``sort_chunks_by_index``. ([#2566](https://github.com/NVIDIA/TransformerEngine/pull/2566))

- \[C] Fixed a numerical issue in swizzling blockwise E8 scales. ([#2589](https://github.com/NVIDIA/TransformerEngine/pull/2589))

- \[PyTorch] Fixed an `AttributeError` when checkpointing a model with MXFP8 parameters. ([#2427](https://github.com/NVIDIA/TransformerEngine/pull/2427))

- \[PyTorch] Fixed cross-entropy loss calculation when some tokens are ignored. ([#2476](https://github.com/NVIDIA/TransformerEngine/pull/2476))

- \[PyTorch] Fixed ``Float8Tensor.contiguous`` autograd support. ([#2533](https://github.com/NVIDIA/TransformerEngine/pull/2533))

- \[PyTorch] Fixed multiple CPU offloading issues. ([#2535](https://github.com/NVIDIA/TransformerEngine/pull/2535))

- \[PyTorch] Fixed uninitialized ``permuted_scale`` values. ([#2547](https://github.com/NVIDIA/TransformerEngine/pull/2547))

- \[PyTorch] Fixed FP8 quantization for the second MLP in ``LayerNormMLP``. ([#2577](https://github.com/NVIDIA/TransformerEngine/pull/2577))

- \[PyTorch] Fixed ONNX tests and added FP8 attention export support. ([#2598](https://github.com/NVIDIA/TransformerEngine/pull/2598))

- \[JAX] Removed unused TE DPA dtype handling to improve cuDNN backend dtype detection. ([#2485](https://github.com/NVIDIA/TransformerEngine/pull/2485))

- \[JAX] Fixed segment-position calculation from segment IDs in the `SequenceDescriptor` class. ([#2523](https://github.com/NVIDIA/TransformerEngine/pull/2523))

- \[JAX] Fixed bugs in permutation custom partitioning. ([#2617](https://github.com/NVIDIA/TransformerEngine/pull/2617))

- \[JAX] Fixed an issue in the encoder and MNIST examples caused by a change in the dataset path. ([#2625](https://github.com/NVIDIA/TransformerEngine/pull/2625))
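
As background for the cross-entropy entry above, the sketch below shows one common convention for a mean cross-entropy loss when some tokens are ignored: ignored positions contribute nothing to the sum and are excluded from the divisor. These notes do not describe the specific defect that was fixed, so this is a generic pure-Python illustration, not the Transformer Engine implementation. The function name `mean_cross_entropy` and the sentinel value `-100` (the PyTorch default `ignore_index`) are assumptions.

```python
import math


def mean_cross_entropy(logits, targets, ignore_index=-100):
    """Average cross-entropy over the tokens whose target is not ignore_index.

    Hypothetical helper for illustration, not a Transformer Engine API.
    logits: list of per-token score lists; targets: list of class indices.
    """
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored tokens affect neither the sum nor the count
        # Numerically stable log-sum-exp of the token's logits.
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[target]
        counted += 1
    # Assumption: return 0.0 when every token is ignored, avoiding division by zero.
    return total / counted if counted else 0.0
```

Dividing by the number of counted tokens rather than the total number of tokens keeps the loss scale independent of how many positions are masked out.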


## Breaking Changes in This Release

There are no breaking changes in this release.

## Deprecated Features

There are no deprecated features in this release.