# Transformer Engine v2.12 Release Notes

## Key Features and Enhancements

- Made miscellaneous improvements and fixes to the documentation.
- \[C] Improved performance of NVFP4 quantization kernels. ([#2412](https://github.com/NVIDIA/TransformerEngine/pull/2412))
- \[C] Documented environment variables. ([#2552](https://github.com/NVIDIA/TransformerEngine/pull/2552))
- \[PyTorch] Added fused permute+pad and unpermute+unpad operations for FP8 optimization. ([#1921](https://github.com/NVIDIA/TransformerEngine/pull/1921))
- \[PyTorch] Improved performance in CPU-limited scenarios.
- \[PyTorch] Added support for Sliding Window Attention (left, right) with fused attention. ([#2477](https://github.com/NVIDIA/TransformerEngine/pull/2477))
- \[PyTorch] Improved the performance of MXFP8 and NVFP4 by fusing the swizzling into the quantization. ([#2486](https://github.com/NVIDIA/TransformerEngine/pull/2486))
- \[PyTorch] Added CUDA graph support for activation recomputation. ([#2518](https://github.com/NVIDIA/TransformerEngine/pull/2518))
- \[JAX] Added a tutorial for integrating TE/JAX quantization into existing frameworks. ([#2423](https://github.com/NVIDIA/TransformerEngine/pull/2423))
- \[JAX] Added custom partitioning for permutation primitives. ([#2591](https://github.com/NVIDIA/TransformerEngine/pull/2591))

## Fixed Issues

- \[C] Fixed SM120 compilation with CUDA 12. ([#2482](https://github.com/NVIDIA/TransformerEngine/pull/2482))
- \[C] Fixed overflow in padding and unpadding kernels. ([#2548](https://github.com/NVIDIA/TransformerEngine/pull/2548))
- \[C] Fixed a numerical issue in `sort_chunks_by_index`. ([#2566](https://github.com/NVIDIA/TransformerEngine/pull/2566))
- \[C] Fixed a numerical issue in swizzling blockwise E8 scales. ([#2589](https://github.com/NVIDIA/TransformerEngine/pull/2589))
- \[PyTorch] Fixed an `AttributeError` when checkpointing a model with MXFP8 parameters. ([#2427](https://github.com/NVIDIA/TransformerEngine/pull/2427))
- \[PyTorch] Fixed cross-entropy loss calculation when some tokens are ignored. ([#2476](https://github.com/NVIDIA/TransformerEngine/pull/2476))
- \[PyTorch] Fixed `Float8Tensor.contiguous` autograd support. ([#2533](https://github.com/NVIDIA/TransformerEngine/pull/2533))
- \[PyTorch] Fixed multiple CPU offloading issues. ([#2535](https://github.com/NVIDIA/TransformerEngine/pull/2535))
- \[PyTorch] Fixed uninitialized `permuted_scale` values. ([#2547](https://github.com/NVIDIA/TransformerEngine/pull/2547))
- \[PyTorch] Fixed FP8 quantization for the second MLP in `LayerNormMLP`. ([#2577](https://github.com/NVIDIA/TransformerEngine/pull/2577))
- \[PyTorch] Fixed ONNX tests and added FP8 attention export support. ([#2598](https://github.com/NVIDIA/TransformerEngine/pull/2598))
- \[JAX] Removed unused TE DPA dtype handling to improve cuDNN backend dtype detection. ([#2485](https://github.com/NVIDIA/TransformerEngine/pull/2485))
- \[JAX] Fixed segment-position calculation from segment IDs in the `SequenceDescriptor` class. ([#2523](https://github.com/NVIDIA/TransformerEngine/pull/2523))
- \[JAX] Fixed bugs in permutation custom partitioning. ([#2617](https://github.com/NVIDIA/TransformerEngine/pull/2617))
- \[JAX] Fixed an issue in the encoder and MNIST examples caused by a moved dataset path. ([#2625](https://github.com/NVIDIA/TransformerEngine/pull/2625))

## Breaking Changes in This Release

There are no breaking changes in this release.

## Deprecated Features

There are no deprecated features in this release.