# Transformer Engine v2.15 Release Notes

## Key Features and Enhancements

- [PyTorch] Added support for Flash Attention 4. ([#2432](https://github.com/NVIDIA/TransformerEngine/pull/2432))
- [PyTorch] Added support for MXFP8 attention. ([#2719](https://github.com/NVIDIA/TransformerEngine/pull/2719))
- [PyTorch] Added support for the QGeGLU activation both in `te.ops` and in the fused grouped MLP path using GEMM + activation fusion. ([#2855](https://github.com/NVIDIA/TransformerEngine/pull/2855))
- [PyTorch] Added support for per-token bias probability scaling both in `te.ops` and in the fused grouped MLP path using GEMM + activation fusion. ([#2864](https://github.com/NVIDIA/TransformerEngine/pull/2864))
- [PyTorch] Added support for NVFP4 weight quantization in the fused Adam optimizer. ([#2797](https://github.com/NVIDIA/TransformerEngine/pull/2797))
- [PyTorch, Common] Added Triton kernels to support mHC (Manifold-Constrained Hyper-Connections). ([#2790](https://github.com/NVIDIA/TransformerEngine/pull/2790))
- [PyTorch, Common] Added support for dequantizing MXFP8 grouped tensors. ([#2722](https://github.com/NVIDIA/TransformerEngine/pull/2722))
- [Common] Added support for unswizzling scaling factors. ([#2837](https://github.com/NVIDIA/TransformerEngine/pull/2837), [#2732](https://github.com/NVIDIA/TransformerEngine/pull/2732))
- [PyTorch] Added Newton–Schulz orthogonalization via cuSOLVERMp for distributed orthogonalization workloads. ([#2706](https://github.com/NVIDIA/TransformerEngine/pull/2706))
- [PyTorch] Added an `NVTE_BACKWARD_OVERRIDE=high_precision|dequantized` environment variable to control backward precision behavior (see the first sketch below this list). ([#2644](https://github.com/NVIDIA/TransformerEngine/pull/2644))
- [PyTorch] Added a debug-tools feature that allows dumping tensors before and after quantization for numerical debugging. ([#2645](https://github.com/NVIDIA/TransformerEngine/pull/2645))
- [PyTorch] Optimized FP8 block-scaling AllGather for FSDP2 to reduce communication overhead. ([#2789](https://github.com/NVIDIA/TransformerEngine/pull/2789))
- [PyTorch] Added an example demonstrating high-precision weight initialization with `fully_shard`. ([#2785](https://github.com/NVIDIA/TransformerEngine/pull/2785))
- [PyTorch] Expanded fused grouped MLP support via `te.ops` by relaxing the weight-dimension divisibility requirement from 256 to 64. ([#2856](https://github.com/NVIDIA/TransformerEngine/pull/2856))
- [PyTorch] Added `torch.compile` support for the MoE permute utility functions (see the second sketch below this list). ([#2686](https://github.com/NVIDIA/TransformerEngine/pull/2686))
- [Common, PyTorch] Improved the performance of NVFP4 quantization by refactoring the amax compute kernel. ([#2820](https://github.com/NVIDIA/TransformerEngine/pull/2820))
- [JAX] Reduced the memory cost of THD seqlen and offset computation from `O(T²)` to `O(T)` for long sequences. ([#2522](https://github.com/NVIDIA/TransformerEngine/pull/2522))
- [JAX] Added MXFP8 grouped quantize + GEMM support. ([#2763](https://github.com/NVIDIA/TransformerEngine/pull/2763))
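The backward-precision override is controlled entirely through the environment, so no model code changes are required. A minimal usage sketch (only the variable name and its two values come from this release; the surrounding module usage is illustrative):

```python
import os

# Must be set before Transformer Engine runs any backward kernels.
# Valid values per this release: "high_precision" or "dequantized".
os.environ["NVTE_BACKWARD_OVERRIDE"] = "high_precision"

import torch
import transformer_engine.pytorch as te

# Any TE layer's backward pass now follows the override.
layer = te.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)
layer(x).sum().backward()
```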
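With `torch.compile` support for the MoE permute utilities, the permute/unpermute pair can now sit inside a compiled region. A minimal sketch, assuming the utilities in question are the `moe_permute`/`moe_unpermute` functions exported from `transformer_engine.pytorch` and a top-1 routing layout; exact names and signatures may differ:

```python
import torch
from transformer_engine.pytorch import moe_permute, moe_unpermute

@torch.compile
def route_tokens(tokens: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    # Gather tokens into expert-contiguous order...
    permuted, row_id_map = moe_permute(tokens, indices)
    # ...(the expert MLPs would run on `permuted` here)...
    # ...then scatter the results back to the original token order.
    return moe_unpermute(permuted, row_id_map)

tokens = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
indices = torch.randint(0, 8, (16, 1), device="cuda", dtype=torch.int32)  # top-1 expert ids
out = route_tokens(tokens, indices)
```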
## Fixed Issues

- [PyTorch] Fixed a numerical bug where stale columnwise weight data would be used for post-validation training steps. ([#2929](https://github.com/NVIDIA/TransformerEngine/pull/2929))
- [PyTorch] Fixed redundant memory usage when using NVFP4 parameters. ([#2834](https://github.com/NVIDIA/TransformerEngine/pull/2834))
- [JAX] Fixed the JAX extension build with `NVTE_UB_WITH_MPI=1`. ([#2835](https://github.com/NVIDIA/TransformerEngine/pull/2835))
- [Common] Fixed a numerical bug in the MoE fused router for large top-K and expert counts. ([#2821](https://github.com/NVIDIA/TransformerEngine/pull/2821))
- [Common] Fixed an illegal memory access in `register_user_buffer_collective` on Ampere (and older) GPUs when using user buffers for comm-GEMM overlap. ([#2859](https://github.com/NVIDIA/TransformerEngine/pull/2859))
- [Build] Fixed a build crash when compiling from source with `NVTE_CUDA_ARCHS=120`. ([#2832](https://github.com/NVIDIA/TransformerEngine/pull/2832))

## Known Issues

- [PyTorch] When building a grouped MLP module via `te.ops.Sequential` in order to use the GEMM + activation fusion, the kernel may produce non-deterministic results in the single grouped-weight case (i.e., when the environment variable `NVTE_GROUPED_LINEAR_SINGLE_PARAM` and the corresponding module argument `single_grouped_weight` are set).
- [PyTorch] Enabling fused grouped MLP via `te.ops` requires `cudnn-frontend` library version `1.23.0`. In case of issues, please ensure that the right version of `CuTeDSL` is correctly installed:

  ```
  python -m pip uninstall -y \
      cutlass \
      nvidia-cutlass \
      nvidia-cutlass-dsl \
      nvidia-cutlass-dsl-libs-base \
      nvidia-cutlass-dsl-libs-cu13 \
      nvidia-cudnn-frontend
  python -m pip install -U pip setuptools wheel
  python -m pip install --no-cache-dir "nvidia-cutlass-dsl[cu13]==4.4.1"
  python -m pip install --no-cache-dir "nvidia-cudnn-frontend[cutedsl]==1.23.0"
  ```

## Breaking Changes in This Release

There are no breaking changes in this release.

## Deprecated Features

There are no deprecated features in this release.