Release Notes: Release 2.6

Key Features and Enhancements

  • [PyTorch] Added support for gradient accumulation fusion when using FSDP from megatron-core.

  • [PyTorch] Optimized memory usage when using NVIDIA® CUDA® graphs with Transformer Engine via the make_graphed_callables function (a sketch follows this list).

  • [PyTorch] Optimized performance of permute fusion kernels for MoE.

  • [PyTorch] Added support for ONNX export of Transformer Engine modules (a sketch follows this list).

  • [PyTorch] Added a save_original_input option to the Linear and GroupedLinear modules to decouple row-wise (forward) and column-wise (backward) quantization. This option saves memory for certain workloads and training recipes (a sketch follows this list).

  • [PyTorch] Improved performance of MXFP8 quantization kernels.

  • [Core] Improved performance of KV caching kernels.
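
The following is a minimal sketch of capturing a Transformer Engine module with CUDA graphs through make_graphed_callables, as mentioned above. The module, shapes, and argument layout are illustrative and assume the torch.cuda.make_graphed_callables-style interface (modules plus sample inputs); consult the API reference for the exact arguments.

    import torch
    import transformer_engine.pytorch as te

    # Illustrative module and shapes; other TE modules can be captured the same way.
    model = te.Linear(1024, 1024).cuda()
    sample_args = (torch.randn(32, 1024, device="cuda"),)

    # Capture the forward and backward passes into CUDA graphs; replaying the
    # captured graphs avoids per-call kernel launch overhead.
    graphed_model = te.make_graphed_callables(model, sample_args)

    out = graphed_model(torch.randn(32, 1024, device="cuda"))
    out.sum().backward()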
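
Below is a heavily hedged sketch of exporting a Transformer Engine module to ONNX, as mentioned above, using only the standard torch.onnx.export entry point; any TE-specific export context or settings required by this release are not shown and should be taken from the documentation.

    import torch
    import transformer_engine.pytorch as te

    # Illustrative module; FP8 execution may need additional export handling.
    model = te.Linear(1024, 1024).cuda().eval()
    dummy_input = torch.randn(1, 1024, device="cuda")

    # Standard PyTorch ONNX export of the TE module.
    torch.onnx.export(model, (dummy_input,), "te_linear.onnx")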
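
Finally, a minimal sketch of the save_original_input option named above, assuming it is a constructor keyword argument of the Linear module; the FP8 recipe and shapes are illustrative only.

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling

    # save_original_input=True keeps the original input for the backward pass,
    # decoupling row-wise (forward) from column-wise (backward) quantization.
    linear = te.Linear(4096, 4096, save_original_input=True).cuda()

    x = torch.randn(8, 4096, device="cuda", requires_grad=True)
    with te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling()):
        y = linear(x)
    y.sum().backward()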

Fixed Issues

  • [PyTorch] Fixed an issue in the LayerNormLinear module where the returned normalization output had a different shape than the input tensor (a sketch follows this list).

  • [PyTorch] Fixed an issue with the align_size calculation in FP8 padding/unpadding modules.

  • [PyTorch] Made miscellaneous fixes and enhancements to the fusible ops (te.sequential) API.

  • [PyTorch] Reduced CPU overhead in several workloads, including the DelayedScaling recipe, MXFP8 MoE, and pipeline parallelism.

  • [PyTorch] Fixed a bug in the multi-tensor Adam kernel that incorrectly downcast an FP32 tensor to BF16.

  • [PyTorch] Fixed an issue with caching FP8 weights when running validation steps between training steps (a sketch follows this list).

  • [PyTorch] Fixed a logical error that could lead to selecting a sub-optimal attention backend when a better-performing backend is available.

  • [PyTorch] Fixed miscellaneous errors during runtime loading of shared libraries by expanding search paths.

  • [PyTorch] Fixed a use-after-free error in cases where quantization and normalization are unfused.

  • [JAX] Fixed a crash in grouped GEMM with CUDA 12.9.1.

  • [JAX] Fixed a build failure with JAX v0.7.0 caused by the removal of jax.extend.ffi.
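
A minimal sketch of the LayerNormLinear usage affected by the shape fix above, assuming the module's return_layernorm_output option; sizes are illustrative.

    import torch
    import transformer_engine.pytorch as te

    layer = te.LayerNormLinear(1024, 4096, return_layernorm_output=True).cuda()

    x = torch.randn(32, 1024, device="cuda")
    out, ln_out = layer(x)

    # With the fix, the returned normalization output matches the input shape.
    assert ln_out.shape == x.shape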
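
And a minimal sketch of the pattern addressed by the FP8 weight-caching fix above, interleaving a no-grad validation step with FP8 training; the loop structure and recipe choice are illustrative.

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling

    model = te.Linear(1024, 1024).cuda()
    recipe = DelayedScaling()

    for step in range(10):
        # Training step under FP8 autocast.
        x = torch.randn(32, 1024, device="cuda")
        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
            loss = model(x).sum()
        loss.backward()

        # Occasional validation between training steps; cached FP8 weights must
        # stay consistent when training resumes.
        if step % 5 == 0:
            with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=recipe):
                model(torch.randn(32, 1024, device="cuda"))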

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

Miscellaneous

There are no miscellaneous issues in this release.