Release Notes: Release 2.6

Key Features and Enhancements

  • [PyTorch] Added support for gradient accumulation fusion when using FSDP from megatron-core.

  • [PyTorch] Optimized memory usage when using NVIDIA® CUDA® graphs with Transformer Engine via the make_graphed_callables function (a sketch follows this list).

  • [PyTorch] Optimized performance of permute fusion kernels for MoE.

  • [PyTorch] Added support for ONNX export of Transformer Engine modules (a sketch follows this list).

  • [PyTorch] Added a save_original_input option to the Linear and GroupedLinear modules to decouple row-wise (forward) and column-wise (backward) quantization. This option saves memory for certain workloads and training recipes (a sketch follows this list).

  • [PyTorch] Improved performance of MXFP8 quantization kernels.

  • [Core] Improved performance of KV caching kernels.
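
The following is a minimal sketch of capturing a Transformer Engine module with CUDA graphs through make_graphed_callables, as mentioned above. The module, shapes, and argument layout are illustrative and assume the torch.cuda.make_graphed_callables-style interface (modules plus sample inputs); consult the API reference for the exact arguments.

    import torch
    import transformer_engine.pytorch as te

    # Illustrative module and shapes; other TE modules can be captured the same way.
    model = te.Linear(1024, 1024).cuda()
    sample_args = (torch.randn(32, 1024, device="cuda"),)

    # Capture the forward and backward passes into CUDA graphs; replaying the
    # captured graphs avoids per-call kernel launch overhead.
    graphed_model = te.make_graphed_callables(model, sample_args)

    out = graphed_model(torch.randn(32, 1024, device="cuda"))
    out.sum().backward()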
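
Below is a heavily hedged sketch of exporting a Transformer Engine module to ONNX, as mentioned above, using only the standard torch.onnx.export entry point; any TE-specific export context or settings required by this release are not shown and should be taken from the documentation.

    import torch
    import transformer_engine.pytorch as te

    # Illustrative module; FP8 execution may need additional export handling.
    model = te.Linear(1024, 1024).cuda().eval()
    dummy_input = torch.randn(1, 1024, device="cuda")

    # Standard PyTorch ONNX export of the TE module.
    torch.onnx.export(model, (dummy_input,), "te_linear.onnx")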
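
Finally, a minimal sketch of the save_original_input option named above, assuming it is a constructor keyword argument of the Linear module; the FP8 recipe and shapes are illustrative only.

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling

    # save_original_input=True keeps the original input for the backward pass,
    # decoupling row-wise (forward) from column-wise (backward) quantization.
    linear = te.Linear(4096, 4096, save_original_input=True).cuda()

    x = torch.randn(8, 4096, device="cuda", requires_grad=True)
    with te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling()):
        y = linear(x)
    y.sum().backward()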

Fixed Issues

  • [PyTorch] Fixed an issue in the LayerNormLinear module where the returned normalization output had a different shape than the input tensor (a sketch follows this list).

  • [PyTorch] Fixed an issue with the align_size calculation in FP8 padding/unpadding modules.

  • [PyTorch] Made miscellaneous fixes and enhancements to the fusible ops (te.sequential) API.

  • [PyTorch] Reduced CPU overhead in several workloads, including the DelayedScaling recipe, MXFP8 MoE, and pipeline parallelism.

  • [PyTorch] Fixed a bug in the multi-tensor Adam kernel that incorrectly downcast an FP32 tensor to BF16.

  • [PyTorch] Fixed an issue with caching FP8 weights when running validation steps between training steps (a sketch follows this list).

  • [PyTorch] Fixed a logical error that could lead to selecting a sub-optimal attention backend when a better-performing backend is available.

  • [PyTorch] Fixed miscellaneous errors during runtime loading of shared libraries by expanding search paths.

  • [PyTorch] Fixed a use-after-free error in cases where quantization and normalization are unfused.

  • [JAX] Fixed a crash in grouped GEMM with CUDA 12.9.1.

  • [JAX] Fixed a build failure with JAX v0.7.0 caused by the removal of jax.extend.ffi.
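
A minimal sketch of the LayerNormLinear usage affected by the shape fix above, assuming the module's return_layernorm_output option; sizes are illustrative.

    import torch
    import transformer_engine.pytorch as te

    layer = te.LayerNormLinear(1024, 4096, return_layernorm_output=True).cuda()

    x = torch.randn(32, 1024, device="cuda")
    out, ln_out = layer(x)

    # With the fix, the returned normalization output matches the input shape.
    assert ln_out.shape == x.shape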
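
And a minimal sketch of the pattern addressed by the FP8 weight-caching fix above, interleaving a no-grad validation step with FP8 training; the loop structure and recipe choice are illustrative.

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling

    model = te.Linear(1024, 1024).cuda()
    recipe = DelayedScaling()

    for step in range(10):
        # Training step under FP8 autocast.
        x = torch.randn(32, 1024, device="cuda")
        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
            loss = model(x).sum()
        loss.backward()

        # Occasional validation between training steps; cached FP8 weights must
        # stay consistent when training resumes.
        if step % 5 == 0:
            with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=recipe):
                model(torch.randn(32, 1024, device="cuda"))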

Known Issues in This Release

There are no known issues in this release.

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.

Miscellaneous

There are no miscellaneous issues in this release.