Release Notes – Release 2.6¶
Key Features and Enhancements¶
[PyTorch] Added support for gradient accumulation fusion when using FSDP from megatron-core.
[PyTorch] Optimized memory usage when using NVIDIA® CUDA® graphs with TE via the make_graphed_callables function (a usage sketch follows this list).
[PyTorch] Optimized performance of permute fusion kernels for MoE.
[PyTorch] Added support for ONNX export of Transformer Engine modules.
[PyTorch] Added a save_original_input option to the Linear and GroupedLinear modules to decouple row-wise (forward) and column-wise (backward) quantization. This option saves memory for certain workloads and training recipes (a usage sketch follows this list).
[PyTorch] Improved performance of MXFP8 quantization kernels.
[Core] Improved performance of KV caching kernels.
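The following is a minimal sketch of capturing a TE module in a CUDA graph with make_graphed_callables, as mentioned in the CUDA graphs item above. The module, tensor shapes, and device handling are illustrative assumptions, and keyword arguments may differ between TE versions.

    # Minimal sketch (not from the release notes): capturing a TE module in a
    # CUDA graph via make_graphed_callables. Layer sizes and sample shapes are
    # illustrative; additional keyword arguments may vary between TE versions.
    import torch
    import transformer_engine.pytorch as te

    module = te.Linear(1024, 1024).cuda()
    sample_args = (torch.randn(32, 1024, device="cuda"),)

    # Warm up and capture the module's execution into a CUDA graph, then use
    # the returned callable in place of the original module.
    graphed_module = te.make_graphed_callables(module, sample_args)
    out = graphed_module(torch.randn(32, 1024, device="cuda"))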
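Below is a hypothetical sketch of the save_original_input option. The flag name is taken from the release note above; passing it as a constructor keyword argument of Linear is an assumption.

    # Hypothetical sketch of the save_original_input option described above.
    # Treating it as a keyword argument of te.Linear / te.GroupedLinear is an
    # assumption based on the release note.
    import transformer_engine.pytorch as te

    # Saving the original (unquantized) input lets row-wise quantization for the
    # forward pass and column-wise quantization for the backward pass happen
    # independently, which can lower memory use for some recipes.
    linear = te.Linear(4096, 4096, save_original_input=True)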
Fixed Issues¶
[PyTorch] Fixed an issue in the LayerNormLinear module where the returned normalization output had a different shape than the input tensor.
[PyTorch] Fixed an issue with the align_size calculation in FP8 padding/unpadding modules.
[PyTorch] Made miscellaneous fixes and enhancements to the fusible ops (te.sequential) API.
[PyTorch] Reduced CPU overhead in several workloads, including the DelayedScaling recipe, MXFP8 MoE, and pipeline parallelism.
[PyTorch] Fixed a bug in the multi-tensor Adam kernel that incorrectly downcast an FP32 tensor to BF16.
[PyTorch] Fixed an issue with caching FP8 weights when running validation steps between training steps.
[PyTorch] Fixed a logical error that could lead to selecting a sub-optimal attention backend when a better-performing backend is available.
[PyTorch] Fixed miscellaneous errors during runtime loading of shared libraries by expanding search paths.
[PyTorch] Fixed a use-after-free error in cases where quantization and normalization are unfused.
[Jax] Fixed a crash with grouped GEMM in CUDA version ≥ 12.9.1.
[Jax] Fixed a build failure with JAX v0.7.0 caused by the removal of jax.extend.ffi.
Known Issues in This Release¶
There are no known issues in this release.
Breaking Changes in This Release¶
There are no breaking changes in this release.
Deprecated Features¶
There are no deprecated features in this release.
Miscellaneous¶
There are no miscellaneous issues in this release.