Release Notes – Release 2.7¶
Key Features and Enhancements¶
[PyTorch] Added support for applying LayerNorm and RMSNorm to key and query tensors.
[PyTorch] Improved performance for the FP8 per-tensor current scaling recipe by fusing amax computation into the activation kernel (see the sketch after this list).
[PyTorch] Added support for multi-tensor swizzle kernels for MXFP8 grouped GEMMs.
[PyTorch] Fused zero-padding and swizzle operation for MXFP8 scale inverses for improved performance.
[PyTorch] Expanded the debug API using nvdlfw-inspect to log more advanced tensor statistics.
[PyTorch] Reduced the number of calls to the CUDA driver in the core library for improved performance.
[Jax] Added new checkpointing policies that allow users to switch to Transformer Engine GEMMs seamlessly without unnecessary recomputations.
[Core] Added support for the cublasMP backend for overlapping tensor parallel communication and GEMM.
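
For the per-tensor current scaling item above, a minimal sketch of enabling the recipe in PyTorch follows. It assumes a GPU with FP8 support and uses the Float8CurrentScaling recipe class; layer sizes and inputs are illustrative only.

```python
# Minimal sketch: run a Transformer Engine linear layer under the
# FP8 per-tensor current scaling recipe. Shapes are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling

recipe = Float8CurrentScaling()                # per-tensor current scaling
layer = te.Linear(1024, 1024, bias=True).cuda()
inp = torch.randn(32, 1024, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = layer(inp)                           # amax computed on the fly
out.sum().backward()
```

With current scaling, the scaling factor is derived from the amax of the tensor being quantized in the same step, which is why fusing the amax computation into the activation kernel removes a separate pass over the data.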
Fixed Issues¶
[PyTorch] Fixed a potential illegal memory access when using userbuffers for TP overlap.
[PyTorch] Fixed the logic for choosing the correct attention backend depending on the cuDNN version.
[PyTorch] Fixed a crash when using CUDA graphs by disabling garbage collection during capture.
[PyTorch] Fixed a bug when using double buffering for CPU offloading.
[PyTorch] Fixed a bug when overlapping gradient reduction and fusing weight gradient accumulation simultaneously.
[PyTorch] Made miscellaneous improvements and fixes to the Transformer Engine sequential API, including expanding supported operations such as dropout, constant scale, etc.
[PyTorch] Fixed a bug in the make_graphed_callables function when used on multiple modules with different input requirements (see the sketch after this list).
[PyTorch] Fixed a crash in the permute operation when running with the FP8 datatype for input sizes requiring padding.
[PyTorch] Fixed a bug when using the Triton cross entropy kernel with CUDA graphs.
[PyTorch] Fixed a bug when exporting an MXFP8 model to ONNX.
[PyTorch/Core] Disabled cuDNN attention backend for cuDNN v9.12 onwards on Blackwell if the user requests a deterministic configuration.
[Core] Fixed an integer overflow in quantization kernels when computing offsets for large tensors.
[Jax] Fixed partition rules for GEMM to handle sequence parallelism correctly.
[Jax] Fixed sharding specifications for Transformer Engine GEMM custom call operands when using data parallelism.
[Jax] Fixed a crash when using GroupedQuantizeFFI with CUDA graphs.
[Jax] Fixed the fused_attn sharding constraint so that it can be used under the JAX shard_map.
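
For context on the make_graphed_callables fix above, a minimal usage sketch follows. The module choices, shapes, and arguments are illustrative rather than taken from the release; the point is that each callable gets its own sample inputs, which is the multi-module scenario the fix addresses.

```python
# Minimal sketch: capture two different TE modules with CUDA graphs.
# Each module receives its own sample args, so the callables can have
# different input requirements.
import torch
import transformer_engine.pytorch as te

mlp = te.LayerNormMLP(1024, 4096).cuda()
proj = te.Linear(1024, 1024).cuda()

sample_args = (
    (torch.randn(8, 32, 1024, device="cuda"),),   # sample inputs for mlp
    (torch.randn(8, 32, 1024, device="cuda"),),   # sample inputs for proj
)

graphed_mlp, graphed_proj = te.make_graphed_callables((mlp, proj), sample_args)

x = torch.randn(8, 32, 1024, device="cuda")
y = graphed_proj(graphed_mlp(x))
```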
Known Issues in This Release¶
There are no known issues in this release.
Breaking Changes in This Release¶
The deprecated device_id argument for multi-tensor C APIs has been removed.
Deprecated Features¶
There are no deprecated features in this release.
Miscellaneous¶
There are no miscellaneous issues in this release.