.. include:: /content/common.rsts

.. |ge| replace:: :html:`≥`

Release Notes |ndash| Release 2.7
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [PyTorch] Added support for applying ``LayerNorm`` and ``RMSNorm`` to key and query tensors.
- [PyTorch] Improved performance of the FP8 per-tensor current scaling recipe by fusing the amax computation into the activation kernel (a usage sketch appears at the end of these notes).
- [PyTorch] Added support for multi-tensor swizzle kernels for MXFP8 grouped GEMMs.
- [PyTorch] Fused the zero-padding and swizzle operations for MXFP8 scale inverses for improved performance.
- [PyTorch] Expanded the debug API based on ``nvdlfw-inspect`` to log more advanced tensor statistics.
- [PyTorch] Reduced the number of CUDA driver calls in the core library for improved performance.
- [Jax] Added new checkpointing policies that allow users to switch to Transformer Engine GEMMs seamlessly, without unnecessary recomputation.
- [Core] Added support for the ``cublasMP`` backend for overlapping tensor-parallel communication and GEMM.

Fixed Issues
@@@@@@@@@@@@

- [PyTorch] Fixed a potential illegal memory access when using userbuffers for TP overlap.
- [PyTorch] Fixed the logic for choosing the correct attention backend based on the cuDNN version.
- [PyTorch] Fixed a crash when using CUDA graphs by disabling garbage collection during capture.
- [PyTorch] Fixed a bug when using double buffering for CPU offloading.
- [PyTorch] Fixed a bug when overlapping gradient reduction with fused weight gradient accumulation.
- [PyTorch] Made miscellaneous improvements and fixes to the Transformer Engine sequential API, including support for additional operations such as dropout and constant scaling.
- [PyTorch] Fixed a bug in the ``make_graphed_callables`` function when used on multiple modules with different input requirements (see the sketch at the end of these notes).
- [PyTorch] Fixed a crash in the permute operation when running with the FP8 datatype for input sizes that require padding.
- [PyTorch] Fixed a bug when using the Triton cross-entropy kernel with CUDA graphs.
- [PyTorch] Fixed a bug when exporting an MXFP8 model to ONNX.
- [PyTorch/Core] Disabled the cuDNN attention backend for cuDNN v9.12 onwards on Blackwell when the user requests a deterministic configuration.
- [Core] Fixed an integer overflow in quantization kernels when computing offsets for large tensors.
- [Jax] Fixed partition rules for GEMM to handle sequence parallelism correctly.
- [Jax] Fixed sharding specifications for Transformer Engine GEMM custom call operands when using data parallelism.
- [Jax] Fixed a crash when using ``GroupedQuantizeFFI`` with CUDA graphs.
- [Jax] Fixed the ``fused_attn`` sharding constraint so that it can be used under JAX ``shard_map``.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no known issues in this release.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

* The deprecated *device_id* argument for multi-tensor C APIs has been removed.

Deprecated Features
@@@@@@@@@@@@@@@@@@@

There are no deprecated features in this release.

Miscellaneous
@@@@@@@@@@@@@

There are no miscellaneous issues in this release.
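
Usage Sketches
@@@@@@@@@@@@@@

For reference, here is a minimal sketch of enabling the FP8 per-tensor current scaling recipe mentioned in the key features above. It is illustrative only: it assumes the ``Float8CurrentScaling`` recipe class and the standard ``fp8_autocast`` context manager from recent Transformer Engine releases, and the module choice and tensor shapes are arbitrary.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # Current scaling computes the FP8 scale from the amax of the current
    # tensor, rather than from an amax history as in delayed scaling.
    fp8_recipe = recipe.Float8CurrentScaling()

    # LayerNormMLP applies an activation (GELU by default); with this recipe,
    # the amax computation can be fused into the activation kernel.
    model = te.LayerNormMLP(hidden_size=1024, ffn_hidden_size=4096).cuda()
    inp = torch.randn(2048, 1024, device="cuda", requires_grad=True)

    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(inp)
    out.sum().backward()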
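
Likewise, a sketch of the ``make_graphed_callables`` scenario covered by the fix above: graphing multiple modules with different input requirements. The call is assumed to mirror ``torch.cuda.make_graphed_callables`` (a tuple of callables with one tuple of sample arguments each); the module choices and shapes are assumptions made for illustration.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te

    # Two modules whose inputs differ in shape.
    mlp = te.LayerNormMLP(hidden_size=1024, ffn_hidden_size=4096).cuda()
    proj = te.Linear(2048, 1024).cuda()

    # One tuple of sample arguments per callable; these are used to warm up
    # and capture the CUDA graphs.
    sample_args = (
        (torch.randn(512, 1024, device="cuda", requires_grad=True),),
        (torch.randn(512, 2048, device="cuda", requires_grad=True),),
    )

    graphed_mlp, graphed_proj = te.make_graphed_callables((mlp, proj), sample_args)

    # Graphed callables are used like the original modules.
    out = graphed_proj(torch.randn(512, 2048, device="cuda", requires_grad=True))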