.. include:: /content/common.rsts

.. |ge| replace:: :html:`≥`

Release Notes |ndash| Release 2.7
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [PyTorch] Added support for applying ``LayerNorm`` and ``RMSNorm`` to key and query tensors.
- [PyTorch] Improved performance of the FP8 per-tensor current scaling recipe by fusing the amax computation into the activation kernel (a usage sketch appears at the end of these notes).
- [PyTorch] Added support for multi-tensor swizzle kernels for MXFP8 grouped GEMMs.
- [PyTorch] Fused the zero-padding and swizzle operations for MXFP8 scale inverses for improved performance.
- [PyTorch] Expanded the debug API based on ``nvdlfw-inspect`` to log more advanced tensor statistics.
- [PyTorch] Reduced the number of CUDA driver calls in the core library for improved performance.
- [Jax] Added new checkpointing policies that allow users to switch to Transformer Engine GEMMs seamlessly, without unnecessary recomputation.
- [Core] Added support for the ``cublasMP`` backend for overlapping tensor-parallel communication and GEMM.

Fixed Issues
@@@@@@@@@@@@

- [PyTorch] Fixed a potential illegal memory access when using userbuffers for TP overlap.
- [PyTorch] Fixed the logic for choosing the correct attention backend based on the cuDNN version.
- [PyTorch] Fixed a crash when using CUDA graphs by disabling garbage collection during capture.
- [PyTorch] Fixed a bug when using double buffering for CPU offloading.
- [PyTorch] Fixed a bug when overlapping gradient reduction with fused weight gradient accumulation.
- [PyTorch] Made miscellaneous improvements and fixes to the Transformer Engine sequential API, including support for additional operations such as dropout and constant scaling.
- [PyTorch] Fixed a bug in the ``make_graphed_callables`` function when used on multiple modules with different input requirements (see the sketch at the end of these notes).
- [PyTorch] Fixed a crash in the permute operation when running with the FP8 datatype for input sizes that require padding.
- [PyTorch] Fixed a bug when using the Triton cross-entropy kernel with CUDA graphs.
- [PyTorch] Fixed a bug when exporting an MXFP8 model to ONNX.
- [PyTorch/Core] Disabled the cuDNN attention backend for cuDNN v9.12 onwards on Blackwell when the user requests a deterministic configuration.
- [Core] Fixed an integer overflow in quantization kernels when computing offsets for large tensors.
- [Jax] Fixed partition rules for GEMM to handle sequence parallelism correctly.
- [Jax] Fixed sharding specifications for Transformer Engine GEMM custom call operands when using data parallelism.
- [Jax] Fixed a crash when using ``GroupedQuantizeFFI`` with CUDA graphs.
- [Jax] Fixed the ``fused_attn`` sharding constraint so that it can be used under JAX ``shard_map``.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no known issues in this release.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

* The deprecated *device_id* argument for multi-tensor C APIs has been removed.

Deprecated Features
@@@@@@@@@@@@@@@@@@@

There are no deprecated features in this release.

Miscellaneous
@@@@@@@@@@@@@

There are no miscellaneous issues in this release.
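
Usage Sketches
@@@@@@@@@@@@@@

For reference, here is a minimal sketch of enabling the FP8 per-tensor current scaling recipe mentioned in the key features above. It is illustrative only: it assumes the ``Float8CurrentScaling`` recipe class and the standard ``fp8_autocast`` context manager from recent Transformer Engine releases, and the module choice and tensor shapes are arbitrary.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # Current scaling computes the FP8 scale from the amax of the current
    # tensor, rather than from an amax history as in delayed scaling.
    fp8_recipe = recipe.Float8CurrentScaling()

    # LayerNormMLP applies an activation (GELU by default); with this recipe,
    # the amax computation can be fused into the activation kernel.
    model = te.LayerNormMLP(hidden_size=1024, ffn_hidden_size=4096).cuda()
    inp = torch.randn(2048, 1024, device="cuda", requires_grad=True)

    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(inp)
    out.sum().backward()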
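
Likewise, a sketch of the ``make_graphed_callables`` scenario covered by the fix above: graphing multiple modules with different input requirements. The call is assumed to mirror ``torch.cuda.make_graphed_callables`` (a tuple of callables with one tuple of sample arguments each); the module choices and shapes are assumptions made for illustration.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te

    # Two modules whose inputs differ in shape.
    mlp = te.LayerNormMLP(hidden_size=1024, ffn_hidden_size=4096).cuda()
    proj = te.Linear(2048, 1024).cuda()

    # One tuple of sample arguments per callable; these are used to warm up
    # and capture the CUDA graphs.
    sample_args = (
        (torch.randn(512, 1024, device="cuda", requires_grad=True),),
        (torch.randn(512, 2048, device="cuda", requires_grad=True),),
    )

    graphed_mlp, graphed_proj = te.make_graphed_callables((mlp, proj), sample_args)

    # Graphed callables are used like the original modules.
    out = graphed_proj(torch.randn(512, 2048, device="cuda", requires_grad=True))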