.. include:: /content/common.rsts

.. |ge| replace:: :html:`≥`

Release Notes |ndash| Release 2.9
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [PyTorch][Jax] Introduced recipe-agnostic functions and APIs to generalize to non-FP8 recipes. See :ref:`Deprecated Features <DeprecatedFeatures>`, below, for a comprehensive list of affected APIs.
- [C][PyTorch][Jax] Added support for the clamped SwiGLU activation function.
- [C] Added support for precompiled wheels for CUDA 13 via PyPI.
- [PyTorch] Added support for custom training recipes in the ``autocast`` context (see the sketch after this list). Transformer Engine quantizer and quantized tensor classes, as well as storage data classes, are now part of the public API.
- [PyTorch] Added CPU offload support for all attention layouts.
- [PyTorch] Added support for the FP8 block scaling recipe (as used in the `DeepSeek v3 Technical Report`_) on the NVIDIA Blackwell architecture (the SM100 family).
- [PyTorch] Added support for gradient accumulation fusion when using FSDP.
- [PyTorch] Added support for CPU offloading when using ``GroupedLinear`` with the distributed optimizer.
- [PyTorch] Exposed the following utility functions as part of the public API: ``is_fp8_available``, ``is_mxfp8_available``, ``is_fp8_block_scaling_available``, ``is_nvfp4_available``, ``is_bf16_available``, ``get_cudnn_version``, ``get_device_compute_capability``, and ``get_default_recipe``.
- [PyTorch] Added ``max_logit`` support for the MuonClip optimizer.
- [PyTorch][Jax] Improved the logic for selecting the attention backend, addressing various unsupported cases and preventing errors.
- [Jax] Added support for the NVFP4 training recipe.
- [Jax] Improved the performance of the current scaling recipes by enabling fused amax calculation in normalization and activation kernels.
- [Jax] Added support for bottom-right causal mask for THD attention.
- Improved documentation and tutorials for the NVFP4 recipe.
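The recipe-agnostic ``autocast`` context and the newly public utility functions can be combined as in the sketch below. This is an illustrative example rather than an excerpt from the release: the ``transformer_engine.pytorch`` module path, the ``enabled`` keyword, and the helpers' return types are assumptions, while the argument names follow the renames described in :ref:`Deprecated Features <DeprecatedFeatures>` below.

.. code-block:: python

   # Minimal sketch of the recipe-agnostic autocast and the newly public helpers.
   # Module paths, return types, and the `enabled` keyword are assumptions based
   # on the notes above, not excerpts from the release.
   import torch
   import transformer_engine.pytorch as te
   from transformer_engine.common.recipe import DelayedScaling, Format

   print("cuDNN version:", te.get_cudnn_version())
   print("compute capability:", te.get_device_compute_capability())

   # Pick a recipe explicitly, or fall back to the library default
   # (is_fp8_available is assumed to return a bool here).
   if te.is_fp8_available():
       recipe = DelayedScaling(fp8_format=Format.HYBRID)
   else:
       recipe = te.get_default_recipe()

   layer = te.Linear(1024, 1024).cuda()
   inp = torch.randn(32, 1024, device="cuda")

   # `autocast` replaces `fp8_autocast`; `recipe` replaces `fp8_recipe`.
   with te.autocast(enabled=True, recipe=recipe):
       out = layer(inp)
   out.sum().backward()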
Fixed Issues
@@@@@@@@@@@@

- [Jax] Fixed a crash that occurred when using Context Parallelism with ring attention.
- [Jax] Fixed an issue with incorrect sharding when ``get_all_mesh_axes`` is used.
- [Jax] Fixed a numerical issue when using bias with Tensor Parallelism.
- [PyTorch] Fixed an integer overflow issue in the Triton permute kernel.
- [PyTorch] Fixed the known issue from release 2.8 that resulted in degraded performance for the FP8 current scaling recipe.
- Fixed a build issue that occurred when cuDNN was installed in a custom location or a Python virtual environment.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [C][PyTorch] The cuDNN attention backend produces NaNs in the forward pass for cases that use a non-causal mask with cuDNN 9.13 and cuDNN 9.14. As a workaround, set the environment variable ``NVTE_FUSED_ATTN`` to 0 when using this configuration.
- [C][PyTorch] The backward pass of cuDNN attention is incompatible with CUDA graphs for BSHD inputs when the sequence (S) dimension is not divisible by 128 and a non-padding mask is used. As a workaround, set the environment variable ``NVTE_FUSED_ATTN`` to 0 when using this configuration.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no breaking changes in this release.

.. _DeprecatedFeatures:

Deprecated Features
@@@@@@@@@@@@@@@@@@@

- [PyTorch] The function ``fp8_autocast`` is deprecated in favor of ``autocast``. The new ``autocast`` function uses the arguments *recipe* and *amax_reduction_group* instead of *fp8_recipe* and *fp8_group*, respectively. A migration sketch appears at the end of these notes.
- [PyTorch] The function ``fp8_model_init`` is deprecated in favor of ``quantized_model_init``.
- [PyTorch] The arguments *fp8_enabled*, *fp8_calibrating*, *fp8_recipe*, *fp8_group*, and *fp8_weight_caching* in the function ``make_graphed_callables`` are deprecated in favor of *enabled*, *calibrating*, *recipe*, *amax_reduction_group*, and *cache_quantized_params*, respectively.
- [Jax] The function ``fp8_autocast`` is deprecated in favor of ``autocast``.

Miscellaneous
@@@@@@@@@@@@@

There are no miscellaneous issues in this release.
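For convenience, the sketch below restates the PyTorch argument renames from the :ref:`Deprecated Features <DeprecatedFeatures>` section as old-versus-new spellings. It is illustrative only: the ``transformer_engine.pytorch`` import path, the ``enabled`` keyword, and the positional arguments to ``make_graphed_callables`` are assumptions rather than excerpts from the release.

.. code-block:: python

   # Migration sketch for the deprecated PyTorch entry points (assumed API paths).
   import torch
   import transformer_engine.pytorch as te
   from transformer_engine.common.recipe import DelayedScaling, Format

   recipe = DelayedScaling(fp8_format=Format.HYBRID)
   model = te.Linear(256, 256).cuda()
   inp = torch.randn(8, 256, device="cuda")

   # Deprecated: with te.fp8_autocast(enabled=True, fp8_recipe=recipe): ...
   with te.autocast(enabled=True, recipe=recipe):
       out = model(inp)

   # Deprecated: te.make_graphed_callables(model, (inp,), fp8_enabled=True, fp8_recipe=recipe)
   graphed = te.make_graphed_callables(model, (inp,), enabled=True, recipe=recipe)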