Release Notes – Release 2.9¶
Key Features and Enhancements¶
[PyTorch][Jax] Introduced recipe-agnostic functions and APIs that generalize to non-FP8 recipes. See Deprecated Features below for a comprehensive list of affected APIs.
[C][PyTorch][Jax] Added support for the clamped SwiGLU activation function.
[C] Added support for precompiled wheels for CUDA 13 via PyPI.
[PyTorch] Added support for custom training recipes in the autocast context (see the first sketch after this list). Transformer Engine quantizer and quantized tensor classes, as well as storage data classes, are now part of the public API.
[PyTorch] Added CPU offload support for all attention layouts.
[PyTorch] Added support for the FP8 block scaling recipe (as used in the DeepSeek v3 Technical Report) on NVIDIA Blackwell architecture (the SM100 family).
[PyTorch] Added support for gradient accumulation fusion when using FSDP.
[PyTorch] Added support for CPU offloading when using GroupedLinear with distributed optimizer.
[PyTorch] Exposed the following utility functions as part of the public API: is_fp8_available, is_mxfp8_available, is_fp8_block_scaling_available, is_nvfp4_available, is_bf16_available, get_cudnn_version, get_device_compute_capability, and get_default_recipe (see the second sketch after this list).
[PyTorch] Added max_logit support for the MuonClip optimizer.
[PyTorch][Jax] Improved the logic for selecting the attention backend, addressing various unsupported cases and preventing errors.
[Jax] Added support for the NVFP4 training recipe.
[Jax] Improved the performance of the current scaling recipes by enabling fused amax calculation in normalization and activation kernels.
[Jax] Added support for the bottom-right causal mask for THD attention.
Improved the documentation and tutorials for the NVFP4 recipe.
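The recipe-aware autocast context can be exercised roughly as follows. This is a minimal sketch, not the canonical API: the import paths, the DelayedScaling recipe used as a placeholder, and the exact autocast signature (in particular the enabled argument) are assumptions inferred from the argument names listed under Deprecated Features below.

```python
# Minimal sketch of the recipe-aware autocast context. Import paths, the
# placeholder DelayedScaling recipe, and the exact autocast signature
# (notably `enabled`) are assumptions, not confirmed API.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Any supported recipe could be passed here; DelayedScaling is only a placeholder.
my_recipe = recipe.DelayedScaling()

layer = te.Linear(1024, 1024, bias=True, device="cuda")
x = torch.randn(32, 1024, device="cuda")

# `recipe=` replaces the deprecated `fp8_recipe=` argument of fp8_autocast.
with te.autocast(enabled=True, recipe=my_recipe):
    y = layer(x)
y.sum().backward()
```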
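The newly public utility functions can be used to inspect what the current device and software stack support before choosing a recipe. The sketch below assumes they are importable directly from transformer_engine.pytorch and treats their return values opaquely, since only the function names are given above.

```python
# Sketch of querying the newly public utility functions. The import location
# and the return types are assumptions; only the names come from this release.
import transformer_engine.pytorch as te

print("cuDNN version:            ", te.get_cudnn_version())
print("Device compute capability:", te.get_device_compute_capability())
print("FP8 available:            ", te.is_fp8_available())
print("MXFP8 available:          ", te.is_mxfp8_available())
print("FP8 block scaling:        ", te.is_fp8_block_scaling_available())
print("NVFP4 available:          ", te.is_nvfp4_available())
print("BF16 available:           ", te.is_bf16_available())
print("Default recipe:           ", te.get_default_recipe())
```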
Fixed Issues¶
[Jax] Fixed a crash that occurred when using Context Parallelism with ring attention.
[Jax] Fixed an issue with incorrect sharding when get_all_mesh_axes is used.
[Jax] Fixed a numerical issue when using bias with Tensor Parallelism.
[PyTorch] Fixed an integer overflow issue in the Triton permute kernel.
[PyTorch] Fixed the known issue from release v2.8 that resulted in worse performance for the FP8 current scaling recipe.
Fixed a build issue that occurred when cuDNN was installed into a custom location or a Python virtual environment.
Known Issues in This Release¶
[C][PyTorch] The cuDNN attention backend produces NaNs in the forward pass for cases that use a non-causal mask with cuDNN 9.13 and cuDNN 9.14. As a workaround, set the environment variable NVTE_FUSED_ATTN to 0 when using this configuration (see the sketch after this list).
[C][PyTorch] The backward pass of cuDNN attention is incompatible with CUDA graphs for BSHD inputs when the sequence (S) dimension is not divisible by 128 and a non-padding mask is used. As a workaround, set the environment variable NVTE_FUSED_ATTN to 0 when using this configuration.
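Both workarounds above amount to disabling the fused (cuDNN) attention backend. A minimal sketch follows; setting the variable before Transformer Engine selects its attention backend is assumed to be sufficient, and exporting it in the shell (export NVTE_FUSED_ATTN=0) works equally well.

```python
# Workaround sketch: disable the cuDNN (fused) attention backend.
# Assumption: the variable must be set before Transformer Engine picks a
# backend, so it is set before the import. `export NVTE_FUSED_ATTN=0` in the
# shell is an equivalent alternative.
import os

os.environ["NVTE_FUSED_ATTN"] = "0"

import transformer_engine.pytorch as te  # imported after the variable is set
```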
Breaking Changes in This Release¶
There are no breaking changes in this release.
Deprecated Features¶
[PyTorch] The function fp8_autocast is deprecated in favor of autocast. The new autocast function uses the arguments recipe and amax_reduction_group instead of fp8_recipe and fp8_group, respectively (see the migration sketch after this list).
[PyTorch] The function fp8_model_init is deprecated in favor of quantized_model_init.
[PyTorch] The arguments fp8_enabled, fp8_calibrating, fp8_recipe, fp8_group, and fp8_weight_caching in the function make_graphed_callables are deprecated in favor of enabled, calibrating, recipe, amax_reduction_group, and cache_quantized_params, respectively.
[Jax] The function fp8_autocast is deprecated in favor of autocast.
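A migration sketch for the PyTorch renames is shown below. It is based only on the old and new names listed in this section; the exact signatures of autocast and quantized_model_init, the import paths, and the placeholder DelayedScaling recipe are assumptions.

```python
# Migration sketch (PyTorch), based only on the renames listed above.
# Exact signatures, import paths, and the placeholder recipe are assumptions.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

my_recipe = recipe.DelayedScaling()  # placeholder; any supported recipe
x = torch.randn(16, 1024, device="cuda")

# Old (deprecated):
#   with te.fp8_model_init(enabled=True):
#       model = te.Linear(1024, 1024, device="cuda")
#   with te.fp8_autocast(enabled=True, fp8_recipe=my_recipe, fp8_group=None):
#       out = model(x)

# New:
with te.quantized_model_init():
    model = te.Linear(1024, 1024, device="cuda")

with te.autocast(enabled=True, recipe=my_recipe, amax_reduction_group=None):
    out = model(x)

# make_graphed_callables argument renames (old -> new):
#   fp8_enabled -> enabled, fp8_calibrating -> calibrating, fp8_recipe -> recipe,
#   fp8_group -> amax_reduction_group, fp8_weight_caching -> cache_quantized_params
```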
Miscellaneous¶
There are no miscellaneous issues in this release.