.. include:: /content/common.rsts

Release Notes |ndash| Release 1.9
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [PyTorch] Added support for sliding window attention in the cuDNN backend.
- [PyTorch] Added an experimental `torch.nn.Sequential` style API for automatic operation-based fusions.
- [C/PyTorch] Added support for bottom-right aligned diagonal causal mask.
- [C/PyTorch] Added support for grouped GEMM for MoE training.
- [JAX] Added support for THD attention format.
- [PaddlePaddle] Added support for CUDA graphs.
- [PaddlePaddle] Added support for PaddlePaddle versions >= 2.6.1.

Fixed Issues
@@@@@@@@@@@@

- [PyTorch] Fixed incorrect outputs when handling non-contiguous input tensors.
- [PyTorch] Fixed a hang in the `initialize_ub` function during multi-node runs, along with miscellaneous improvements in communication-GEMM overlap with userbuffers.
- [PyTorch] Fixed convergence issues when using CPU offloading.
- [PyTorch] Fixed a crash in MoE that occurred when an expert received 0 tokens.
- [JAX] Fixed a crash in newer JAX versions which restricted the output format of HLO lowering.
- [PaddlePaddle] Fixed a crash when using the standalone column parallel linear API.
- Fixed a numerical bug in the QGeLU activation.
- Fixed a compilation bug in the core library with CUDA 12.1.
- Fixed a bug in the selection of tuned RMSNorm kernels.
- Fixed performance overheads by reducing the number of calls to the CUDA driver.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no known issues in this release.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no breaking changes in this release.

Deprecated Features
@@@@@@@@@@@@@@@@@@@

There are no deprecated features in this release.