.. include:: /content/common.rsts

.. |ge| replace:: :html:`≥`

Release Notes |ndash| Release 2.10
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [PyTorch] Added support for the NVFP4 training recipe in the GroupedLinear module (a usage sketch appears at the end of these notes).
- [PyTorch] Added support for CUDA graphs when using quantized weights with Tensor Parallelism.
- [PyTorch] Added support for CUDA graphs when using delay_wgrad_compute.
- [PyTorch] Expanded debug tools to support more statistics.
- [PyTorch] Reduced the overhead of using debug tools.
- [PyTorch] Added support for clamped SwiGLU in the TransformerLayer module.
- [PyTorch] Added backward compatibility with older Megatron-Core versions by introducing a keep_columnwise parameter to cast_master_weights_to_fp8 and related helper functions.
- [PyTorch] Added a reset interface to make_graphed_callables that clears internal CUDA graphs before distributed process group cleanup, preventing hangs.
- [PyTorch] Added support for FSDP2 with quantized weights.
- [PyTorch] Added support for Sliding Window Attention (SWA) with Context Parallelism using the THD input format.
- [PyTorch] Integrated Flash Attention's num_splits parameter into the attention backend.
- [PyTorch] Made various improvements to mitigate CPU overhead, especially for the GroupedLinear module.
- [C][PyTorch] Enabled RoPE (Rotary Position Embedding) application with position offsets during training, removing the previous restriction that start_positions could only be used with cp_size=1 (context parallelism disabled).
- [Jax] Added options to disable Stochastic Rounding, Randomized Hadamard Transform, and 2D weight quantization in the NVFP4 training recipe.
- [Jax] Improved performance by using Transformer Engine quantization when fused normalization or fused activation is disabled.
- [Jax] Improved NVFP4 performance by using TE kernels for scaling-factor swizzling.
- [Jax] Added support for checkpointing quantization operations.
- [Jax] Added support for sink attention.
- [Jax] Added support for concurrent use of Data Parallelism (DP) and Fully-Sharded Data Parallelism (FSDP).

Fixed Issues
@@@@@@@@@@@@

- Fixed an occasional crash when loading the cuDNN library at runtime.
- [C] Fixed an out-of-bounds access in the NVFP4 dequantization kernel.
- [C] Fixed a numerical error in the amax computation in normalization kernels.
- [PyTorch] Fixed a crash in the permute kernel when using Triton v3.5.
- [PyTorch] Fixed a numerical issue when using gradient accumulation fusion with FSDP.
- [PyTorch] Fixed a crash when exporting modules via ONNX when using RMSNorm.
- [Jax] Fixed a partitioning issue for the NVFP4 training recipe with a 1D mesh.
- [Jax] Fixed a bug where the bias parameter could be added twice when using the unfused attention backend.
- [Jax] Fixed a sharding bug in ring attention primitives with packed sequences, where segment position tensors were not sharded to match their corresponding segment ID tensors.
- [PyTorch][Jax] Fixed various logical issues in the attention backend selection process.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no known issues in this release.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [Jax] Default value for intermediate_dropout changed from 0.1 to 0.0.
- [Jax] Default value for return_layernorm_output changed from True to False.
- [Jax] Default activation changed from ReLU to GeLU.
- [Jax] Default input format for DotProductAttention changed to BSHD.
.. _DeprecatedFeatures:

Deprecated Features
@@@@@@@@@@@@@@@@@@@

There are no deprecated features in this release.
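
The sketch below gives a rough illustration of how the new NVFP4 support for GroupedLinear might be exercised. It is a minimal sketch under stated assumptions, not a definitive usage pattern: the recipe class name (``NVFP4BlockScaling``) and its default arguments are assumptions to be checked against ``transformer_engine.common.recipe`` in this release, the shapes and split sizes are arbitrary, and NVFP4 execution is expected to require supported (Blackwell-class) hardware.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # One GroupedLinear holding 4 independent GEMMs (e.g. 4 experts in an MoE layer).
    grouped = te.GroupedLinear(
        4,      # num_gemms
        1024,   # in_features
        4096,   # out_features
        bias=False,
        params_dtype=torch.bfloat16,
        device="cuda",
    )

    # Rows of `x` are grouped per GEMM; m_splits gives the row count for each one.
    m_splits = [256, 128, 384, 256]
    x = torch.randn(sum(m_splits), 1024, device="cuda", dtype=torch.bfloat16)

    # Assumed recipe class for NVFP4 training; verify the exact name and arguments.
    nvfp4_recipe = recipe.NVFP4BlockScaling()

    with te.fp8_autocast(enabled=True, fp8_recipe=nvfp4_recipe):
        y = grouped(x, m_splits)  # grouped GEMM with quantized inputs and weights

    y.sum().backward()

The same autocast-plus-recipe pattern is used for the other low-precision recipes in Transformer Engine, so switching recipes should, in principle, only require swapping the recipe object passed to the autocast context.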