.. include:: /content/common.rsts

.. |ge| replace:: :html:`≥`

Release Notes |ndash| Release 2.10
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Key Features and Enhancements
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [PyTorch] Added support for the NVFP4 training recipe in the GroupedLinear module (a usage sketch appears at the end of these notes).
- [PyTorch] Added support for CUDA graphs when using quantized weights with Tensor Parallelism.
- [PyTorch] Added support for CUDA graphs when using delay_wgrad_compute.
- [PyTorch] Expanded debug tools to support more statistics.
- [PyTorch] Reduced the overhead of using debug tools.
- [PyTorch] Added support for clamped SwiGLU in the TransformerLayer module.
- [PyTorch] Added backward compatibility with older Megatron-Core versions by introducing a keep_columnwise parameter to cast_master_weights_to_fp8 and related helper functions.
- [PyTorch] Added a reset interface to make_graphed_callables that clears internal CUDA graphs before distributed process group cleanup, preventing hangs.
- [PyTorch] Added support for FSDP2 with quantized weights.
- [PyTorch] Added support for Sliding Window Attention (SWA) with Context Parallelism using the THD input format.
- [PyTorch] Integrated Flash Attention's num_splits parameter into the attention backend.
- [PyTorch] Made various improvements to mitigate CPU overhead, especially for the GroupedLinear module.
- [C][PyTorch] Enabled RoPE (Rotary Position Embedding) application with position offsets during training, removing the previous restriction that start_positions could only be used with cp_size=1 (context parallelism disabled).
- [Jax] Added options to disable Stochastic Rounding, Randomized Hadamard Transform, and 2D weight quantization in the NVFP4 training recipe.
- [Jax] Improved performance by using Transformer Engine quantization when fused normalization or fused activation is disabled.
- [Jax] Improved NVFP4 performance by using TE kernels for scaling-factor swizzling.
- [Jax] Added support for checkpointing quantization operations.
- [Jax] Added support for sink attention.
- [Jax] Added support for concurrent use of Data Parallelism (DP) and Fully-Sharded Data Parallelism (FSDP).

Fixed Issues
@@@@@@@@@@@@

- Fixed an occasional crash when loading the cuDNN library at runtime.
- [C] Fixed an out-of-bounds access in the NVFP4 dequantization kernel.
- [C] Fixed a numerical error in the amax computation in normalization kernels.
- [PyTorch] Fixed a crash in the permute kernel when using Triton v3.5.
- [PyTorch] Fixed a numerical issue when using gradient accumulation fusion with FSDP.
- [PyTorch] Fixed a crash when exporting modules via ONNX when using RMSNorm.
- [Jax] Fixed a partitioning issue for the NVFP4 training recipe with a 1D mesh.
- [Jax] Fixed a bug where the bias parameter could be added twice when using the unfused attention backend.
- [Jax] Fixed a sharding bug in ring attention primitives with packed sequences, where segment position tensors were not sharded to match their corresponding segment ID tensors.
- [PyTorch][Jax] Fixed various logical issues in the attention backend selection process.

Known Issues in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

There are no known issues in this release.

Breaking Changes in This Release
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

- [Jax] Default value for intermediate_dropout changed from 0.1 to 0.0.
- [Jax] Default value for return_layernorm_output changed from True to False.
- [Jax] Default activation changed from ReLU to GeLU.
- [Jax] Default input format for DotProductAttention changed to BSHD.
.. _DeprecatedFeatures:

Deprecated Features
@@@@@@@@@@@@@@@@@@@

There are no deprecated features in this release.
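
The sketch below gives a rough illustration of how the new NVFP4 support for GroupedLinear might be exercised. It is a minimal sketch under stated assumptions, not a definitive usage pattern: the recipe class name (``NVFP4BlockScaling``) and its default arguments are assumptions to be checked against ``transformer_engine.common.recipe`` in this release, the shapes and split sizes are arbitrary, and NVFP4 execution is expected to require supported (Blackwell-class) hardware.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # One GroupedLinear holding 4 independent GEMMs (e.g. 4 experts in an MoE layer).
    grouped = te.GroupedLinear(
        4,      # num_gemms
        1024,   # in_features
        4096,   # out_features
        bias=False,
        params_dtype=torch.bfloat16,
        device="cuda",
    )

    # Rows of `x` are grouped per GEMM; m_splits gives the row count for each one.
    m_splits = [256, 128, 384, 256]
    x = torch.randn(sum(m_splits), 1024, device="cuda", dtype=torch.bfloat16)

    # Assumed recipe class for NVFP4 training; verify the exact name and arguments.
    nvfp4_recipe = recipe.NVFP4BlockScaling()

    with te.fp8_autocast(enabled=True, fp8_recipe=nvfp4_recipe):
        y = grouped(x, m_splits)  # grouped GEMM with quantized inputs and weights

    y.sum().backward()

The same autocast-plus-recipe pattern is used for the other low-precision recipes in Transformer Engine, so switching recipes should, in principle, only require swapping the recipe object passed to the autocast context.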