Transformer Engine v2.13 Release Notes

Key Features and Enhancements

  • Added detailed documentation for low-precision training with Transformer Engine, covering FP8, MXFP8, NVFP4, and other quantization recipes, with examples for both PyTorch and JAX. (#2343)

  • [Build] Added the NVTE_BUILD_USE_NVIDIA_WHEELS environment variable to allow building TE using CUDA headers from PyPI NVIDIA wheels instead of a system CUDA installation. (#2623)

  • [C] Enabled deterministic FP8 fused attention on Blackwell (SM100) GPUs. (#2621)

  • [C] Updated cuBLASMp integration to version 0.8.0, replacing the NVSHMEM dependency with NCCL-based symmetric memory. (#2661)

  • [C] Added MXFP8 quantization kernels for grouped tensors used in MoE, with fused scale-factor swizzling for improved performance. (#2586, #2630)

  • [C] Added NVFP4 quantization kernels for grouped tensors used in MoE models. (#2655)

  • [C] Reduced cuDNN graph recompilations in THD fused attention by rounding large batch sizes to 512-element increments. (#2653)
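
    The rounding itself is simple bucketing: padding a batch size up to the next multiple of 512 lets nearby sizes share one compiled graph. A minimal sketch (the helper name is illustrative; only the 512-element increment comes from the note above):

```python
def round_up_batch(batch_size: int, increment: int = 512) -> int:
    """Round a batch size up to the next multiple of `increment`.

    Batch sizes that round to the same bucket can reuse one compiled
    cuDNN graph instead of each triggering a recompilation.
    """
    return ((batch_size + increment - 1) // increment) * increment

# Nearby large batch sizes map to the same bucket:
print(round_up_batch(513))   # 1024
print(round_up_batch(1000))  # 1024
print(round_up_batch(1024))  # 1024
```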

  • [C] Added the sqrtsoftplus scoring function to the fused MoE router and improved router kernel performance on Blackwell GPUs. (#2633, #2683)
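
    The release notes do not spell out the formula; the name suggests the square root of the softplus function, i.e. score(x) = sqrt(log(1 + e^x)). A sketch under that assumption (not confirmed against TE's router source):

```python
import math

def sqrt_softplus(x: float) -> float:
    """Assumed form of the 'sqrtsoftplus' router score: sqrt(softplus(x)).

    softplus(x) = log(1 + exp(x)) is a smooth, strictly positive
    approximation of ReLU, so its square root is always well defined.
    """
    # log1p(exp(x)) is numerically stable for negative x; for large
    # positive x, softplus(x) ~= x, so use x directly to avoid overflow.
    softplus = x if x > 30.0 else math.log1p(math.exp(x))
    return math.sqrt(softplus)

print(sqrt_softplus(0.0))  # sqrt(log 2), roughly 0.8326
```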

  • [PyTorch] Introduced GroupedTensor, enabling MoE expert weights to be stored as a single contiguous allocation while remaining individually addressable. (#2654)
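
    Conceptually, a grouped tensor backs every expert's weight with one flat allocation and hands out per-expert views into it. A plain-Python sketch of that layout (buffer shapes and names are illustrative, not TE's actual GroupedTensor API):

```python
from array import array

# One contiguous buffer holds all expert weights back to back.
num_experts, rows, cols = 3, 2, 4
flat = array("f", [0.0] * (num_experts * rows * cols))

# Each expert's weight is a view into the same buffer: offsets only,
# no per-expert allocation.
mv = memoryview(flat)
expert_views = [
    mv[e * rows * cols : (e + 1) * rows * cols]
    for e in range(num_experts)
]

# Writing through a view mutates the shared allocation in place.
expert_views[1][0] = 7.0
print(flat[rows * cols])  # 7.0: expert 1 starts right after expert 0
```

Storing the group contiguously lets a single kernel or collective touch all expert weights at once, while the views keep each expert individually addressable.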

  • [PyTorch] Added the fusible GroupedLinear and ScaledSwiGLU ops for building fully fused MoE grouped MLP pipelines. (#2664)

  • [PyTorch] Added the register_forward_fusion and register_backward_fusion APIs, allowing users to define and register custom operator fusion patterns. (#2597)

  • [PyTorch] Added the get_backward_dw_params API to TE modules, fixing weight gradient hook management when using wgrad CUDA Graphs with Megatron-LM. (#2614)

  • [PyTorch] Fixed fused attention bias dimension handling and extended dbias support to additional bias shapes (b1ss, bhss, 11ss, 111s). (#2537)

  • [PyTorch] Reduced peak memory usage in the fused Adam optimizer by fusing BF16 momentum scaling directly into the CUDA kernels, which also enables CUDA Graph capture for this path. (#2632)

  • [PyTorch] Added the sigmoid-gated GLU activation (activation="glu") to LayerNormMLP and TransformerLayer. (#2656)
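
    Functionally, a sigmoid-gated GLU splits the feature dimension in half and gates one half with the sigmoid of the other. The sketch below shows only the common out = a * sigmoid(b) form of the activation; which half gates which, and how it composes with the MLP projections, follows TE's implementation:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def glu(x: list[float]) -> list[float]:
    """Sigmoid-gated GLU over a single feature vector: split the
    features in half and gate the first half with the sigmoid of
    the second, halving the output width.
    """
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]

print(glu([2.0, -1.0, 0.0, 0.0]))  # [1.0, -0.5], since sigmoid(0) = 0.5
```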

  • [PyTorch] Extended debug statistics tracking to NVFP4 quantization (underflow and MSE metrics), and gracefully skipped stat logging for layers not using quantization. (#2296, #2652)
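
    As a rough illustration of what such statistics measure (assumed definitions, not TE's exact ones): underflow is the fraction of nonzero inputs that quantization flushes to zero, and MSE is the mean squared error between the original and dequantized values:

```python
def quantization_stats(original, dequantized):
    """Illustrative debug metrics for a quantize/dequantize round trip
    (assumed definitions, not TE's exact formulas).
    """
    n = len(original)
    # Underflow: nonzero inputs that the low-precision format flushed to zero.
    underflowed = sum(
        1 for o, d in zip(original, dequantized) if o != 0.0 and d == 0.0
    )
    nonzero = sum(1 for o in original if o != 0.0)
    # MSE: mean squared round-trip error over all elements.
    mse = sum((o - d) ** 2 for o, d in zip(original, dequantized)) / n
    return underflowed / max(nonzero, 1), mse

# The tiny 1e-6 value is lost by the hypothetical quantizer:
print(quantization_stats([1.0, 1e-6, 0.0, -2.0], [1.0, 0.0, 0.0, -2.0]))
```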

  • [PyTorch] Fixed CUDA Graph capture for Megatron-Core vision encoder models. (#2657)

  • [JAX] Added an experimental inspect_array debugging utility for dumping tensor snapshots during multi-GPU execution. (#2651)

  • [JAX] Fixed MoE permutation to mask padding tokens and handle tensor sizes correctly under expert parallelism. (#2672)

  • [JAX] MoE permutation now always returns tokens_per_expert, which is required for ragged all-to-all communication under expert parallelism. (#2613)

Fixed Issues

  • [C] Fixed incorrect results from the exp2f_rcp fast-math helper when inputs are NaN or have biased exponent 254. (#2647)
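
    For intuition on why biased exponent 254 is a hazardous input: fast reciprocal tricks for powers of two operate purely on the float's exponent field, and an input with biased exponent 254 maps to exponent 0, i.e. into the zero/subnormal range. The sketch below is an illustrative trick of this family, not TE's exp2f_rcp source; NaN inputs (biased exponent 255) similarly require explicit special-casing in such fast-math helpers:

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a float32 as its raw 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float32."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def naive_rcp_pow2(x: float) -> float:
    """Fast reciprocal of a power of two via exponent arithmetic alone:
    1 / 2^e has biased exponent 254 - (127 + e). Illustrative only --
    an input with biased exponent 254 yields exponent field 0, so the
    result collapses toward zero instead of the true subnormal value.
    """
    b = float_bits(x)
    exp = (b >> 23) & 0xFF
    return bits_float((b & 0x807FFFFF) | (((254 - exp) & 0xFF) << 23))

print(naive_rcp_pow2(2.0))       # 0.5
print(naive_rcp_pow2(2.0**127))  # 0.0, but the true reciprocal is 2**-127
```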

  • [C] Fixed a race condition in Randomized Hadamard Transform amax kernels where a missing memory fence could cause incorrect amax values. (#2695)

  • [PyTorch] Fixed the TE Llama example to work with HuggingFace Transformers 4.57+, which changed decoder layer output conventions. (#2572)

  • [Build] Fixed TypeError during build when NCCL is installed from PyPI as a namespace package without a __file__ attribute. (#2580)

  • [Build] Fixed ModuleNotFoundError when installing from cached source distributions (e.g., via uv) by including build_tools in MANIFEST.in. (#2684)

Breaking Changes in This Release

  • [C] Removed the deprecated packed fused attention C APIs (nvte_fused_attn_{fwd,bwd}_{qkvpacked,kvpacked}); users must migrate to the non-packed API variants. (#2696)

  • Versions of cuBLASMp prior to 0.8.0 are no longer supported.

Deprecated Features

No features are deprecated in this release.