core.fp4_utils#
Utility functions related to FP4 that are used throughout Megatron core
Module Contents#
Functions#
| Function | Description |
|---|---|
| `is_nvfp4tensor` | Check if a tensor is a Transformer Engine NVFP4Tensor. |
| `get_fp4_align_size` | Get the alignment size required for FP4 GEMM. FP4 GEMM requires Blackwell and later architectures. |
| `dequantize_fp4_tensor` | Dequantize an FP4 tensor to a higher precision tensor. |
Data#
API#
- core.fp4_utils.HAVE_TE#
False
- core.fp4_utils.HAVE_TE_FP4_TENSOR_CLASS#
False
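These flags indicate whether Transformer Engine, and its NVFP4 tensor class, are available at import time. A minimal sketch of how such availability flags are commonly set via guarded imports is shown below; the exact import paths and attribute names probed inside Megatron core may differ.

```python
# Sketch only: availability flags like these are usually set with guarded imports.
# The exact module paths and attributes checked inside Megatron core may differ.
HAVE_TE = False
HAVE_TE_FP4_TENSOR_CLASS = False

try:
    import transformer_engine.pytorch as te  # noqa: F401

    HAVE_TE = True
    # NVFP4Tensor only exists in sufficiently recent Transformer Engine releases;
    # probing for the attribute avoids hard-coding a version check (the class may
    # be exposed under a different path in some releases).
    HAVE_TE_FP4_TENSOR_CLASS = hasattr(te, "NVFP4Tensor")
except ImportError:
    pass
```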
- core.fp4_utils.is_nvfp4tensor(tensor: torch.Tensor) → bool#
Check if a tensor is a Transformer Engine NVFP4Tensor.
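A brief usage sketch, assuming the module is importable as `megatron.core.fp4_utils`; the helper function below is hypothetical and only illustrates branching on the check.

```python
import torch

from megatron.core.fp4_utils import is_nvfp4tensor


def describe(tensor: torch.Tensor) -> str:
    # NVFP4 tensors need dequantization before ordinary elementwise math;
    # plain torch tensors can be used directly.
    if is_nvfp4tensor(tensor):
        return "Transformer Engine NVFP4Tensor"
    return f"regular tensor with dtype {tensor.dtype}"


print(describe(torch.randn(4, 4)))  # -> "regular tensor with dtype torch.float32"
```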
- core.fp4_utils.get_fp4_align_size(fp4_recipe: megatron.core.enums.Fp4Recipe) → int#
Get the alignment size required for FP4 GEMM. FP4 GEMM requires Blackwell and later architectures.
The value 32 is a hardware requirement: TMA (Tensor Memory Accelerator) requires a 16-byte aligned address for efficient memory access. Since FP4 uses 4 bits per value, 16 bytes (128 bits) corresponds to 32 FP4 values. Therefore, the alignment size for FP4 is 32. With this alignment, NVFP4 GEMM can be performed efficiently.
Note that since we also apply a random Hadamard transform for NVFP4 training, we want a fused grouped NVFP4 quantize plus Hadamard transform kernel. The Hadamard transform leverages tensor core instructions for better performance, and the grouped quantize kernels also prefer a more aligned size in the token dimension M. To efficiently leverage the grouped kernels, padding needs to be a multiple of 64, and a multiple of 128 is even faster (see the padding sketch below).
When it comes to MoE CUDA graph support, the number of tokens for each expert lives in a buffer in device memory, which means the token dimension of each expert is not known on the host; therefore we cannot compute the zero-padded scaling-factor shapes on the host to comply with the NVFP4 GEMM scaling factor layout. However, if the tokens have already been zero-padded to a multiple of 128, no such padding is needed, and the host does not have to copy the token distribution from device to host (which would break the CUDA graph).
Paper link: https://arxiv.org/pdf/2509.25149
Scaling factor layout: https://docs.nvidia.com/cuda/cublas/#d-block-scaling-factors-layout
TE NVFP4 Grouped Quantization: https://github.com/NVIDIA/TransformerEngine/pull/2411
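The padding sketch below illustrates the arithmetic described above; the `pad_to_multiple` helper is hypothetical and not part of the module, and `get_fp4_align_size` itself takes an `Fp4Recipe` value and returns the alignment (32, per the TMA argument).

```python
def pad_to_multiple(num_tokens: int, align: int) -> int:
    """Round num_tokens up to the next multiple of align."""
    return ((num_tokens + align - 1) // align) * align


# 16-byte TMA alignment at 4 bits per FP4 value -> 32 values.
assert (16 * 8) // 4 == 32

# Grouped-kernel-friendly padding of the token dimension M described above.
for m in (1, 33, 100):
    print(m, pad_to_multiple(m, 64), pad_to_multiple(m, 128))
```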
- core.fp4_utils.dequantize_fp4_tensor(fp4_tensor: torch.Tensor) → torch.Tensor#
Dequantize an FP4 tensor to a higher precision tensor.
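A minimal usage sketch, again assuming the module path `megatron.core.fp4_utils`; the wrapper function is hypothetical and simply guards the call with the type check above.

```python
import torch

from megatron.core.fp4_utils import dequantize_fp4_tensor, is_nvfp4tensor


def to_high_precision(t: torch.Tensor) -> torch.Tensor:
    # Only dequantize when the input really is an NVFP4 tensor; plain tensors are
    # passed through unchanged by this wrapper (not by dequantize_fp4_tensor itself).
    if is_nvfp4tensor(t):
        return dequantize_fp4_tensor(t)  # higher-precision torch.Tensor
    return t
```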