core.fp4_utils#
Utility functions related to FP4 that are used throughout Megatron core
Module Contents#
Functions#
| Function | Description |
|---|---|
| `is_nvfp4tensor` | Check if a tensor is a Transformer Engine NVFP4Tensor. |
| `get_fp4_align_size` | Get the alignment size required for FP4 GEMM. FP4 GEMM requires Blackwell and later architectures. |
| `dequantize_fp4_tensor` | Dequantize an FP4 tensor to a higher-precision tensor. |
Data#
API#
- core.fp4_utils.HAVE_TE#
False
- core.fp4_utils.HAVE_TE_FP4_TENSOR_CLASS#
False
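Flags like these are commonly populated with an import guard. The following is a minimal sketch of that pattern, not the module's actual code; the Transformer Engine import paths are assumptions.

```python
# Minimal sketch of the import-guard pattern; the exact Transformer Engine
# import paths below are assumptions, not this module's actual code.
try:
    import transformer_engine  # noqa: F401

    HAVE_TE = True
except ImportError:
    HAVE_TE = False

try:
    # Assumed location of the NVFP4 tensor class in recent Transformer Engine releases.
    from transformer_engine.pytorch.tensor.nvfp4_tensor import NVFP4Tensor  # noqa: F401

    HAVE_TE_FP4_TENSOR_CLASS = True
except ImportError:
    HAVE_TE_FP4_TENSOR_CLASS = False
```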
- core.fp4_utils.is_nvfp4tensor(tensor: torch.Tensor) → bool#
Check if a tensor is a Transformer Engine NVFP4Tensor.
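A brief usage sketch, assuming `megatron.core` is importable in your environment: a plain PyTorch tensor returns False, while a Transformer Engine NVFP4Tensor would return True.

```python
import torch

from megatron.core.fp4_utils import is_nvfp4tensor

# A plain torch tensor is not a Transformer Engine NVFP4Tensor.
dense = torch.randn(128, 256, dtype=torch.bfloat16)
assert not is_nvfp4tensor(dense)

# Weights quantized by Transformer Engine into NVFP4Tensor instances
# would return True here instead.
```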
- core.fp4_utils.get_fp4_align_size(fp4_recipe: megatron.core.enums.Fp4Recipe) → int#
Get the alignment size required for FP4 GEMM. FP4 GEMM requires Blackwell and later architectures.
The value 32 is a hardware requirement: TMA (Tensor Memory Accelerator) requires a 16-byte aligned address for efficient memory access. Since FP4 uses 4 bits per value, 16 bytes (128 bits) corresponds to 32 FP4 values. Therefore, the alignment size for FP4 is 32. With this alignment, NVFP4 GEMM can be performed efficiently.
Note that since a random Hadamard transform is also applied for NVFP4 training, we want a fused group NVFP4 quantize plus Hadamard transform kernel. The Hadamard transform leverages tensor core instructions for better performance, and the group quantize kernels also prefer a more aligned size in the token dimension M. Therefore, an alignment size of 64 is used here for better performance in MoE (see the sketch below).
Paper link: https://arxiv.org/pdf/2509.25149
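A worked sketch of the alignment arithmetic above; `pad_to_fp4_align` is a hypothetical helper for illustration, not part of this module.

```python
def pad_to_fp4_align(dim: int, align_size: int = 32) -> int:
    """Round a GEMM dimension up to the next multiple of the FP4 alignment.

    16-byte TMA alignment at 4 bits per value gives 16 * 8 / 4 = 32 FP4
    values; MoE token dimensions round up to 64 instead. Hypothetical helper.
    """
    return ((dim + align_size - 1) // align_size) * align_size


assert pad_to_fp4_align(100) == 128                  # padded to a multiple of 32
assert pad_to_fp4_align(100, align_size=64) == 128   # MoE token dimension M
assert pad_to_fp4_align(96) == 96                    # already aligned
```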
- core.fp4_utils.dequantize_fp4_tensor(fp4_tensor: torch.Tensor) → torch.Tensor#
Dequantize an FP4 tensor to a higher-precision tensor.
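A hedged usage sketch; `maybe_dequantize` is a hypothetical wrapper showing how the two utilities compose, not part of this module.

```python
from megatron.core.fp4_utils import dequantize_fp4_tensor, is_nvfp4tensor


def maybe_dequantize(tensor):
    """Return a higher-precision copy if the input is an NVFP4Tensor (hypothetical wrapper)."""
    if is_nvfp4tensor(tensor):
        return dequantize_fp4_tensor(tensor)
    return tensor
```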