core.fp4_utils#
Utility functions related to FP4 that are used throughout Megatron core
Module Contents#
Functions#
- is_nvfp4tensor: Check if a tensor is a Transformer Engine NVFP4Tensor.
- get_nvfp4_rowwise_packed_shape: Return packed byte shape for NVFP4 rowwise storage (last dim // 2).
- modify_nvfp4_rowwise_storage: Replace NVFP4 tensor’s rowwise raw data with a new uint8 storage view.
- quantize_nvfp4_param_shard: Cast shard FP32 master weights to NVFP4 model params (rowwise/columnwise).
- get_fp4_align_size: Get the alignment size required for FP4 GEMM. FP4 GEMM requires Blackwell and later architectures.
- dequantize_fp4_tensor: Dequantize an FP4 tensor to a higher-precision tensor.
Data#
API#
- core.fp4_utils.HAVE_TE#
False
- core.fp4_utils.HAVE_TE_FP4_TENSOR_CLASS#
False
- core.fp4_utils.is_nvfp4tensor(tensor: torch.Tensor) → bool#
Check if a tensor is a Transformer Engine NVFP4Tensor.
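A minimal sketch of this kind of guarded type check, assuming Transformer Engine exposes an NVFP4Tensor class (the import path below is an assumption, not confirmed by the source); the check degrades to False when the class is unavailable, matching the HAVE_TE_FP4_TENSOR_CLASS = False default above:

```python
# Hedged sketch: the import path is assumed; when Transformer Engine's
# FP4 tensor class is unavailable, nothing can be an NVFP4Tensor.
try:
    from transformer_engine.pytorch.tensor.nvfp4_tensor import NVFP4Tensor
    HAVE_TE_FP4_TENSOR_CLASS = True
except ImportError:
    HAVE_TE_FP4_TENSOR_CLASS = False

def is_nvfp4tensor(tensor) -> bool:
    # Falls back to False if the TE class could not be imported.
    return HAVE_TE_FP4_TENSOR_CLASS and isinstance(tensor, NVFP4Tensor)
```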
- core.fp4_utils.get_nvfp4_rowwise_packed_shape(shape: torch.Size) → torch.Size#
Return packed byte shape for NVFP4 rowwise storage (last dim // 2).
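The packed-shape rule can be sketched as follows: two FP4 values (4 bits each) share one uint8 byte, so the last dimension of the raw storage is halved. Plain tuples stand in for torch.Size here; the helper name is illustrative, not the library's implementation.

```python
# Two 4-bit FP4 values are packed per uint8 byte, halving the last dim.
def nvfp4_rowwise_packed_shape(shape):
    assert shape[-1] % 2 == 0, "last dim must be even to pack two FP4 values per byte"
    return (*shape[:-1], shape[-1] // 2)

nvfp4_rowwise_packed_shape((1024, 4096))  # -> (1024, 2048)
```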
- core.fp4_utils.modify_nvfp4_rowwise_storage(fp4_tensor: torch.Tensor, new_rowwise_data: torch.Tensor)#
Replace NVFP4 tensor’s rowwise raw data with a new uint8 storage view.
Copies existing bytes into the new buffer, then swaps the underlying pointer.
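The copy-then-swap pattern can be sketched with a plain dict and bytearrays standing in for the NVFP4 tensor and its uint8 storage (the real function operates on Transformer Engine internals, not shown here):

```python
# Hedged sketch: a dict field stands in for the tensor's raw rowwise bytes.
def swap_rowwise_storage(fp4_like: dict, new_rowwise_data: bytearray) -> None:
    old = fp4_like["rowwise_data"]
    assert len(new_rowwise_data) >= len(old), "new buffer must hold the old bytes"
    new_rowwise_data[: len(old)] = old           # copy existing packed bytes first
    fp4_like["rowwise_data"] = new_rowwise_data  # then swap the underlying buffer

t = {"rowwise_data": bytearray(b"\x12\x34")}
swap_rowwise_storage(t, bytearray(4))
```

Copying before swapping keeps the tensor's contents intact while the storage is re-pointed at the larger (or differently shaped) buffer.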
- core.fp4_utils.quantize_nvfp4_param_shard(model_params, main_params, start_offsets, data_parallel_group, fsdp_shard_model_params=None)#
Cast shard FP32 master weights to NVFP4 model params (rowwise/columnwise).
This function wraps Transformer Engine’s quantize_master_weights, which handles:
- Two-level NVFP4 scaling (global FP32 scale + per-block FP8 E4M3 scale)
- Partial casting with nibble-accurate updates
- Coordinated amax reduction across the data parallel group
- Parameters:
model_params – List of NVFP4 model parameters (NVFP4Tensor).
main_params – List of FP32 master weights (shards).
start_offsets – List of starting offsets in the full model weight for each shard.
data_parallel_group – Distributed group for amax reduction.
fsdp_shard_model_params – Optional list of FSDP sharded model params.
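The two-level scaling mentioned above can be sketched in isolation. The constants and formulas are assumptions about the general NVFP4 scheme (FP4 E2M1 max magnitude 6.0, FP8 E4M3 max magnitude 448.0, 16-value scaling blocks), not Transformer Engine's implementation, and the distributed amax reduction is omitted:

```python
# Hedged sketch of two-level NVFP4 scaling: a global FP32 scale maps the
# per-block scales into FP8 E4M3 range; dequantize is q * block_scale * global_scale.
FP4_MAX = 6.0     # largest magnitude representable in FP4 E2M1 (assumed)
E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3 (assumed)
BLOCK = 16        # assumed NVFP4 scaling block size

def two_level_scales(values):
    amax = max(abs(v) for v in values)
    # Global scale chosen so every per-block scale fits in E4M3 range.
    global_scale = (amax / FP4_MAX) / E4M3_MAX if amax else 1.0
    block_scales = []
    for i in range(0, len(values), BLOCK):
        block_amax = max(abs(v) for v in values[i:i + BLOCK])
        block_scales.append(block_amax / (FP4_MAX * global_scale) if block_amax else 1.0)
    return global_scale, block_scales
```

By construction, block_scale * global_scale * FP4_MAX recovers each block's amax, and no block scale exceeds the E4M3 range.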
- core.fp4_utils.get_fp4_align_size(fp4_recipe: megatron.core.enums.Fp4Recipe) → int#
Get the alignment size required for FP4 GEMM. FP4 GEMM requires Blackwell and later architectures.
The value 32 is a hardware requirement: TMA (Tensor Memory Accelerator) requires a 16-byte aligned address for efficient memory access. Since FP4 uses 4 bits per value, 16 bytes (128 bits) corresponds to 32 FP4 values. Therefore, the alignment size for FP4 is 32. With this alignment, NVFP4 GEMM can be performed efficiently.
Note that since we also apply a random Hadamard transform for NVFP4 training, we want a fused grouped NVFP4 quantize plus Hadamard transform. The Hadamard transform leverages tensor core instructions for better performance, and grouped quantize kernels also prefer a well-aligned token dimension M. To leverage grouped kernels efficiently, padding needs to be a multiple of 64, and a multiple of 128 is faster still.
When it comes to MoE CUDA graph support, the number of tokens for each expert lives in a buffer in device memory, which means the token dimension for each expert is not known on the host; therefore we cannot compute the zero-padded scaling-factor shape on the host to comply with the NVFP4 GEMM scaling-factor layout. However, if the tokens have already been zero-padded to a multiple of 128, no such padding is needed, so the host does not need to copy the token distribution from device to host (which would break the CUDA graph).
Paper link: https://arxiv.org/pdf/2509.25149
Scaling factor layout: https://docs.nvidia.com/cuda/cublas/#d-block-scaling-factors-layout
TE NVFP4 grouped quantization: https://github.com/NVIDIA/TransformerEngine/pull/2411
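Rounding a dimension up to these alignment sizes is simple arithmetic; a minimal sketch (the helper name is illustrative):

```python
# Round dim up to the next multiple of align: 32 is the hard FP4 GEMM
# requirement; 64/128 multiples help grouped quantize kernels, and 128
# avoids the MoE CUDA-graph padding issue discussed above.
def pad_to_multiple(dim: int, align: int) -> int:
    return ((dim + align - 1) // align) * align

pad_to_multiple(1000, 32)   # -> 1024
pad_to_multiple(1000, 128)  # -> 1024
```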
- core.fp4_utils.dequantize_fp4_tensor(fp4_tensor: torch.Tensor) → torch.Tensor#
Dequantize an FP4 tensor to a higher-precision tensor.
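A sketch of FP4 (E2M1) decoding for one packed byte. The code table (1 sign bit, 2 exponent bits, 1 mantissa bit) is the standard E2M1 mapping; the nibble order (low nibble first) is an assumption, and real NVFP4 dequantization additionally applies the block and global scales:

```python
# E2M1 magnitudes for codes 0..7: 1 sign bit + 2 exponent bits + 1 mantissa bit.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequantize_fp4_byte(byte: int, scale: float = 1.0):
    out = []
    for nibble in (byte & 0x0F, byte >> 4):  # two FP4 values per uint8 (order assumed)
        sign = -1.0 if nibble & 0x8 else 1.0
        out.append(sign * E2M1_MAGNITUDES[nibble & 0x7] * scale)
    return out

dequantize_fp4_byte(0xF7)  # -> [6.0, -6.0]
```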