to_block_scale

nvmath.linalg.advanced.helpers.matmul.to_block_scale(
    scale_tensor: torch.Tensor,
    operand_or_shape: torch.Tensor | tuple[int, ...],
    block_scaling_format: BlockScalingFormat,
    *,
    axis: Literal[-1, -2] | None = None,
    out: torch.Tensor | None = None,
)
This function is experimental and potentially subject to future changes.
Copy an ND scale tensor to a flat tensor, accounting for the tiled layout required by cuBLASLt.
Matmul (cuBLAS) expects scale factors in a specific interleaved layout. This function abstracts away the interleaved-layout details: it lets you specify the scales as an ND tensor whose shape corresponds to the operand's shape, and copies them into the cuBLAS-compatible interleaved layout.
Example
Suppose you are doing an NVFP4 matmul a @ b with a of shape (M=128, K=128). For matrix a, a single scale is applied to each block of 16 consecutive elements in a row (axis=-1). You can specify the block scales as an ND tensor with shape (M, K // 16) such that the scale from scale_tensor[i, j] is applied to the block of elements a[i, j*16:j*16+16], and then call to_block_scale(scale_tensor, a, BlockScalingFormat.NVFP4), which returns a 1D interleaved scale tensor that can be passed as the quantization scales for the matmul.
Note
As far as computing the block scale offset is concerned, the only difference between MXFP8 and NVFP4 is the number of elements in a block (32 for MXFP8, 16 for NVFP4).
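Independent of nvmath, the block-to-scale mapping in the example above can be sketched in plain NumPy. The names and values here are illustrative only, not part of the nvmath API:

```python
import numpy as np

# Block sizes from the note above: 16 elements per block for NVFP4, 32 for MXFP8.
BLOCK = {"NVFP4": 16, "MXFP8": 32}

M, K = 128, 128
block = BLOCK["NVFP4"]

# ND scale tensor with shape (M, K // 16): scales[i, j] applies to the
# operand elements a[i, j*16:(j+1)*16].
scales = np.arange(M * (K // block), dtype=np.float32).reshape(M, K // block)

# Repeating each scale across its 16-element row block makes the mapping explicit.
expanded = np.repeat(scales, block, axis=-1)  # shape (M, K)

# Element a[5, 37] lies in row-block j = 37 // 16 == 2.
assert expanded[5, 37] == scales[5, 37 // block]
```

to_block_scale performs this per-block association, but instead of expanding the scales it writes them into the flat interleaved layout that cuBLAS consumes.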
- Parameters:
  - scale_tensor – ND scale tensor with dtype:
    - for NVFP4: torch.float8_e4m3fn or torch.uint8 (interpreted as torch.float8_e4m3fn)
    - for MXFP8: torch.uint8 (interpreted as UE8M0)
operand_or_shape – Operand tensor (that the scales apply to) or the operand’s logical (non-packed, non-blocked) shape.
  - block_scaling_format – The block scaling format of the operand: BlockScalingFormat.NVFP4 or BlockScalingFormat.MXFP8. Internally, it is validated to be consistent with the operand dtype, and a ValueError is raised if not.
  - axis – The blocked dimension of the operand tensor. For example, for an NVFP4/MXFP8 matmul, A is blocked in rows (axis = -1) and B is blocked in columns (axis = -2). Depending on operand_or_shape:
    - if a shape is passed to operand_or_shape, then axis is required;
    - if an operand is passed to operand_or_shape, then axis can be omitted and the blocked dimension is inferred from the operand's layout.
  - out – Output tensor to copy the scales to. If None, a new tensor is created.
- Returns:
  Flat out tensor containing the scales copied to match the cuBLAS-compatible interleaved layout. The out dtype is the same as the scale_tensor dtype.
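To make the axis convention concrete, here is a small NumPy sketch (hypothetical shapes, not nvmath code) of how scales for a column-blocked operand B (axis = -2) relate to the operand's elements:

```python
import numpy as np

N, K = 64, 128
block = 16  # NVFP4 block size; MXFP8 would use 32

# B has shape (K, N) and is blocked along axis=-2: scales_b[j, n] applies to
# the column block b[j*block:(j+1)*block, n].
scales_b = np.arange((K // block) * N, dtype=np.float32).reshape(K // block, N)

# Repeating along axis=-2 expands each scale over its 16-row block of B.
expanded_b = np.repeat(scales_b, block, axis=-2)  # shape (K, N)

# Element b[37, 3] lies in column-block j = 37 // 16 == 2.
assert expanded_b[37, 3] == scales_b[37 // block, 3]
```

With an operand tensor passed as operand_or_shape, to_block_scale infers this blocked dimension from the operand's layout; with a plain shape, axis must be given explicitly.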