to_block_scale

nvmath.linalg.advanced.helpers.matmul.to_block_scale(
scale_tensor: torch.Tensor,
operand_or_shape: torch.Tensor | tuple[int, ...],
block_scaling_format: BlockScalingFormat,
*,
axis: Literal[-1, -2] | None = None,
out: torch.Tensor | None = None,
) → torch.Tensor

This function is experimental and potentially subject to future changes.

Copy an ND scale tensor to a flat tensor, accounting for the tiled layout required by cuBLASLt.

Matmul (cuBLAS) expects scale factors in a specific interleaved layout.

This function abstracts away the interleaved layout details: it lets you specify the scales as an ND tensor whose shape corresponds to the operand's shape, and copies them into the cuBLAS-compatible interleaved layout.

Example

Suppose that you are doing an NVFP4 matmul a @ b with a of shape (M=128, K=128). For matrix a, a single scale is applied to each block of 16 consecutive elements in a row (axis=-1). You can specify the block scales as an ND tensor of shape (M, K // 16), such that the scale scale_tensor[i, j] is applied to the block of elements a[i, j*16:j*16+16], and then call to_block_scale(scale_tensor, a, BlockScalingFormat.NVFP4), which returns a 1D interleaved scale tensor that can be passed as quantization scales for the matmul.
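The element-to-scale mapping described in this example can be sketched in plain Python (block size 16 for NVFP4; the helper name is hypothetical, for illustration only, and is not part of nvmath):

```python
NVFP4_BLOCK_SIZE = 16  # one scale per 16 consecutive elements in a row

def scale_index(i: int, k: int, block_size: int = NVFP4_BLOCK_SIZE) -> tuple[int, int]:
    """Return the index into the ND scale tensor for element a[i, k].

    scale_tensor[i, j] applies to the block a[i, j*block_size : (j+1)*block_size].
    """
    return i, k // block_size

# Elements a[0, 0:16] all share scale_tensor[0, 0]:
assert scale_index(0, 0) == (0, 0)
assert scale_index(0, 15) == (0, 0)
# a[0, 16] starts the next block and uses scale_tensor[0, 1]:
assert scale_index(0, 16) == (0, 1)
```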

Note

For the purpose of computing block scale offsets, the only difference between MXFP8 and NVFP4 is the number of elements in a block (32 for MXFP8, 16 for NVFP4).
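A minimal sketch of the shape arithmetic this note implies (the helper and its names are illustrative, not part of the nvmath API; it assumes the blocked dimension tiles exactly):

```python
# Elements per block, per the note above.
BLOCK_SIZE = {"NVFP4": 16, "MXFP8": 32}

def nd_scale_shape(operand_shape: tuple[int, ...], fmt: str, axis: int) -> tuple[int, ...]:
    """Shape of the ND scale tensor for an operand blocked along `axis`.

    Simplifying assumption: the blocked dimension is an exact multiple
    of the block size (no padding is modeled here).
    """
    shape = list(operand_shape)
    shape[axis] //= BLOCK_SIZE[fmt]
    return tuple(shape)

# NVFP4 operand of shape (128, 128) blocked in rows -> (128, 8) scales
assert nd_scale_shape((128, 128), "NVFP4", -1) == (128, 8)
# MXFP8 uses 32-element blocks -> (128, 4) scales
assert nd_scale_shape((128, 128), "MXFP8", -1) == (128, 4)
```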

Parameters:
  • scale_tensor

    ND scale tensor with dtype:

    • for NVFP4: torch.float8_e4m3fn or torch.uint8 (interpreted as torch.float8_e4m3fn)

    • for MXFP8: torch.uint8 (interpreted as UE8M0)

  • operand_or_shape – Operand tensor (that the scales apply to) or the operand’s logical (non-packed, non-blocked) shape.

  • block_scaling_format – The block scaling format of the operand: BlockScalingFormat.NVFP4 or BlockScalingFormat.MXFP8. Internally, it is validated to be consistent with the operand dtype, and a ValueError is raised if not.

  • axis

    The blocked dimension of the operand tensor. For example, for NVFP4/MXFP8 matmul, A is blocked in rows (axis = -1), and B is blocked in columns (axis = -2). Depending on operand_or_shape:

    • if a shape is passed to operand_or_shape, then axis is required;

    • if an operand is passed to operand_or_shape, then axis can be omitted and the blocked dimension is inferred from the operand’s layout.

  • out – Output tensor to copy the scales to. If None, a new tensor is created.

Returns:

Flat out tensor containing the scales, copied into the cuBLAS-compatible interleaved layout. The dtype of out is the same as the dtype of scale_tensor.
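Putting the pieces together, a hedged sketch of the ND scale shapes for a full a @ b matmul, assuming a has shape (M, K) and is blocked in rows (axis=-1) while b has shape (K, N) and is blocked in columns (axis=-2), as described under the axis parameter (the function name is illustrative only):

```python
def matmul_scale_shapes(m: int, n: int, k: int, block_size: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """ND scale tensor shapes for the two operands of a (M, K) @ (K, N) matmul."""
    # a is blocked in rows (axis=-1): one scale per `block_size`
    # consecutive elements of a row.
    a_scales = (m, k // block_size)
    # b is blocked in columns (axis=-2): one scale per `block_size`
    # consecutive elements of a column.
    b_scales = (k // block_size, n)
    return a_scales, b_scales

# NVFP4 (block size 16) with M = N = K = 128:
assert matmul_scale_shapes(128, 128, 128, 16) == ((128, 8), (8, 128))
```

Each of these ND scale tensors would then be passed through to_block_scale to obtain the flat, interleaved tensor that cuBLAS expects.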