get_block_scale_offset
nvmath.linalg.advanced.helpers.matmul.get_block_scale_offset(
    index: tuple[int, ...] | tuple[torch.Tensor, ...],
    operand_or_shape: torch.Tensor | tuple[int, ...],
    block_scaling_format: BlockScalingFormat,
    *,
    axis: Literal[-1, -2] | None = None,
)
This function is experimental and potentially subject to future changes.
Computes the offset of a block scale factor in the 1D interleaved scales tensor.
Matmul (cuBLAS) expects scale factors in a specific interleaved layout.
This function abstracts away the interleaved layout details, offering indexing that corresponds more directly to the operand's shape.
Example

Suppose that you are doing an NVFP4 matmul a @ b with a of shape (M=128, K=128). For matrix a, a single scale is applied to each block of 16 consecutive elements in a row (axis=-1). Therefore, to find the scale applied to a[y, x], we first need to adjust the x index to the index of the 16-element block it belongs to, which is block_idx = x // 16. Then, calling:

    get_block_scale_offset((y, block_idx), a, BlockScalingFormat.NVFP4)

will return the offset of the scale applied to a[y, x] (and all other elements in the same 16-element block).

The schematic below shows matrix a with the 16-element blocks annotated. Asterisks mark two target blocks:

- elements of a at indices (5, 32) through (5, 47) belong to the same block (K-group 2) and map to the same offset: get_block_scale_offset((5, 2), a, BlockScalingFormat.NVFP4) == 82
- elements of a at indices (5, 80) through (5, 95) belong to the same block (K-group 5) and map to the same offset: get_block_scale_offset((5, 5), a, BlockScalingFormat.NVFP4) == 593

            | K-grp 0  | K-grp 1  | K-grp 2  | K-grp 3  | K-grp 4  | K-grp 5  | ...
            | [0..15]  | [16..31] | [32..47] | [48..63] | [64..79] | [80..95] | ...
    --------+----------+----------+----------+----------+----------+----------+---
    row 0   |          |          |          |          |          |          | ...
            |          |          |          |          |          |          |
    row 5   |          |          |    *     |          |          |    *     | ...
            |          |          |          |          |          |          |
    row 127 |          |          |          |          |          |          |
    --------+----------+----------+----------+----------+----------+----------+---
                                    (5,2)                             (5,5)

Note
When computing the block scale offset, the only difference between MXFP8 and NVFP4 is the number of elements in a block (32 for MXFP8, 16 for NVFP4).
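The index arithmetic can be sketched in plain Python. The tile geometry below (128-row, 4-scale-column tiles) is an assumption inferred only from the two offsets in the example above (82 and 593), and the helper names are hypothetical; this is an illustration, not the library's implementation — use get_block_scale_offset() in real code.

```python
def element_to_block_idx(x: int, block_size: int) -> int:
    # Map an element's column index to its scale block index.
    # block_size is 32 for MXFP8 and 16 for NVFP4.
    return x // block_size


def block_scale_offset_sketch(row: int, block_idx: int) -> int:
    # Hypothetical reconstruction of the interleaved layout for a single
    # 128-row tile, inferred from the two documented offsets (82, 593).
    # NOT the library's implementation.
    tile_rows = 128
    chunk = block_idx // 4  # which 4-wide group of scale columns
    r = row % tile_rows
    in_tile = (r % 32) * 16 + (r // 32) * 4 + (block_idx % 4)
    return chunk * (tile_rows * 4) + in_tile
```

Under this assumed geometry, the sketch reproduces both offsets from the example: rows 5, blocks 2 and 5 yield 82 and 593.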
- Parameters:
  - index – A tuple of indices with length equal to len(operand_shape). Can be:
    - a tuple of integers for a single-element query, e.g., (10, 20)
    - a tuple of tensors for a batch query, e.g., (xs, ys) where xs and ys are tensors of the same shape
  - operand_or_shape – The operand tensor (that the scales apply to) or the operand's logical (non-packed, non-blocked) shape.
  - block_scaling_format – The block scaling format of the operand: BlockScalingFormat.NVFP4 or BlockScalingFormat.MXFP8. Internally, it is validated to be consistent with the operand dtype, and a ValueError is raised if not.
  - axis – The blocked dimension of the operand tensor. For example, for NVFP4/MXFP8 matmul, A is blocked in rows (axis = -1) and B is blocked in columns (axis = -2). Depending on operand_or_shape:
    - if a shape is passed to operand_or_shape, then axis is required
    - if an operand is passed to operand_or_shape, then axis can be omitted and the blocked dimension is inferred from the operand's layout.
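The single-element vs. batch query semantics of index can be sketched without tensors: a tuple of equal-length lists plays the role of a tuple of same-shape index tensors, producing one offset per position. query_offsets and row_major below are hypothetical illustrations, not library code.

```python
def query_offsets(index, offset_fn):
    # index is either a tuple of ints (single-element query) or a tuple
    # of equal-length lists (batch query, standing in for index tensors).
    if all(isinstance(i, int) for i in index):
        return offset_fn(*index)                        # single result
    return [offset_fn(*elem) for elem in zip(*index)]   # one per position


# Toy offset function for illustration only (a plain row-major scales
# layout with 8 scale columns -- NOT the interleaved layout cuBLAS expects).
def row_major(y, block_idx):
    return y * 8 + block_idx
```

For example, query_offsets((5, 2), row_major) returns a single integer, while query_offsets(([5, 5], [2, 5]), row_major) returns a list of two offsets, mirroring the tensor batch query.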
- Returns:
  An integer (if index contains integers) or a tensor of integers (if index contains tensors), indicating the offset(s) of the MXFP8/NVFP4 block scale factor(s). The returned offset points to a block scale factor that is applied to:
  - for axis == -2: operand[*index[:-2], block_size*index[-2]:block_size*(index[-2]+1), index[-1]]
  - for axis == -1: operand[*index[:-2], index[-2], block_size*index[-1]:block_size*(index[-1]+1)]

  where the block size is 32 for MXFP8 and 16 for NVFP4.
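The element ranges covered by one block scale factor can be expressed as a small helper for the 2D case; covered_elements is a hypothetical name used only to illustrate the slices described above.

```python
def covered_elements(index, block_size, axis):
    # Return the (row, column) slices of the operand elements that one
    # block scale factor covers (2D case). block_size is 32 for MXFP8
    # and 16 for NVFP4.
    y, x = index[-2], index[-1]
    if axis == -1:
        # Blocks run along rows: x indexes a block of columns.
        return slice(y, y + 1), slice(block_size * x, block_size * (x + 1))
    # axis == -2: blocks run along columns: y indexes a block of rows.
    return slice(block_size * y, block_size * (y + 1)), slice(x, x + 1)
```

For the NVFP4 example above, covered_elements((5, 2), 16, -1) gives row 5 and columns 32:48, i.e., elements (5, 32) through (5, 47).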
Note

In typical use-cases, there should be no need to manually modify MXFP8 scales. The scales returned as "d_out_scale" by one matmul can be directly reused as input scales for another matmul.

Hint

To apply the interleaved scales (e.g., as returned by matmul's d_out_scale) to the operand, use apply_mxfp8_scale() instead. To specify scales as an ND tensor and copy them into the cuBLAS-compatible interleaved layout, use to_block_scale() instead.