get_block_scale_offset

nvmath.linalg.advanced.helpers.matmul.get_block_scale_offset(
index: tuple[int, ...] | tuple[torch.Tensor, ...],
operand_or_shape: torch.Tensor | tuple[int, ...],
block_scaling_format: BlockScalingFormat,
*,
axis: Literal[-1, -2] | None = None,
) → int | torch.Tensor

This function is experimental and potentially subject to future changes.

Computes the offset of a block scale factor in the 1D interleaved scales tensor.

Matmul (cuBLAS) expects scale factors in a specific interleaved layout.

This function aims to abstract away the interleaved layout details, offering indexing that more directly corresponds to the operand’s shape.

Example

Suppose that you are doing an NVFP4 matmul a @ b with a of shape (M=128, K=128). For matrix a, a single scale is applied to each block of 16 consecutive elements in a row (axis=-1). Therefore, to find the scale applied to a[y, x], we first need to adjust the x index to the index of the 16-element block it belongs to, which is block_idx = x // 16. Then calling get_block_scale_offset((y, block_idx), a, BlockScalingFormat.NVFP4) will return the offset of the scale applied to a[y, x] (and all other elements in the same 16-element block).

The schematic below shows matrix a with the 16-element blocks annotated. Asterisks mark two target blocks:

  • elements of a at indices (5, 32) through (5, 47) belong to the same block (K-group 2) and map to the same offset: get_block_scale_offset((5, 2), a, BlockScalingFormat.NVFP4) == 82

  • elements of a at indices (5, 80) through (5, 95) belong to the same block (K-group 5) and map to the same offset: get_block_scale_offset((5, 5), a, BlockScalingFormat.NVFP4) == 593

      | K-grp 0  | K-grp 1  | K-grp 2  | K-grp 3  | K-grp 4  | K-grp 5  | ...
      | [0..15]  | [16..31] | [32..47] | [48..63] | [64..79] | [80..95] | ...
      +----------+----------+----------+----------+----------+----------+---
row 0 |          |          |          |          |          |          |
 ...  |          |          |          |          |          |          |
row 5 |          |          |    *     |          |          |    *     |
 ...  |          |          |          |          |          |          |
row127|          |          |          |          |          |          |
      +----------+----------+----------+----------+----------+----------+---
                              (5,2)                            (5,5)
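The two offsets above can be reproduced with a small pure-Python sketch of the 128-row by 4-block tiled layout that cuBLAS uses for block scale factors. This is an illustrative reimplementation only, based on the assumption that tiles are laid out row-major; the actual layout is owned by get_block_scale_offset(), which should always be preferred:

```python
def nvfp4_scale_offset(row, k_block, num_rows, num_k_blocks):
    """Illustrative sketch of the interleaved scale-factor layout.

    Scales are grouped into tiles of 128 rows x 4 K-blocks (512 entries).
    Within a tile, the entry for (row, k_block) sits at
    (row % 32) * 16 + ((row % 128) // 32) * 4 + (k_block % 4).
    The row-major tile order below is an assumption of this sketch.
    """
    TILE_ROWS, TILE_COLS = 128, 4
    num_col_tiles = (num_k_blocks + TILE_COLS - 1) // TILE_COLS
    tile = (row // TILE_ROWS) * num_col_tiles + (k_block // TILE_COLS)
    within = (row % 32) * 16 + ((row % TILE_ROWS) // 32) * 4 + (k_block % TILE_COLS)
    return tile * (TILE_ROWS * TILE_COLS) + within

# The two starred blocks from the schematic (M=128, K=128 -> 8 K-groups):
print(nvfp4_scale_offset(5, 2, 128, 8))  # 82
print(nvfp4_scale_offset(5, 5, 128, 8))  # 593
```

Both values match the schematic: (5, 2) lands in the first tile at within-tile position 82, while (5, 5) lands in the second tile (base 512) at within-tile position 81, giving 593.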

Note

As far as computing the block scale offset is concerned, the only difference between MXFP8 and NVFP4 is the number of elements in a block (32 for MXFP8, 16 for NVFP4).

Parameters:
  • index

    A tuple of indices with length equal to the number of dimensions of the operand (or of the shape passed as operand_or_shape). Can be:

    • A tuple of integers for single-element query, e.g., (10, 20)

    • A tuple of tensors for batch query, e.g., (xs, ys) where xs and ys are tensors of the same shape

  • operand_or_shape – Operand tensor (that the scales apply to) or the operand’s logical (non-packed, non-blocked) shape.

  • block_scaling_format – The block scaling format of the operand: BlockScalingFormat.NVFP4 or BlockScalingFormat.MXFP8. Internally, it is validated to be consistent with the operand dtype, and a ValueError is raised if not.

  • axis

    The blocked dimension of the operand tensor. For example, for NVFP4/MXFP8 matmul, A is blocked in rows (axis = -1), and B is blocked in columns (axis = -2). Depending on operand_or_shape:

    • if a shape is passed to operand_or_shape, then axis is required

    • if an operand is passed to operand_or_shape, then axis can be omitted and the blocked dimension is inferred from the operand’s layout.
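To make the axis convention concrete, the following pure-Python sketch maps an element's index to the index tuple that get_block_scale_offset expects, for both operands of a @ b (the helper name here is hypothetical, used only for illustration):

```python
def element_to_scale_index(element_index, block_size, axis):
    """Map an element index to a block-scale index tuple.

    axis=-1: blocks run along the last dimension (operand a of a @ b),
    so the column index is divided down to its block index.
    axis=-2: blocks run along the second-to-last dimension (operand b),
    so the row index is divided down instead.
    Leading (batch) indices are passed through unchanged.
    """
    *batch, y, x = element_index
    if axis == -1:
        return (*batch, y, x // block_size)
    elif axis == -2:
        return (*batch, y // block_size, x)
    raise ValueError("axis must be -1 or -2")

# NVFP4 (block_size=16): element a[5, 32] belongs to K-group 2 of row 5.
print(element_to_scale_index((5, 32), 16, -1))  # (5, 2)
# b is blocked in columns: element b[32, 7] belongs to block 2 of column 7.
print(element_to_scale_index((32, 7), 16, -2))  # (2, 7)
```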

Returns:

An integer (if index contains integers) or a tensor of integers (if index contains tensors), indicating the offset(s) to the MXFP8/NVFP4 block scale factor(s). The returned offset points to a block scale factor that is applied to:

  • for axis == -2: operand[*index[:-2], block_size*index[-2]:block_size*(index[-2]+1), index[-1]].

  • for axis == -1: operand[*index[:-2], index[-2], block_size*index[-1]:block_size*(index[-1]+1)].

where the block size is 32 for MXFP8 and 16 for NVFP4.
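The covered element range described above can be sketched as a small helper that turns an index tuple into the selectors of the elements sharing one scale factor (an illustrative sketch only; the block sizes come from the note above):

```python
BLOCK_SIZE = {"MXFP8": 32, "NVFP4": 16}  # elements covered per scale factor

def covered_region(index, block_scaling_format, axis):
    """Return the operand selectors for the elements sharing the scale
    factor addressed by `index` (per the Returns description above).

    Batch indices (index[:-2]) are passed through unchanged; the blocked
    dimension is expanded into a block_size-wide slice.
    """
    bs = BLOCK_SIZE[block_scaling_format]
    *batch, i, j = index
    if axis == -1:
        # index[-1] is the block index along the last dimension.
        return (*batch, i, slice(bs * j, bs * (j + 1)))
    elif axis == -2:
        # index[-2] is the block index along the second-to-last dimension.
        return (*batch, slice(bs * i, bs * (i + 1)), j)
    raise ValueError("axis must be -1 or -2")

# NVFP4, axis=-1: the scale at index (5, 2) covers a[5, 32:48],
# matching K-group 2 in the schematic above.
print(covered_region((5, 2), "NVFP4", -1))  # (5, slice(32, 48, None))
```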

Note

In typical use cases, there should be no need to manually modify MXFP8 scales. The scales returned as "d_out_scale" by one matmul can be directly reused as input scales for another matmul.

Hint

  • To apply the interleaved scales (e.g. as returned by matmul’s d_out_scale) to the operand, use apply_mxfp8_scale() instead.

  • To specify scales as an ND tensor and copy them to a cuBLAS-compatible interleaved layout, use to_block_scale() instead.