MatmulQuantizationScales#

class nvmath.distributed.linalg.advanced.MatmulQuantizationScales(
a: int | float | AnyTensor | None = None,
b: int | float | AnyTensor | None = None,
c: int | float | AnyTensor | None = None,
d: int | float | AnyTensor | None = None,
)[source]#

A data class for providing quantization_scales to the Matmul constructor and to the wrapper function matmul().

Scales can only be set for narrow-precision (FP8 and lower) matrices.

FP8 operations (block_scaling=False, per-tensor scaling):

  • Scale format: scalar (integer or float) or single-element tensor of shape () or (1,)

  • A single scale value is applied to the entire tensor.
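One common recipe for choosing a per-tensor FP8 scale derives it from the tensor's absolute maximum, so that the largest value maps onto the representable FP8 range. This is a conventional sketch, not something this class prescribes; `per_tensor_scale` and `E4M3_MAX` are illustrative names:

```python
import numpy as np

# Hypothetical helper: derive a scalar per-tensor scale from the data's
# absolute maximum. E4M3_MAX is the largest finite float8_e4m3fn value.
# This amax-based recipe is a common convention, not mandated by the API.
E4M3_MAX = 448.0

def per_tensor_scale(tensor: np.ndarray) -> float:
    amax = float(np.abs(tensor).max())
    return E4M3_MAX / amax  # one scalar applied to the whole tensor
```

A scalar computed this way can be passed directly, e.g. as the ``a`` or ``b`` argument of MatmulQuantizationScales.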

MXFP8 operations (block_scaling=True, microscaling with FP8 operands):

  • Scale format: 1D tensor with layout matching cuBLAS MXFP8 requirements

  • Scale dtype: uint8 (interpreted as UE8M0 values by cuBLAS)

  • A scale value \(x\) causes cuBLAS to multiply the respective block by \(2^{x-127}\).
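The UE8M0 decoding rule above can be sketched as follows (a minimal illustration of the arithmetic only; the actual uint8 tensor layout must still follow the cuBLAS MXFP8 requirements):

```python
import numpy as np

def decode_ue8m0(scales: np.ndarray) -> np.ndarray:
    # Each uint8 value x encodes the power-of-two factor 2**(x - 127)
    # that cuBLAS applies to the corresponding block of elements.
    return np.exp2(scales.astype(np.float64) - 127)

factors = decode_ue8m0(np.array([127, 128, 126], dtype=np.uint8))
# factors: [1.0, 2.0, 0.5]
```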

NVFP4 operations (block_scaling=True, block scaling with FP4 operands):

  • Scale format: 1D tensor with layout matching cuBLAS NVFP4 requirements

  • Scale dtype: float8_e4m3fn (interpreted by cuBLAS as unsigned UE4M3 values, i.e. the sign bit is ignored).

  • FP4 only supports block scaling; block_scaling=False is not supported for FP4.
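How cuBLAS reads a float8_e4m3fn scale byte as an unsigned UE4M3 value can be sketched at the bit level in plain Python. ``decode_ue4m3`` is a hypothetical helper for illustration (the e4m3fn NaN encodings are ignored for brevity):

```python
def decode_ue4m3(byte: int) -> float:
    # E4M3 layout: 4 exponent bits (bias 7), 3 mantissa bits. Bit 7 is
    # the sign bit, which cuBLAS ignores here, so the decoded scale is
    # always non-negative.
    e = (byte >> 3) & 0xF
    m = byte & 0x7
    if e == 0:  # subnormal
        return (m / 8.0) * 2.0 ** -6
    return (1.0 + m / 8.0) * 2.0 ** (e - 7)

# The sign bit makes no difference: 0x38 and 0xB8 both decode to 1.0.
assert decode_ue4m3(0x38) == decode_ue4m3(0xB8) == 1.0
```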

Note

MXFP8 and NVFP4 use “block scaling” where each block of elements has its own scale factor, as opposed to FP8 which uses a single per-tensor scale. When block_scaling=True, tensor scales are required; scalar scales are not allowed.

Note

When scales are provided as tensors, they must come from the same package and reside in the same memory space (CPU or GPU device) as the operands of the matmul.

a#

Scale for matrix A.

Type:

int, float, or Tensor

b#

Scale for matrix B.

Type:

int, float, or Tensor

c#

Scale for matrix C.

Type:

int, float, or Tensor

d#

Scale for matrix D.

Type:

int, float, or Tensor

See also

Matmul, matmul()