MatmulOptions#

class nvmath.linalg.advanced.MatmulOptions(
    compute_type: int | None = None,
    scale_type: int | None = None,
    result_type: int | None = None,
    result_amax: bool = False,
    block_scaling: bool = False,
    sm_count_target: int | None = 0,
    fast_accumulation: bool | None = False,
    device_id: int | None = None,
    handle: int | None = None,
    logger: Logger | None = None,
    memory_limit: int | str | None = '80%',
    blocking: Literal[True, 'auto'] = 'auto',
    allocator: BaseCUDAMemoryManager | None = None,
)
A data class for providing options to the Matmul object and the wrapper function matmul().
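For illustration, a minimal sketch of constructing the options and passing them to the wrapper function, assuming CuPy operands on the GPU (the option values shown are simply the defaults):

```python
import cupy as cp
import nvmath

a = cp.random.rand(128, 256).astype(cp.float32)
b = cp.random.rand(256, 64).astype(cp.float32)

# Collect the options in a MatmulOptions instance and pass it to matmul().
options = nvmath.linalg.advanced.MatmulOptions(sm_count_target=0, blocking="auto")
result = nvmath.linalg.advanced.matmul(a, b, options=options)

print(result.shape)  # (128, 64)
```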
- compute_type#
CUDA compute type. A suitable compute type will be selected if not specified.
- Type: nvmath.linalg.ComputeType
- scale_type#
CUDA data type. A suitable data type consistent with the compute type will be selected if not specified.
- Type: nvmath.CudaDataType
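As a hedged illustration of the two options above, the sketch below requests TF32 compute with an FP32 scale type for FP32 operands; the enum members COMPUTE_32F_FAST_TF32 and CUDA_R_32F are assumed spellings of the available values, and any combination chosen must be consistent with the operand types:

```python
import cupy as cp
import nvmath

a = cp.random.rand(256, 256).astype(cp.float32)
b = cp.random.rand(256, 256).astype(cp.float32)

# Assumed enum members: request TF32 compute and an FP32 scale type.
options = nvmath.linalg.advanced.MatmulOptions(
    compute_type=nvmath.linalg.ComputeType.COMPUTE_32F_FAST_TF32,
    scale_type=nvmath.CudaDataType.CUDA_R_32F,
)
result = nvmath.linalg.advanced.matmul(a, b, options=options)
```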
- result_type#
CUDA data type. The requested data type of the result. If not specified, this type will be determined based on the input types. Non-default result types are only supported for narrow-precision (FP8 and lower) operations.
- Type: nvmath.CudaDataType
- result_amax#
If set, the absolute maximum (amax) of the result will be returned in the auxiliary output tensor. Only supported for narrow-precision (FP8 and lower) operations.
- Type: bool
- block_scaling#
If set, block scaling (MXFP8) will be used instead of tensor-wide scaling for FP8 operations. If the result is a narrow-precision (FP8 and lower) data type, the scales used for result quantization will be returned in the auxiliary output tensor as "d_out_scale" in UE8M0 format. For more information on the UE8M0 format, see the documentation of MatmulQuantizationScales. This option is only supported for narrow-precision (FP8 and lower) operations.
- Type: bool
- sm_count_target#
The number of SMs to use for execution. The default is 0, corresponding to all available SMs.
- Type: int | None
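For example, to leave some SMs free for concurrent work, a lower target can be requested; the value below is an arbitrary illustration, not a recommendation:

```python
import nvmath

# Arbitrary illustrative value; the default 0 means "use all available SMs".
options = nvmath.linalg.advanced.MatmulOptions(sm_count_target=100)
```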
- fast_accumulation#
Enable or disable FP8 fast accumulation mode. The default is False (disabled).
- Type: bool
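The narrow-precision options (result_type, result_amax, block_scaling, and fast_accumulation above) are only meaningful for FP8-capable hardware and FP8 operands. The sketch below merely shows how they might be combined when constructing the options; CUDA_R_8F_E4M3 is an assumed enum member for an FP8 result type.

```python
import nvmath

# Hedged sketch: meaningful only for FP8 (and lower) operands on supporting hardware.
fp8_options = nvmath.linalg.advanced.MatmulOptions(
    result_type=nvmath.CudaDataType.CUDA_R_8F_E4M3,  # assumed enum member for an FP8 result
    result_amax=True,        # also return the absolute maximum of the result
    block_scaling=True,      # MXFP8 block scaling instead of tensor-wide scaling
    fast_accumulation=True,  # FP8 fast accumulation mode
)
```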
- device_id#
CUDA device ordinal (used if the MM operands reside on the CPU). Device 0 will be used if not specified.
- Type: int | None
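When the operands are NumPy arrays on the CPU, device_id selects the GPU that performs the computation; the sketch below assumes a second device (ordinal 1) is visible.

```python
import numpy as np
import nvmath

a = np.random.rand(128, 256).astype(np.float32)
b = np.random.rand(256, 64).astype(np.float32)

# CPU operands: the computation runs on the selected device (assumes device 1 exists)
# and the result is returned as a NumPy array.
options = nvmath.linalg.advanced.MatmulOptions(device_id=1)
result = nvmath.linalg.advanced.matmul(a, b, options=options)
```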
- handle#
Linear algebra library handle. A handle will be created if one is not provided.
- Type: int | None
- logger#
Python Logger object. The root logger will be used if a logger object is not provided.
- Type: Logger | None
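To capture the library's log output, a standard Python logger can be configured and passed in, for example:

```python
import logging
import nvmath

# A standard Python logger; more verbose levels such as DEBUG show more detail.
logger = logging.getLogger("nvmath.matmul.demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

options = nvmath.linalg.advanced.MatmulOptions(logger=logger)
```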
- memory_limit#
Maximum memory available to the MM operation. It can be specified as a value (with optional suffix like K[iB], M[iB], G[iB]) or as a percentage. The default is 80% of the device memory.
- Type: int | str | None
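The limit can be expressed either as an absolute size with a suffix or as a percentage; the values below are chosen purely for illustration (the exact accepted spellings follow the description above).

```python
import nvmath

# Cap the workspace at an absolute size ...
opts_abs = nvmath.linalg.advanced.MatmulOptions(memory_limit="4 GiB")
# ... or at a fraction of the device memory.
opts_pct = nvmath.linalg.advanced.MatmulOptions(memory_limit="50%")
```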
- blocking#
A flag specifying the behavior of the execution functions and methods, such as matmul() and Matmul.execute(). When blocking is True, the execution methods do not return until the operation is complete. When blocking is "auto", the methods return immediately when the inputs are on the GPU. The execution methods always block when the operands are on the CPU to ensure that the user doesn't inadvertently use the result before it becomes available. The default is "auto".
- Type: Literal[True, 'auto']
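When the operands are on the GPU and blocking is 'auto', execution is asynchronous with respect to the host; passing blocking=True makes the call wait for completion instead. A sketch assuming CuPy operands:

```python
import cupy as cp
import nvmath

a = cp.random.rand(512, 512).astype(cp.float32)
b = cp.random.rand(512, 512).astype(cp.float32)

# Force the call to block until the result is ready, even for GPU operands.
options = nvmath.linalg.advanced.MatmulOptions(blocking=True)
result = nvmath.linalg.advanced.matmul(a, b, options=options)
```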
- allocator#
An object that supports the BaseCUDAMemoryManager protocol, used to draw device memory. If an allocator is not provided, a memory allocator from the library package will be used (torch.cuda.caching_allocator_alloc() for PyTorch operands, cupy.cuda.alloc() otherwise).
- Type: BaseCUDAMemoryManager | None