nvmath.linalg.advanced.MatmulOptions

class nvmath.linalg.advanced.MatmulOptions(compute_type: int | None = None, scale_type: int | None = None, sm_count_target: int | None = 0, fast_accumulation: bool | None = False, device_id: int | None = None, handle: int | None = None, logger: Logger | None = None, memory_limit: int | str | None = '80%', blocking: Literal[True, 'auto'] = 'auto', allocator: BaseCUDAMemoryManager | None = None)

A data class for providing options to the Matmul object and the wrapper function matmul().
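A minimal sketch of constructing the options object, using only keyword names and value forms that appear in the signature above. The specific values (a 2 GiB limit, device 0) are illustrative, and the import guard is there only so the snippet also runs on a machine without nvmath-python installed.

```python
# Keyword arguments mirroring the MatmulOptions signature.
# The values are illustrative choices, not recommendations.
option_kwargs = {
    "memory_limit": "2GiB",  # a value with a suffix; a percentage such as "80%" is also accepted
    "blocking": True,        # execution methods return only when the operation is complete
    "device_id": 0,          # used when the operands reside on the CPU
}

try:
    from nvmath.linalg.advanced import MatmulOptions
except ImportError:
    MatmulOptions = None  # nvmath-python is not installed in this environment

if MatmulOptions is not None:
    options = MatmulOptions(**option_kwargs)
    print(options)
```

The resulting object is what gets handed to Matmul or matmul(); how those accept it is described in their own reference entries.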

compute_type

CUDA compute type. A suitable compute type will be selected if not specified.

Type:

nvmath.linalg.ComputeType

scale_type

CUDA data type for the scale factors (alpha and beta). A suitable data type consistent with the compute type will be selected if not specified.

Type:

nvmath.CudaDataType

sm_count_target

The number of SMs to use for execution. The default is 0, corresponding to all available SMs.

Type:

int

fast_accumulation

Enable or disable FP8 fast accumulation mode. The default is False (disabled).

Type:

bool

device_id

CUDA device ordinal, used only if the matrix multiplication (MM) operands reside on the CPU. Device 0 will be used if not specified.

Type:

int | None

handle

Linear algebra library handle. A handle will be created if one is not provided.

Type:

int | None

logger

Python Logger object. The root logger will be used if a logger object is not provided.

Type:

logging.Logger
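A dedicated logger can be prepared with the standard logging module before it is supplied as the logger option. The sketch below uses only the Python standard library; the logger name "matmul_demo" and the in-memory stream are hypothetical choices made so that the example is self-contained.

```python
import io
import logging

# Capture log records in memory; a FileHandler or a StreamHandler on
# sys.stderr would be more typical in practice.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

# "matmul_demo" is a hypothetical name chosen for this sketch.
matmul_logger = logging.getLogger("matmul_demo")
matmul_logger.setLevel(logging.DEBUG)
matmul_logger.addHandler(handler)
matmul_logger.propagate = False  # keep these records out of the root logger

matmul_logger.debug("logger configured")
print(log_stream.getvalue(), end="")

# The logger would then be supplied as MatmulOptions(logger=matmul_logger).
```

Using a separate logger rather than the root logger keeps the library's messages apart from the rest of the application's output.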

memory_limit

Maximum memory available to the MM operation. It can be specified as a value (with optional suffix like K[iB], M[iB], G[iB]) or as a percentage. The default is 80% of the device memory.

Type:

int | str | None
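To make the accepted forms concrete, the sketch below converts a memory_limit value into a byte count. The helper memory_limit_to_bytes is hypothetical and is not the library's parser. It assumes that a bare K, M, or G is a decimal multiplier (powers of 1000) and that KiB, MiB, or GiB is a binary multiplier (powers of 1024); the text above does not state this, so treat it as an assumption of the sketch.

```python
import re

# Hypothetical helper, not part of nvmath. Matches e.g. "500M", "2GiB", "1.5K", "4096".
_SIZE_RE = re.compile(
    r"^(?P<value>\d+(?:\.\d+)?)\s*(?:(?P<prefix>[KMG])(?P<binary>iB)?)?$"
)

def memory_limit_to_bytes(limit, device_total_bytes):
    """Interpret an int, a percentage string, or a suffixed size string."""
    if isinstance(limit, int):
        return limit  # already a byte count
    text = limit.strip()
    if text.endswith("%"):
        percent = float(text[:-1])
        return int(device_total_bytes * percent / 100)
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized memory limit: {limit!r}")
    # Assumption: "iB" selects binary units, otherwise decimal units.
    base = 1024 if match["binary"] else 1000
    exponent = {None: 0, "K": 1, "M": 2, "G": 3}[match["prefix"]]
    return int(float(match["value"]) * base ** exponent)

total = 16 * 1024**3  # pretend the device has 16 GiB of memory
print(memory_limit_to_bytes("50%", total))
print(memory_limit_to_bytes("2GiB", total))
print(memory_limit_to_bytes("500M", total))
```

A percentage is resolved against the memory of the device in use, so the same option value yields different byte limits on different GPUs, whereas a suffixed value is absolute.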

blocking

A flag specifying the behavior of the execution functions and methods, such as matmul() and Matmul.execute(). When blocking is True, the execution methods do not return until the operation is complete. When blocking is "auto", the methods return immediately if the operands are on the GPU. If the operands are on the CPU, the execution methods always block, so that the user doesn’t inadvertently use the result before it becomes available. The default is "auto".

Type:

Literal[True, 'auto']
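The blocking rule described above reduces to a small truth table. The function execution_blocks below is a hypothetical illustration of that rule and is not part of the library; it only restates which combinations cause the execution methods to wait.

```python
def execution_blocks(blocking, operands_on_gpu):
    """Return True if an execution method would wait for completion.

    Hypothetical restatement of the documented rule, not library code.
    """
    if blocking is True:
        return True  # always wait, wherever the operands live
    if blocking == "auto":
        # GPU operands: return immediately. CPU operands: always wait.
        return not operands_on_gpu
    raise ValueError("blocking must be True or 'auto'")

for flag in (True, "auto"):
    for on_gpu in (True, False):
        print(flag, on_gpu, execution_blocks(flag, on_gpu))
```

The only non-blocking combination is "auto" with GPU operands, in which case the caller is responsible for synchronizing before reading the result.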

allocator

An object that supports the BaseCUDAMemoryManager protocol, used to draw device memory. If an allocator is not provided, a memory allocator from the library package will be used (torch.cuda.caching_allocator_alloc() for PyTorch operands, cupy.cuda.alloc() otherwise).

Type:

nvmath.memory.BaseCUDAMemoryManager | None
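As a rough sketch of a custom allocator, the wrapper below records each requested size and delegates the allocation to another manager. It assumes that the BaseCUDAMemoryManager protocol consists of a memalloc(size) method, which the description above does not spell out. CountingAllocator and _FakeDeviceAllocator are hypothetical names; the fake allocator returns host memory so that the example runs without a GPU, and it is not a valid allocator for the library.

```python
class CountingAllocator:
    """Hypothetical wrapper: records request sizes, then delegates."""

    def __init__(self, inner):
        self.inner = inner          # the manager that actually allocates
        self.requested_sizes = []   # sizes in bytes, in request order

    def memalloc(self, size):
        # Assumption: the protocol requires a method with this name and argument.
        self.requested_sizes.append(size)
        return self.inner.memalloc(size)


class _FakeDeviceAllocator:
    """Stand-in for a real device allocator, for demonstration only."""

    def memalloc(self, size):
        return bytearray(size)  # host memory, NOT device memory


allocator = CountingAllocator(_FakeDeviceAllocator())
buffer = allocator.memalloc(1024)
print(allocator.requested_sizes, len(buffer))
```

In real use the inner manager would be one that returns device memory, and the wrapper would be supplied as the allocator option.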

See also

Matmul, matmul()