The nvmath.device module is experimental and potentially subject to future changes.

LQFactorize#

class nvmath.device.LQFactorize( size: Sequence[int], precision: type[floating], execution: str, *, sm=None, arrangement: str | None = None, batches_per_block: int | Literal['suggested'] | None = None, data_type: str | None = None, leading_dimension: int | None = None, block_dim: Sequence[int] | Literal['suggested'] | None = None, )[source]#

A class that encapsulates LQ orthogonal factorization device function for general matrices using Householder reflections.

Available operation:

factorize: Computes the LQ factorization A = L @ Q, where L is a lower triangular matrix (if M <= N) or lower trapezoidal matrix (if M > N) and Q is a unitary N x N matrix.

The factorization uses Householder reflection transformations and does not explicitly form the unitary matrix Q. Instead, Q is represented as a product of Householder vectors stored in the input matrix A along with the tau array.

Memory Layout Requirements:

Matrices must be stored in shared memory according to their arrangement and leading dimension (ld):

For matrix A (M x N):

Column-major arrangement: Matrix shape (batches_per_block, M, N) with strides (lda * N, 1, lda)
Row-major arrangement: Matrix shape (batches_per_block, M, N) with strides (lda * M, lda, 1)

Note

The LQ factorization is essentially equivalent to the QR factorization of A^T (or A^H for complex types).

Parameters:

size (Sequence[int]) – Problem size specified as a sequence of 1 to 3 elements: (M,) (treated as (M, M, 1)), (M, N) (treated as (M, N, 1)), or (M, N, K). M and N represent the dimensions of the matrix A used in factorization. K is ignored if specified.
precision (type[np.floating]) – The computation precision specified as a numpy float dtype. Currently supports: numpy.float32, numpy.float64.
execution (str) – A string specifying the execution method. Supported values: 'Block'.
sm (ComputeCapability) – Target mathdx compute-capability.
arrangement (str, optional) – Storage layout for matrix A. Can be one of: 'col_major', 'row_major'. Defaults to 'col_major'.
batches_per_block (int | Literal["suggested"], optional) – Number of batches to compute in parallel in a single CUDA block. Can be a positive integer or the string 'suggested' for automatic selection of an optimal value. We recommend using 1 for matrix A size larger than or equal to 16 x 16, and using 'suggested' for smaller sizes to achieve optimal performance. Defaults to 1.
data_type (str, optional) – The data type of the input matrices, can be one of: 'real', 'complex'. Defaults to 'real'.
leading_dimension (int, optional) – The leading dimension for input matrix A, or None. If not provided, it will be automatically deduced from size and arrangement. Note: When provided in the constructor, leading dimensions are set at compile-time. To use runtime leading dimensions (avoiding recompilation for different leading dimensions), provide the leading dimension parameters directly to the device methods instead.
block_dim (Sequence[int] | Literal["suggested"], optional) – The block dimension for launching the CUDA kernel, specified as a 1 to 3 integer sequence (x, y, z) where missing dimensions are assumed to be 1. Can be a sequence of 1 to 3 positive integers, the string 'suggested' for optimal value selection, or None for the default value.